Langs, tech, stuff

Saturday, 5 November 2022

Virtual threads in JVM vs FP, How to keep up with learning, Netflix likes FreeBSD

I went through several conference talks recently, and some of them may be interesting for you.

Virtual threads - while in the Scala world we may be deep-diving into the beauty of FP styles, Project ~~Fear~~Loom is making its way to prod on the JVM, with JDK 19 shipping it as a preview feature. You may start thinking about the need to unlearn current prod-quality coding approaches like thread pools, and start questioning what exactly the value of FP styles is. There is still a lot of value - composability and avoiding race conditions. But why try so hard to be efficient with a few valuable OS threads and event loops, like those in cats-effect IO powering fs2, when you can be just completely inefficient in imperative style? https://devoxx.be/talk/?id=57405

What could be the next step in your career? How to keep up with the fast-moving technological landscape? Some nice hints and tips - https://devoxx.be/talk/?id=2252

For those who love to congest all available network capacity - FreeBSD could be your platform of choice, and you have probably used it already if you watched Netflix. It was also my platform of choice ~20 years ago for a very heavily used ftp/http server, with non-blocking IO and traffic shaping to keep multiple connections alive and not overload the uplink https://www.youtube.com/watch?v=36qZYL5RlgY Btw, who knows these days the dopamine rush when you discovered and logged in to some new public FTP server - that was like discovering an unknown cave system or a good restaurant with free tables on a Friday evening, the true feeling of internet exploration.

Tuesday, 7 April 2020

Bash, fish-shell, Zsh - in the pursuit of command line happiness

The years go by, but the command line is not disappearing. Here are a few productivity tips I found useful for myself.

Shell - bash, fish, zsh, tcsh?

Bash

Bash is a good default. Bash is Turing complete. My one outstanding issue with bash is that it has problems when editing commands longer than a single line. I have spent years using bash and I am still not sure about some constructs there. The ifs and ors. The crucial `set -x`. There used to be conference talks where people would argue how it is more productive and maintainable, and now even more carbon neutral, to write longer scripts in Perl than in Bash. Or Python. Or in Ruby. But people love to challenge themselves with longer bash code. Whatever.

Fish-shell

Fish looks fancy, fish looks smart, fish feels cool. At least at first. I tried to use it, but then I realised it is too incompatible with bash - I was using some AWS commands and I couldn't get the `source` command working with them. To the point that it was blocking me from doing what I needed to do. That was 2 years ago, so maybe something has changed since. Anyway, I swam away from that fish.

Zsh

I tried it once years ago, around the same time I tried csh (so I could write shell scripts more easily). I think I skipped it because I was trying to get some uniformity in my team, and bash looked more like "it is the default and it works". Years later I tried it again. And again. At first I couldn't get used to having one history across all the different tabs and windows. I don't think I like it that way, but I got used to it. I added oh-my-zsh and the powerlevel10k theme. I learned how to search history with regexps. There was still something missing. Then I learned from my colleague Hung (thanks!) that I can have a fuzzy search of command history on top of it. That's quite nice. Usable. Nice prompts with the current git repo branch and status, command execution time, current kubernetes cluster.

~/w/dotshellapps  master ?2 ▓▒░ sleep 3
~/w/dotshellapps  master ?2 ▓▒░ ░▒▓ ✔  3s  anaconda3   dev ⎈  21:14:41

I am not sure I need anaconda there. I may fix it later. Yes, the right side is probably too long, but it disappears when you type. So it is not that annoying and my console screen is much wider. Ok, I may fix it one day, I promise.

Wednesday, 21 November 2018

Scala, Play and json - no boilerplate with Circe vs play-json, but is it faster?

If you don't like typing things about your case classes that the Scala compiler can do for you, think about the json library Circe. You may like to read something about Circe first, like this reddit post, or perhaps just make up your mind by jumping into the code and seeing how these things work.

If you would like to try Circe in your project, all you need is play-circe and to extend your controller with one more trait.

I ran simple benchmarks to see if there is any significant difference. These cases are probably too simple, but I cannot say either of them is faster.

Feel free to extend these cases: https://github.com/codesurf42/scalaPlayJsonCirce

Monday, 12 December 2016

7 challenges and myths about going microservices

1) This is a cultural change in organisation

True. Microservices are a mental change in how we do software. There are going to be many smaller projects, a layer of network communication, releasing and versioning issues. If you have a team of people who never wrote any automation scripts, never released their own projects to production environments, never did any CI/CD - it will be really hard. You can expect a lot of resistance; learning curves will vary depending on personal ops experience, collaboration in the team and willingness to learn a new, different approach.

2) Microservices are easier to develop

Yes and no. It is easier to develop such services on their own - projects are smaller, it is easier to jump in and understand the business logic inside a single project, fix a bug, upgrade libraries. On the other hand - you need well-functioning deployment pipelines and good inter-service communication patterns to get all the things going. The power of microservices comes from operating together - but this is also a big challenge in itself, as networks can fail, messages can be dropped, responses can randomly time out, testing is different and may feel harder, teams may not communicate well and so on. You need to decide on communication (sync/async), develop patterns for failing gracefully, handling errors, startup/initialisation and so on.

3) You need devops culture

Yes, true. For a working solution, every development team should be able to test and deploy. Deployments should be simple and easy to repeat. Think about one-click deployment pipelines. Teams should be cross-functional. Operational excellence and an understanding of devops and continuous integration principles also come from understanding production environments. Monitoring, metrics, logging, tracing requests, dumping messages. Troubleshooting issues at 3am, reverting to previous versions, fixing corrupted data - all these things help you to develop a culture of operational excellence. "Bad behavior arises when you abstract people away from the consequences of their actions"

4) You need to have a good TDD/BDD

Not per se, but it helps a lot. Technically you don't need to have automated tests to prove your software works, but you know it helps a lot to move faster, make important changes safer and it is usually cheaper when you can quickly prove new version doesn't break existing functionality.

5) You need continuous delivery

A simple, working solution to push new changes easily helps a lot. Actually it is more like a cause-and-effect loop - a microservice project is small so it is faster to change it and deploy more frequently than a big, monolithic app. It has to be easy and every developer should be able to release changes.

6) You should (not) use REST

It depends. Communication patterns like REST tend to be simple and well-known to start with, but in more complex environments they can be terrible in terms of error-handling, decoupling and fail-over. With synchronous patterns like REST, one service depends on the current availability of another, even if the job flow doesn't really require it. That makes e.g. failing gracefully during a partial outage much harder. Async communication, on the other hand, has a higher initial cost, and it can also be harder to follow such execution paths in the code.
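To make "failing gracefully with partial outage" a bit more concrete, here is a minimal sketch using plain Scala Futures. The service and method names (`recommendations`, `page`) are hypothetical, and the downstream call is hard-wired to fail to simulate the outage; a real setup would also need timeouts and retries.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.control.NonFatal

object FallbackSketch {
  // Hypothetical downstream call; it always fails here to simulate a partial outage.
  def recommendations(userId: Int): Future[List[String]] =
    Future.failed(new RuntimeException("recommendation service unavailable"))

  // The page is still rendered when the optional dependency is down:
  // the failure is replaced by an empty list instead of failing the whole request.
  def page(userId: Int): Future[String] =
    recommendations(userId)
      .recover { case NonFatal(_) => Nil }
      .map(recs => s"user=$userId recs=${recs.mkString(",")}")
}
```

The caller gets a degraded but valid response, which is the behaviour that is hard to get when every hop is a blocking call that simply propagates errors.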

7) You need to use containers like docker

What you really need is manageable, versioned, repeatable environments where you can easily deploy your services. You can achieve it in many ways - good orchestration tools like ansible/salt/chef/puppet or whatever is hot and tasty today.
While technically it is possible to build images/machines every time, caching some pre-built stages as e.g. docker images and adding only your changed service may speed up this process a lot.
You need a quick release cycle, so if docker helps you to achieve it - go for it.

Sunday, 10 April 2016

Graph database Neo4j behind Panama Leaks

The graph database Neo4j was at the centre of the toolbox used to power investigative journalism in what seems to be the biggest financial leak in history.

"Through its Data & Research unit, ICIJ provided the data analysis expertise to make the documents exploitable by reporters. They extracted the metadata of documents using Apache Solr and Tika, then connected all the information together using the leaked databases, creating a graph of nodes and edges. The data was stored in the Neo4j graph database edited by our partner Neo Technology. The result provides unique insights into the offshore banking world showing the relationships between banks, clients, offshore companies and their lawyers."

https://linkurio.us/panama-papers-how-linkurious-enables-icij-to-investigate-the-massive-mossack-fonseca-leaks/

http://neo4j.com/news/neo4j-powers-panama-papers-investigation/

Sunday, 14 February 2016

Keep your SBT projects' dependencies up to date

There is a simple SBT plugin to check if there are newer versions of the libraries used by SBT projects.

https://github.com/rtimush/sbt-updates

You may add it to your project (why not?), but you can also keep it separately, especially when you quickly want to check several projects.

Create a file:

~/.sbt/0.13/plugins/sbt-updates.sbt

and add such content there:

addSbtPlugin("com.timushev.sbt" % "sbt-updates" % "0.1.10")

After that you can run

sbt dependencyUpdates

or just type "sbt" and later, in the console, use keyword expansion to save on typing...

> dependencyUpdates
[info] Found 4 dependency updates for minimal-akka-scala-seed
[info]   com.typesafe.akka:akka-actor        : 2.3.11 -> 2.3.14 -> 2.4.1
[info]   com.typesafe.akka:akka-testkit:test : 2.3.11 -> 2.3.14 -> 2.4.1
[info]   org.scala-lang:scala-library        : 2.11.6 -> 2.11.7
[info]   org.scalatest:scalatest:test        : 2.2.4 -> 2.2.6
[success] Total time: 1 s, completed 14-Feb-2016 00:27:55

Friday, 4 September 2015

Highly connected data are better off with Neo4j

Replacing SQL by Neo4j in a new release of project management tool simplifies business logic, adds time-dependent data management, and gives room for new features like connections-in-time thanks to a better expressiveness and ease of visualization of highly-connected data in a graph database.

Problem:

Our client provides sophisticated tools to manage multiple projects in big organisations. This software stack was going through a lot of transformation and rapid development to cover new functionality, but using a relational database was becoming more of an issue than a help. It is not that SQL databases are wrong, it is not even about SQL - the thing is that SQL delivers a lot of functionality, like constrained schema and types, which is not essential to such a data model, while lacking simplicity when data is more connected – and this is where our client’s business has value: in exploring, modifying and comparing these connections. Technically, it means more and more data processing is done by application code, in memory, with the database downgraded to simple “permanent serialized storage”.

While this is a usual path when scaling up - taking more and more processing from the database and distributing it across application nodes - it also brings a lot of ugliness. The complex data model in application logic is far from the simple structures stored in the database, and it is hard to validate or cross-check data consistency. Any ad-hoc business analytics requires writing yet another application feature (no simple database queries) – that drives up application complexity, and we end up not only with more expensive development of new functionality, but also with higher bug-fixing/quality/maintenance costs.

Evaluation and rapid prototyping

The first phase was to evaluate how such data could be expressed in a better form. At that time there were 2-3 graph database options available on the market (OrientDB and perhaps Titan), but for prototyping we chose only Neo4j. Neo4j was the most mature solution for this type and size of data, the licensing was OK, coverage in language drivers, API and docs was good, and the community was quite large and active. Its Cypher query language is somewhat similar to SQL, but with a lot of functionality for connected data. All that promised we could run this project within acceptable risk levels and develop new features when needed. We already had some previous experience with Neo4j, so it was easy to start with.

For rapid prototyping and SQL to Neo4j conversion, we decided to write the ETL in Ruby with Neography. The dynamic nature of this language helped a lot with quick changes to how the data model should be transformed and stored. The goal was to have a flexible tool that can convert existing SQL databases, perhaps with some variations in the specific transformation rules depending on different data sets and customers. Neography is a REST client for Neo4j, well documented and in active development, so it was quick to start and finish with. Speed was less essential for this task than proper data conversion, but there is still room for improvement, like bigger transaction blocks (batches) between commits.

With the data converted to Neo4j, the next step was to start analyzing it with Cypher. Neo4j comes with a really nice graphical data browser, and it was getting better and better between releases. It is worth mentioning, as a command-line shell is far from enough when dealing not only with nodes, which are simple to show as an equivalent of a table or key-value storage, but also with connections between nodes, which can have their own sets of properties and directions.

In Cypher - the graph query language used in Neo4j, we describe what to select as a group of nodes and how these nodes are connected. Its crucial part of syntax actually looks simple and quite natural:

(node1)-[:connection]->(other_node)

and could be written in a way similar to SQL:

MATCH (node1)-[:connection]->(other_node)
WHERE node1.name="Foo" and other_node.name="BAR"
RETURN node1, other_node

but can grow to express e.g. what we want to do with returned paths, sum values or aggregate returned nodes.

Time-machine with immutable data

The business data in storage expresses the graph of connections and the progress of executed projects at this moment in time. There was already functionality to keep previous changes of “node” values, but nothing like this existed for the connections between them. One of the requests for the new design was to cover immutability of the whole data structure - e.g. someone would like to know what the state of the overall project looked like exactly 24 hours or 3 months ago, who changed what, or do some aggregation over periods of time to see how the progress is going. The answer was to separate mutable changes from immutable “cores”, time-stamping all the changes - where the change could be in a typical data row, but also in a connection.

The idea is quite simple, as shown in this example graph:

P1 and P2 are “immutable cores of data entities”. P1 got 4 changes, so they are stored as a sequence of states S1-S4. P2 got two changes, S1-S2. At some point in time, the properties of relation r1 connecting P1 and P2 changed, creating another relation r2.
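The same idea can be sketched in a few lines of Scala, independent of Neo4j. All names here (`State`, `Core`, `Relation`, `stateAt`, `relationAt`) are hypothetical and only illustrate the model; timestamps are plain epoch milliseconds to keep it small.

```scala
object TimeMachineSketch {
  // One time-stamped change of an entity's properties (S1, S2, ...).
  final case class State(changedAt: Long, props: Map[String, String])

  // Immutable core (P1, P2, ...) with the full sequence of its states.
  final case class Core(id: String, states: Vector[State]) {
    // The state that was current at time `t`: the latest change made at or before `t`.
    def stateAt(t: Long): Option[State] =
      states.filter(_.changedAt <= t).sortBy(_.changedAt).lastOption
  }

  // A time-stamped version of a connection between two cores (r1, r2, ...).
  final case class Relation(from: String, to: String, changedAt: Long, props: Map[String, String])

  // The version of the connection that was valid at time `t`, if any.
  def relationAt(history: Seq[Relation], from: String, to: String, t: Long): Option[Relation] =
    history
      .filter(r => r.from == from && r.to == to && r.changedAt <= t)
      .sortBy(_.changedAt)
      .lastOption
}
```

Nothing is ever overwritten: asking "how did it look 24 hours ago" is just a lookup with an older timestamp.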

Indexing and constraints

Some of the questions that we needed to consider while modelling the new structures were: how are we going to use this data, how are we going to query it, and how can the database help us keep it healthy, unique or just quick to find? Neo4j has indexes that work on data from nodes - they take a node label and a node property, and can also be constrained to unique values - similar to what we know from other databases. There are also “legacy” indexes, a bit lower-level and closer to the Lucene engine, which give us the possibility to index practically any property in the graph. Just by using a proper indexing scheme on our data, even on small data sets, we could reduce query times 10-fold. Newer versions of Neo4j also come with a profiler and query-explain syntax to see how the query could be optimised by the Neo engine, which gives us feedback on how to rebuild queries for the expected data.

Delivering new Scala API

Our software stack is split into separate parts, with the UI written in Html5/JS/Angular and backend APIs written in Scala/Play/Akka. The goal was to deliver a new backend while keeping UI API calls mostly intact. As the data source and “query engine” changed drastically, it also meant some additional fiddling with the returned data to get it in exactly the same shape as before. Additional complexity came from the fact that it was already mimicking a previous version with more SOAP-like queries, while one of the sub-goals was to introduce new, simple REST calls.

While developing this stack, the helpful parts of the Scala toolbox were type aliases (to better express already used compound structures), packing simple but specific data values into their own case classes and, when the processing model was changing quickly, using implicit conversions between types to minimize the impact of changes on the existing code in the first step and deliver new functionality without too much refactoring.
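A minimal sketch of these three techniques together could look like this. The domain names (`ProjectId`, `Progress`, `ProgressByProject`) are made up for illustration and are not taken from the project.

```scala
import scala.language.implicitConversions

object TypesSketch {
  // Simple but specific values packed into their own case classes.
  final case class ProjectId(value: Long)
  final case class Progress(percent: Int)

  // A type alias gives a readable name to an already used compound structure.
  type ProgressByProject = Map[ProjectId, Progress]

  // Implicit conversion: existing call sites that still pass raw Long ids keep compiling.
  implicit def longToProjectId(raw: Long): ProjectId = ProjectId(raw)

  def progressOf(data: ProgressByProject, id: ProjectId): Option[Progress] = data.get(id)

  val sample: ProgressByProject = Map(ProjectId(1L) -> Progress(40))

  // Legacy-style call with a raw Long; the compiler inserts longToProjectId(1L).
  val legacyLookup: Option[Progress] = progressOf(sample, 1L)
}
```

The conversion is a stop-gap: it keeps old code working in the first step, and can be deleted once call sites are migrated to the wrapper type.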

Another story is about Scala standard collections, folds, and options for optional/partial data. Parts of the Scalaz library were also very helpful, giving simple operations for transforming and merging complex monoid-like data structures. I hope to write another blog post focusing just on these techniques - while such Scalaz-monoid functionality can be found in libraries for other languages too, implicit type conversions managed in the code are not that common.

A crucial part, developed from the very beginning, were integration tests written in BDD style, covering the quickly growing complexity of the new engine and its possible operations. On top of that, another layer of functional tests covered the http API calls and JSON structures/transformations, with the help of Scalatest.
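To give a taste of what "merging monoid-like data structures" means, here is a hand-rolled sketch without Scalaz. The `Monoid` trait and `mergeAll` below are my own minimal stand-ins, not the Scalaz API; Scalaz provides the equivalent instances and operators out of the box.

```scala
object MonoidSketch {
  // A monoid: an "empty" value plus an associative way to combine two values.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  implicit val intSum: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Maps merge key by key, combining values with the monoid of the value type,
  // so nested maps merge recursively for free.
  implicit def mapMonoid[K, V](implicit v: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def empty: Map[K, V] = Map.empty
      def combine(x: Map[K, V], y: Map[K, V]): Map[K, V] =
        y.foldLeft(x) { case (acc, (k, value)) =>
          acc.updated(k, v.combine(acc.getOrElse(k, v.empty), value))
        }
    }

  def mergeAll[A](items: Seq[A])(implicit m: Monoid[A]): A =
    items.foldLeft(m.empty)(m.combine)

  // Example: per-project counters merged from three partial results.
  val merged: Map[String, Map[String, Int]] = mergeAll(Seq(
    Map("a" -> Map("x" -> 1)),
    Map("a" -> Map("x" -> 2, "y" -> 3)),
    Map("b" -> Map("x" -> 5))
  ))
}
```

Here `merged` ends up as `Map("a" -> Map("x" -> 3, "y" -> 3), "b" -> Map("x" -> 5))`, with no hand-written merging code for the nested level.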

Scala and Neo4j

The Scala-Neo4j space does not give much room when picking the best library, even if some parts of Neo4j are written in Scala. Using native Java connectivity feels like too much boilerplate. The FaKod library was an interesting option with its own DSL, but it also forces you to learn “another version of Cypher”, which could lead to a game of “how could I possibly write such a Cypher statement in my Cypher-like narrowed DSL”, so well-known from some SQL/ORM-like libraries. The suitable choice seemed to be AnormCypher, which allows executing any kind of Cypher query, but also requires careful quoting and parsing of the returned data. It is also worth mentioning the need for constant awareness of Scala-Java conversions in collection types, as they can sometimes be quite tedious to spot when “leaking” to other parts of the code with quite surprising error messages.

AnormCypher actually has one processing drawback that led to problems with the heap - all data read by the REST client goes through a costly in-memory transformation before being given to the application. For some queries, even when the whole database was around 15MB in size, the query response data could grow 10-fold, with another 10-fold to process it. I hope to find some time to help fix it, as it was quite annoying to see the JVM breaking with 3-4 GB assigned to the memory pool while processing so little in terms of data size.

But processing such data is a game with many goals - there is not much business value in perfectly valid and complex queries if the execution and processing times fall far short of expectations. The actual solution for both the heap and the speed problems was to go with a hybrid data model by adding a caching layer - some data is taken from Neo4j, but some additional operations are optimised/filled in by using data from the cache.
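The post does not show the caching layer itself, so the following is only a generic read-through cache sketch of the idea, with hypothetical names (`ReadThrough`, `load`); the `load` function stands in for an expensive Neo4j query.

```scala
import scala.collection.concurrent.TrieMap

object CacheSketch {
  // Read-through cache: ask the cache first, fall back to `load` (e.g. a graph query) on a miss.
  final class ReadThrough[K, V](load: K => V) {
    private val cache = TrieMap.empty[K, V]

    def get(key: K): V = cache.getOrElseUpdate(key, load(key))

    // Must be called when the underlying data changes, otherwise stale values are served.
    def invalidate(key: K): Unit = {
      cache.remove(key)
      ()
    }
  }
}
```

The hard part in practice is not the cache but deciding what to invalidate when a node or connection changes.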

Room for further improvements

There are many places in the data model and design that can be improved further, especially once we learn how the data will be used, and where and how quickly it will grow. Perhaps not all changes should be kept - from a time perspective, we care more about current than historical data, so old changes could be partially flattened into time periods, with the rest erased to keep our storage fit.

Overall experience with Neo4j

Neo4j is a tool that is quick and easy to start with, but complex in its analytical possibilities. It can drastically simplify business logic when dealing with more connected data, giving quick tools to “see” and “touch” it. It opens a new area for exploring existing data, “a higher kind of abstraction”, giving a feeling similar to comparing flat files with a full SQL database and SQL queries - or to putting files modified by a team of people into a VCS, so we can easily manage and see all changes. It is simple to write Cypher queries to prototype and develop new reports, try some new analytical things, get fresh insights. There is a hidden complexity that can sporadically appear – the Neo4j approach to graph structure, graph indexing, queries with optional matches, or the differences between REST and a write-your-own server extension. There are also some “grey” places, like the Neo4j simple backup, which actually creates files that cannot be directly consumed by the Neo4j import tool. On the other hand - through easier manipulation, navigation and visualisation, Neo4j adds new value to existing data.

Perhaps it can simplify your project-of-connected-data too?

Wednesday, 8 July 2015

Scala: Unicode, comparing combined characters, normalization

In Unicode, you can get some characters either as a single code point or by combining a base character with a combining mark. For example, in Scala:

val c1 = "Idee\u0301"
val c2 = "Ide\u00e9"

c1: String = Ideé
c2: String = Ideé

however

c1 == c2
res0: Boolean = false

so, to the rescue, we need to normalize them first:

import java.text.Normalizer
val n1 = Normalizer.normalize(c1, Normalizer.Form.NFC)
val n2 = Normalizer.normalize(c2, Normalizer.Form.NFC)

n1 == n2

res1: Boolean = true

Simple!
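If you do this often, the comparison can be wrapped in a small helper. This is just a sketch: the name `sameText` is made up, and NFC is one reasonable choice (NFD would work equally well, as long as both sides use the same form).

```scala
import java.text.Normalizer
import java.text.Normalizer.Form

object UnicodeSketch {
  // Compare two strings after bringing both to the same normalization form (NFC).
  def sameText(a: String, b: String): Boolean =
    Normalizer.normalize(a, Form.NFC) == Normalizer.normalize(b, Form.NFC)

  // The combined form is one char longer than the precomposed one before normalization.
  val combinedLength: Int = "Idee\u0301".length    // e + combining acute accent
  val precomposedLength: Int = "Ide\u00e9".length  // single precomposed character
}
```

The length difference (5 vs 4 chars) is also why naive `length` or `substring` on non-normalized input can surprise you.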

Wednesday, 1 April 2015

Scala-Play filter that just dumps everything (body too)

The key to getting the body from the stream is a proper Iteratee; as a bonus, there is also URI-decoding and parsing of any data.

Sunday, 16 November 2014

Scala/Play with sbt on IBM Bluemix (Heroku-like app cloud / PaaS)

0) IBM Bluemix is an app-cloud, similar to Heroku, deployed with CloudFoundry on their own cloud infrastructure.

1) It works.

2) Just make a simple shell script that is going to call sbt/activator. Example https://hub.jazz.net/project/dev8661/BluePlayScala/overview
No need for gradle/exec builder wrapper.

3) Do not forget to include manifest.yml in your project. Yes, Bluemix could be more verbose about the fact that this file is missing / that it has no idea what is where and how it should be deployed.

4) Check dashboard logs - it doesn't help to starve your Scala app with too little memory (mine got OK when I increased mem from 128 to 512, even if 512 was already the upper limit when started with 128).

5) I like pipelines - build/CI infrastructure. This is how you can pack complex tasks into one clean overview.

6) Thanks for all the cooperation during the last Krk Scala hack-meetup - it was worth attending and "getting something done".

Thursday, 19 June 2014

Parallel XML processing with Scala & Akka - 3x faster!

I wanted to check how easy or hard it is to optimise XML processing with a parallel/concurrent approach. I have some experience doing it with e.g. Java/Spring (as boring as it is complicated), but I wanted to see how easy it could be to use Scala on the JVM with Akka, an actor-based toolkit, for that kind of ETL data transformation.

I prepared 2 projects that do exactly the same thing - processing a Wikivoyage XML file with wiki markup and getting some statistics about it.

git clone https://github.com/codesurf42/wikiParser.git
git clone https://github.com/codesurf42/wikiParserLinear.git

The first one is based on Akka actors for concurrent processing and Akka agents for counting data; the other one is a reference solution (linear processing) to compare efficiency.
There are also simple metrics counting execution time and number of calls - I found them very useful for checking consistency across both solutions too.

Think asynchronously

XML is processed the StAX way: a linear loop over the input XML stream. Due to its nature (keeping the current state), there can be only a single such loop.
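For readers who have not used StAX, such a loop looks roughly like the sketch below. It is not the code from the wikiParser project; the object and method names are made up, and it only collects `<title>` texts using the JDK's `javax.xml.stream` API.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxSketch {
  // Pull events one by one; `inTitle` and `text` are the "current state" that forces a single loop.
  def pageTitles(xml: String): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val titles = List.newBuilder[String]
    val text = new StringBuilder
    var inTitle = false
    try {
      while (reader.hasNext()) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT if reader.getLocalName == "title" =>
            inTitle = true
            text.clear()
          case XMLStreamConstants.CHARACTERS if inTitle =>
            // Text may arrive in several chunks, so accumulate it.
            text.append(reader.getText)
          case XMLStreamConstants.END_ELEMENT if reader.getLocalName == "title" =>
            inTitle = false
            titles += text.toString
          case _ => ()
        }
      }
    } finally reader.close()
    titles.result()
  }
}
```

In the actor version, the body of the `END_ELEMENT` branch is the natural place to send a message to another actor instead of doing the heavy work inside the loop.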

Actors

A good idea is to keep them simple, one type of job per actor - that way it is simple to tune the system by creating a pool of actors when one stage is significantly slower than the others.

Agents

They are used to properly count single values across actors/threads. They can be very simple like a counter:

Parser.agentCount.send(_ + 1)

(the simple var count "in actor" is actually counting, how many messages a single actor processed).

They can also take a function to compute, like for a longest article title:

 
// we can just compare which one is the longest
Parser.agentMaxArtTitle.send(current =>
  if (current.length < e.title.length)
    e.title
  else
    current
)
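For comparison, the same two counters can be sketched with plain JDK atomics instead of Akka agents. This is a swapped-in technique, not what the project uses, and the names below are made up. Unlike `send` on an agent, which applies the function asynchronously, these updates happen synchronously on the calling thread.

```scala
import java.util.concurrent.atomic.{AtomicLong, AtomicReference}

object CountersSketch {
  val count = new AtomicLong(0)
  val longestTitle = new AtomicReference[String]("")

  // Safe to call from many threads: both updates are atomic.
  def record(title: String): Unit = {
    count.incrementAndGet()
    // Same function as the one sent to the agent: keep whichever title is longer.
    longestTitle.updateAndGet(current => if (current.length < title.length) title else current)
    ()
  }
}
```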

Metrics

There are also some in-project metrics, gathered by using actors too - it is very important to see where the time is spent and where we have bottlenecks, to prioritise our effort on improving them. They can also help to check consistency across different solutions - we should simply get the same numbers.

Typesafe console

This is a nice tool for profiling your Akka app. You can watch real-time numbers of messages processed by actors, latency in queues and more. This project is down-versioned to the latest Typesafe console version - to start it with this monitoring tool, simply use:

sbt atmos:run

Typesafe console slows down the runtime and can sometimes go crazy - just so you are aware.

Tests

There are some BDD-style tests for parsing wiki markup. Writing regexps in Scala is easier than in Java (e.g. better quoting of strings, you can avoid \\n within triple double quotes etc.), but it is still not as clean and pure an experience as in Perl. Shame.

Time gains and latency

It turned out that even with a quite complex regexp for parsing article sections, the processing was too fast to see how actors optimise the data flow. So I deliberately added a 1ms delay in WikiParising.getSection to show these effects better.

Here are the timings from WikiParserLinear (in microseconds, followed by the number of calls):

Exec time: 245.12 sec
LongestArt: 106 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 49788
execTime:
         readXml-parseXml:    245083591,          1
               xmlHasNext:    215568312,    4548859
           parseArticle-2:    119644815,      49788
                seePlaces:    118535075,      49788
                 getGeo-1:       549990,      18939
           parseArticle-1:       434810,      49788
           longestArticle:       245901,      49788
                 getGeo-2:        54741,      30849
         readXml-FromFile:        40245,          1
          seePlacesLength:        31297,      49788
[success] Total time: 265 s, completed 19-Jun-2014 00:29:31

and WikiParser (akka) with default actor settings, without any router configurations:

Exec time: 220.03 sec
LongestArt: 0 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 24894
execTime:
         readXml-parseXml:    219988018,          1
                seePlaces:    120414366,      49685
               xmlHasNext:     92934186,    4548859
           longestArticle:      1044310,      49788
           parseArticle-1:       804609,      49788
                 getGeo-1:       535376,      18939
           parseArticle-2:       363918,      49788
                 getGeo-2:        77243,      30849
         readXml-FromFile:        44811,          1
          seePlacesLength:        31267,      49685
       count_parseArticle:            0,      49788
execTime:
         readXml-parseXml:    219988018,          1
                seePlaces:    120704828,      49788
               xmlHasNext:     92934186,    4548859
           longestArticle:      1044310,      49788
           parseArticle-1:       804609,      49788
                 getGeo-1:       535376,      18939
           parseArticle-2:       363918,      49788
                 getGeo-2:        77243,      30849
         readXml-FromFile:        44811,          1
          seePlacesLength:        31297,      49788
       count_parseArticle:            0,      49788

This is already faster, as for example *seePlaces* does its work while *readXml-parseXml* is still running.
There are two dumps of counters because they are not exactly the same - at the time of the first dump some messages were still being processed, like those in seePlaces (49685 / 49788) or seePlacesLength.

Tuning

Now, because we have some bottlenecks, we can do something about them. Let's just get more workers here:

val seePl = system.actorOf(Props[ArticleSeePlacesParser].withRouter(RoundRobinRouter(3)), "seePlaces")

How did it change the timings?

Exec time: 91.51 sec
LongestArt: 0 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 49788
execTime:
                seePlaces:    123174283,      49178
         readXml-parseXml:     91491233,          1
               xmlHasNext:     67099980,    4548859
           longestArticle:      2419829,      49788
           parseArticle-2:       647368,      49788
           parseArticle-1:       609389,      49788
                 getGeo-1:       566737,      18939
                 getGeo-2:        80348,      30849
          seePlacesLength:        31106,      49178
         readXml-FromFile:        15315,          1
       count_parseArticle:            0,      49788

This is about 2.7x faster than the linear version (245.12 s vs 91.51 s) and 2.4x faster than the simple default Akka setup (220.03 s vs 91.51 s)!

As you can see, the cumulative time of *seePlaces* is now longer than the application run time, which depends on the *readXml-parseXml* block.
Technically, all I could do here (tiny dual-core i5) was to off-load everything from the heavy StAX loop and get more hands for *seePlaces*. Perhaps with more cores I could "provide" xml/stax with a core of its own, but it is now the biggest bottleneck anyway.
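The "more hands for the slow stage" effect is not specific to Akka routers. As a rough illustration, here is the same idea with a fixed thread pool and Scala Futures; `slowParse` is a made-up stand-in for the section parser with its artificial 1ms delay, not the project's code.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolSketch {
  // Stand-in for the slow stage: 1ms of "work" per article.
  def slowParse(article: String): Int = {
    Thread.sleep(1)
    article.length
  }

  // Run the slow stage on `workers` threads; results come back in input order.
  def parseAll(articles: Seq[String], workers: Int): Seq[Int] = {
    val pool = Executors.newFixedThreadPool(workers)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.traverse(articles)(a => Future(slowParse(a))), 1.minute)
    finally pool.shutdown()
  }
}
```

With `workers = 3`, three articles are being parsed at any moment, which is the same trade-off as `RoundRobinRouter(3)`: more throughput for the slow stage, up to the number of cores you actually have.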

Divide and conquer!

There is room for further enhancements - I can do something about this linear StAX loop, for example divide the whole XML file into a few chunks, where each split point falls between the nearest </page> <page> elements, and process them concurrently.

On bigger machines / clusters there will be more questions: How big should a pool of specific actors be? Auto-resized? How should it change over the execution time, what strategies can we use? Somehow it should depend on queues, latencies, the number of waiting messages and priorities for specific tasks/actors (e.g. metrics can wait longer), with routing strategies adapted to it - so we can react to backlogs of messages waiting to be processed. And so on...
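A first cut at finding such split points could look like this. It is only a sketch under simplifying assumptions: the input is already in memory as a string containing just the sequence of `<page>` elements (no wrapper element), and the literal text `</page>` never appears inside article content. The names are made up.

```scala
import scala.annotation.tailrec

object ChunkSketch {
  private val EndTag = "</page>"

  // Cut `body` into roughly `parts` chunks, moving every cut forward to just after the next </page>.
  def split(body: String, parts: Int): Vector[String] = {
    require(parts >= 1, "parts must be at least 1")
    val target = math.max(1, body.length / parts)

    @tailrec
    def loop(from: Int, acc: Vector[String]): Vector[String] =
      if (from >= body.length) acc
      else {
        val idx = body.indexOf(EndTag, math.min(from + target, body.length))
        // No further end tag: the rest of the input becomes the last chunk.
        val end = if (idx < 0) body.length else idx + EndTag.length
        loop(end, acc :+ body.substring(from, end))
      }

    loop(0, Vector.empty)
  }
}
```

Each chunk can then be wrapped in a root element and fed to its own StAX loop. For a real dump file you would seek in the file instead of loading it all into a string.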

Check it

All this code is on GitHub (akka, linear version). You should only need sbt to get them working.

Sunday, 15 June 2014

Strings and numbers: Scala can be easier than Ruby or Python thanks to type inference

When I was working on some Python code, I realized that gluing different parameters together is actually more complicated in Python (or Ruby) than in Scala or even in Java!

One would expect statically typed languages to be "by default" more complicated than flexible, dynamic Python or Ruby. Actually, that may not be correct.

Scala has a really nice feature called type inference. This is something that can feel so natural and save your time so often that you would wonder later why other languages are so reluctant to pick it up. Let's see some examples.

I would like to concatenate a string and an int, like the code-number of a flight or road, e.g. aaa123

"aaa" + 123
foo + bar

Scala (sbt console):

scala> val foo = "aaa"
foo: String = aaa
scala> val bar = 123
bar: Int = 123
scala> foo + bar
res0: String = aaa123

Simple!

Python (ipython):

In [1]: foo = "aaa"
In [2]: bar = 123
In [3]: foo + bar
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 foo + bar
TypeError: cannot concatenate 'str' and 'int' objects

In [4]: foo + str(bar)
Out[4]: 'aaa123'

Ruby (irb):

And how about if we want to add another number value:

scala> val baz = 3
baz: Int = 3
scala> foo + bar + baz
res1: String = aaa1233
scala> baz + bar + foo
res2: String = 126aaa

but perhaps I need aaa126?

I can just use parentheses to show the proper order of evaluation, so the two numbers are added first and only then concatenated to the string:

scala> foo + (bar + baz)
res4: String = aaa126

To do the same in Python you would need explicit conversions:

In [8]: str(baz + bar) + foo
Out[8]: '126aaa'

and similar conversion thing in Ruby:

>> (bar + baz).to_s + foo
=> "126aaa"

This could feel a little bit complicated, especially when you see that even in this monstrous, boilerplate-driven Java you can also write it just like this:

String foo = "aaa";
int bar = 123;
System.out.println(foo + bar); // aaa123

int baz = 3;

System.out.println(foo + bar + baz); // aaa1233
System.out.println(bar + baz + foo); // 126aaa
System.out.println(foo + (bar + baz)); // aaa126

How about Perl, the old school Swiss Army knife for data processing? Perl (and in similar fashion Visual Basic), gives you a choice - what kind of "addition" you want to do, a math (+) or a string concatenation (. for Perl or & if this is VB):

perl -le ' $foo="aaa"; $bar=123; $baz=3; print $bar + $baz.$foo '
126aaa

perl -le ' $foo="aaa"; $bar=123; $baz=3; print $foo.($bar + $baz) '
aaa126

Just remember that Perl tries really hard to convert values - be sure this is what you really want, like here:

perl -le ' $gbp = "100gbp"; $eur = "100eur"; print($gbp + $eur) '
200
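Back in Scala, there is one more option worth knowing for this kind of gluing: string interpolation, which makes the intended grouping explicit without relying on operator order. A small sketch reusing the values from above (the object name is just for illustration):

```scala
object InterpolationSketch {
  val foo = "aaa"
  val bar = 123
  val baz = 3

  // ${...} evaluates the arithmetic first, then the result is spliced into the string.
  val code: String = s"$foo${bar + baz}"              // aaa126

  // The f interpolator adds printf-style formatting, here zero-padding to 5 digits.
  val padded: String = f"$foo%s${bar + baz}%05d"      // aaa00126
}
```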