Sunday, 16 November 2014

Scala/Play with sbt on IBM Bluemix (Heroku-like app cloud / PaaS)

0) IBM Bluemix is an app-cloud, similiar to Heroku, deployed with CloudFoundry on their own cloud infrastructure.

1) It works.

2) Just make a simple shell script that is going to call sbt/activator. Example https://hub.jazz.net/project/dev8661/BluePlayScala/overview
No need for gradle/exec builder wrapper.

3) Do not forget to include manifest.yml to your project. Yes, Bluemix could be more verbose about the fact that this file is missing / it has no idea what is where and how it should be deployed.

3) Check dashboard logs - it doesn't help to starve your scala app with too low memory (mine got ok when I increased mem from 128 to 512, even if 512 was already an upper limit when started with 128).

4) I like pipelines - build/CI infrastructure. This is how you can pack complex tasks into one clean overview.


5) Thanks for all cooperation during last Krk Scala hack-meetup - it was worth to attend and "get something done".

Thursday, 19 June 2014

Parallel XML processing with Scala & Akka - 3x faster!

I wanted to check how easy or hard it is to optimise XML processing with parallel/concurrent approach. I have some experience doing it with e.g. Java/Spring (as much boring as complicated), but I wanted to see how easy it could be to use a JVM Scala and Akka, an actor-based toolkit for that kind of ETL data transformations.

I prepared 2 projects that are doing exactly the same thing - processing Wiki-voyage XML and wiki-markup file and getting some statistics about it.
git clone https://github.com/codesurf42/wikiParser.git
git clone https://github.com/codesurf42/wikiParserLinear.git

The first one is based on Akka actors for concurrent processing and Akka agents for counting data, another one is a reference solution (linear processing) to compare efficiency.
There are also simple metrics counting execution time and number of calls - I found them very useful to check consistency across both solutions too.

Think asynchronously



XML is processed in a StAX way, this is a linear loop over the input xml stream and due to its nature (keeping current state) that loop has to be single.

Actors


The good idea is to keep them simple, one type of job per actor - that way it is simple to tune system by creating pool of actors when the stage is significantly slow than others.

Agents


They are used to properly count single values across actors/threads. They can be very simple like a counter:

Parser.agentCount.send(_ + 1)

(the simple var count "in actor" is actually counting, how many messages a single actor processed).

They can also take a function to compute, like for a longest article title:

 
// we can just compare which one is the longest
Parser.agentMaxArtTitle.send(current =>
  if (current.length < e.title.length)
    e.title
  else
    current
)




Metrics


There are also some in-project metrics gathered by using actors too - it is very important to see where the time is spend and where we have bottlenecks to prioritise our effort on improving it. They can also help to check consistency across different solutions - simply we should get the same numbers.

Typesafe console


This is a nice tool for profiling your Akka app. You can watch real-time numbers of messages processed by actors, latency in queues and more. This project is down-versioned to the latest Typesafe console version - to start it with this monitoring tool, simply use:

sbt atmos:run

Typesafe console is slowing down the runtime and sometimes can go crazy - just to be aware.

Tests


There are some BDD-style tests for parsing wiki markup. Writing regexp in Scala is easier than in Java (eg. better quoting of strings, you can avoid \\n within triple doublequotes etc.), but it is still not that clean and pure experience like in Perl. Shame.

Time gains and latency


It happened that even when using quite complex regexp for parsing article sections it was too fast to see how actors are optimising data flow. So I deliberately added 1ms delay in WikiParising.getSection to show these effects better.

Here are timings from WikiParserLinear (in microseconds, with a number of calls):

Exec time: 245.12 sec
LongestArt: 106 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 49788
execTime:
         readXml-parseXml:    245083591,          1
               xmlHasNext:    215568312,    4548859
           parseArticle-2:    119644815,      49788
                seePlaces:    118535075,      49788
                 getGeo-1:       549990,      18939
           parseArticle-1:       434810,      49788
           longestArticle:       245901,      49788
                 getGeo-2:        54741,      30849
         readXml-FromFile:        40245,          1
          seePlacesLength:        31297,      49788
[success] Total time: 265 s, completed 19-Jun-2014 00:29:31

and WikiParser (akka) with default actor settings, without any router configurations:

Exec time: 220.03 sec
LongestArt: 0 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 24894
execTime:
         readXml-parseXml:    219988018,          1
                seePlaces:    120414366,      49685
               xmlHasNext:     92934186,    4548859
           longestArticle:      1044310,      49788
           parseArticle-1:       804609,      49788
                 getGeo-1:       535376,      18939
           parseArticle-2:       363918,      49788
                 getGeo-2:        77243,      30849
         readXml-FromFile:        44811,          1
          seePlacesLength:        31267,      49685
       count_parseArticle:            0,      49788
execTime:
         readXml-parseXml:    219988018,          1
                seePlaces:    120704828,      49788
               xmlHasNext:     92934186,    4548859
           longestArticle:      1044310,      49788
           parseArticle-1:       804609,      49788
                 getGeo-1:       535376,      18939
           parseArticle-2:       363918,      49788
                 getGeo-2:        77243,      30849
         readXml-FromFile:        44811,          1
          seePlacesLength:        31297,      49788
       count_parseArticle:            0,      49788

This is already faster, as for example *seePlaces* takes time when *readXml-parseXml* happens.
There are two dumps of counters as they are not exactly the same - there are still some messages being processed, like these in seePlaces (49685 / 49788) or seePlacesLength.

Tunning


Now, because we have some bottlenecks, we can do something about them. Let's just get more workers here:

val seePl = system.actorOf(Props[ArticleSeePlacesParser].withRouter(RoundRobinRouter(3)), "seePlaces")


How it changed timings?

Exec time: 91.51 sec
LongestArt: 0 (File:FireShot Screen Capture -002 - 'Bali – Travel guides at Wikivoyage' - en wikivoyage org wiki Bali.jpg), count: 49788, 49788
execTime:
                seePlaces:    123174283,      49178
         readXml-parseXml:     91491233,          1
               xmlHasNext:     67099980,    4548859
           longestArticle:      2419829,      49788
           parseArticle-2:       647368,      49788
           parseArticle-1:       609389,      49788
                 getGeo-1:       566737,      18939
                 getGeo-2:        80348,      30849
          seePlacesLength:        31106,      49178
         readXml-FromFile:        15315,          1
       count_parseArticle:            0,      49788

This is almost 3x faster that linear version and 2.5x faster than with simple default Akka setup!

You can see, cumulative time of *seePlaces* is now longer than application run time, which depends on *readXml-parseXml* block.
Technically all I could do here (tiny i5 dual core) was to off-load anything from heavy StAX loop and get more hands for *seePlaces*. Perhaps with more cores, I could be able to "provide" xml/stax with a single core, but it is now a biggest bottleneck anyway.

Divide and conquer!


There is a room for further enhancements - I can do something about this linear stax loop, for example divide whole XML file into few chunks, where the split point should be between nearest &lt;/page> &lt;page> elements and process them concurrently.
On bigger machines / clusters there will be more questions: How big should be a pool of specific actors? Auto-resized? How should it change over the execution time, what strategies can we use? Somehow it should depend on queues, latencies and number of waiting messages and priorities for specific tasks/actors (e.g. metrics can wait longer) and adopting routing strategies to it - so we can react to backlogs of messages waiting to be processed. And so on...

Check it


All this code is in github (akka, linear version). You should only need sbt to get them working.

Sunday, 15 June 2014

Strings and numbers: Scala can be easier than Ruby or Python thanks to type inference

When I was working on some python code I realized, gluing different parameters together is actually more complicated in Python (or Ruby) than in Scala or even in Java!

Someone would expect statically typed languages should be "by default" more complicated than flexible, dynamic Python or Ruby. Actually it may not be correct.

Scala has a really nice feature called types inference. This is something that could fill so natural and saves your time so often, you would wonder later why other languages are so resistant to pick it up. Let's see examples.

I would like to make string and int concatenation, like a code-number of a flight or road, e.g. aaa123
"aaa" + 123
foo + bar

Scala (sbt console):

scala> val foo = "aaa"
foo: String = aaa
scala> val bar = 123
bar: Int = 123
scala> foo + bar
res0: String = aaa123
Simple!

Python (ipython):
In [1]: foo = "aaa"
In [2]: bar = 123
In [3]: foo + bar
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 foo + bar
TypeError: cannot concatenate 'str' and 'int' objects

In [4]: foo + str(bar)
Out[4]: 'aaa123'
Ruby (irb):


And how about if we want to add another number value:

scala> val baz = 3
baz: Int = 3
scala> foo + bar + baz
res1: String = aaa1233
scala> baz + bar + foo
res2: String = 126aaa
but perhaps I need aaa126?

I can just use parentheses to show the proper execution flow so type inference will not change the context of second operator:

scala> foo + (bar + baz)
res4: String = aaa126
To do the same in Python you would need explicit conversions:
In [8]: str(baz + bar) + foo
Out[8]: '126aaa'
and similar conversion thing in Ruby:
>> (bar + baz).to_s + foo
=> "126aaa"
This could feel a little bit complicated, especially when you see than even in this monstrous, boiler-plate driven Java you can also write just like this:

String foo = "aaa";
int bar = 123;
System.out.println(foo + bar); // aaa123

int baz = 3;

System.out.println(foo + bar + baz); // aaa1233
System.out.println(bar + baz + foo); // 126aaa
System.out.println(foo + (bar + baz)); // aaa126

How about Perl, the old school Swiss Army knife for data processing? Perl (and in similar fashion Visual Basic), gives you a choice - what kind of "addition" you want to do, a math (+) or a string concatenation (. for Perl or & if this is VB):

perl -le ' $foo="aaa"; $bar=123; $baz=3; print $bar + $baz.$foo '
126aaa

perl -le ' $foo="aaa"; $bar=123; $baz=3; print $foo.($bar + $baz) '
aaa126

Just remember Perl is trying conversion really hard - be sure this is what you really want, like here:

perl -le ' $gbp = "100gbp"; $eur = "100eur"; print($gbp + $eur) '
200

Saturday, 23 November 2013

How to kill (softly) a software project

1. Monolith.
Keep the whole code base as a one big stack. That way after few years no one will know how it exactly works. You can always try to hire more and more highly skilled developers to work on it (if you have enough money and there are still such people on the market), but most of their time will be spend on solving problems that were purely created by such design. Simply speaking, 80% of your money will go into preserving the current "state of art", not into new business features.
There is another problem, driven by complexity - dependency war, when feature A uses library L in a version v1, but module C used by feature B needs L v2. This is the case when you have a good dependency management system - otherwise they can go silently into your prod but later will need endless time to debug them.

2. Closed stack, NIH.
There should be an enormous effort to add a new Open Source library. Three levels of managers that have no dev-clue should agree. A procedure to get a new CPAN module should be such a pain in the ass that no one would like to do it again. Forget about cloned images, maven/sbt/capistrano/perlbrew/setuptools/gradle or system packaging for easier distribution on your servers. That way your money will be spend on rediscovering wheels again and again. This seems to be a common approach in many big companies with "secure programming" and "not invented here" paradigms. It is a quick way to kill any innovation like newer templates, better JSON handlers or just day to day improvements.

3. Always stay with the same database. Do not try to evaluate new versions, forget about all these crappy no-sql approaches. Who cares what Redis could be better than Memcache, that Mongo could do a better job that MySQL, that Cassandra could be better than your oldy-goody shared disk? Enough is enough.

Oh, crap. What is wrong with so many companies?

Tuesday, 30 October 2012

Want to do maths? You are 120x better off with Scala than Perl

To be honest, I did expect some difference in performance between statically typed Scala (JVM-based language) and dynamic Perl... but maybe not that big. Anyway, check it out.

Here is a very simple way to compute prime numbers:

==== PrimeNumbers.scala
object PrimeNumbers {
   def isPrime(n: Int): Boolean = (2 until n) forall (d => n % d != 0)
 
   def main(args: Array[String]): Unit = {
    (8999999 until 8999999+200).map(x => if (isPrime(x)) print(x+" ") )
   }
}

==== prime_numbers.pl
sub isPrime {
    my $n = shift;
    grep( $n % $_ == 0, (2..($n-1))) ? 0 : 1
}

for(8999999..(8999999+200)) {
    isPrime($_) && print "$_\n"
}


time scala PrimeNumbers
9000011 9000041 9000049 9000059 9000067 9000119 9000127 9000143 9000163 9000193
real    0m3.187s

time perl prime_numbers.pl
real    6m23.221s

Really? This huge JVM-bloated Scala is about 120x faster in such things than Perl... can't belive!

(Perl v5.14.2, Scala 2.9.1)

Friday, 4 May 2012

Missed your Germanwings flight departure from London Heathrow or Stansted?

Do you trust your airline when checking your flight departure? If they have no idea of "local time" and concept of time zones, be careful or you can miss your flight! I just wonder how many passengers of Germanwings airline flying from UK are going to find that their flight has just gone... according to schedule.... errrr... what schedule? They publish departures times in "their" time zone, not the local one. It is weird, but... well... it is Germanwings. You've been warned.

 Germanwings airline website:
London Heathrow airport website:

Ps - I spent some creative time working with time zones code. It is definitely not an easy thing even if it seems to be quite obvious.
Ps 2 - Germanwings has no email, only premium phone numbers. No way to report them a bug, sorry Germanwings passengers.
Ps 3 - It is not like intermittent - after 2 weeks, they are still proud to have this silly bug.
Ps 4 - It looks like Germanwings... well... I hope they are never going to fly from New York or Dubai.

Tuesday, 8 November 2011

Grub, no count down after failed run, how to fix it

This default grub behavior could be very annoying especially with virtual machines running headless or whatever should stand up after failure. It could be fixed by modifying make_timeout () in /etc/grub.d/00_header: Original code: ****
make_timeout ()
{
    cat << EOF
if [ "\${recordfail}" = 1 ]; then
  set timeout=-1
else
  set timeout=${2}
fi
EOF
}
**** here is how to fix it: ****
make_timeout ()
{
    cat << EOF
if [ "\${recordfail}" = 1 ]; then
  set timeout=${2}
else
  set timeout=${2}
fi
EOF
}
**** /I know it is a "secure programming" approach in terms of how this code looks like now... and yes, it is/ The original hint is taken from here: http://forum.xbmc.org/showthread.php?t=64857

Tuesday, 11 October 2011

cssh - cluster ssh - do your cluster work easier

This is a very simple but powerful tool for interactive work with many machines at once
Just log in to them
cssh node1 node2 node3
and do the same operations on all of them.

It needs x-window (mmm, just wonder if there is a pure-text replacement?) and it is worth to configure ssh keys and perhaps ssh-agent to make your logins easier but still safe.
cssh comes with packages for ubuntu/debian, not sure if redhat/centos rpms catch up here. This project page is http://sourceforge.net/projects/clusterssh/
Many thanks for this tool!

Tuesday, 27 September 2011

CPAN of the week

Carp::Always - this is what you need when perl -MCarp=verbose doesn't work the way you expect it. Spotted in some post by JRockway.

Mock::Quick - mock whatever you want, but also takeover/partially override existing modules and classes. The beauty of certainty for TDD or just when escaping from too many dependencies.

Sunday, 5 June 2011

Git ignore file

I just find myself like searching for this again and again as I am not a very heavy git user now, but setting global core.excludesfile seems to be very similar to svn/cvs approach.

How to ignore certain types of files in git?

1) Put the excludes in your $GIT_DIR/info/exclude file (.git/info/exclude), if this is specific to one tree.

2) Run git config --global core.excludesfile ~/.gitignore and add patterns to your ~/.gitignore. This option applies if you want to ignore certain patterns across all trees.

/taken from StackOverflow by Emil

Wednesday, 16 March 2011

Back ticks equivalent in cmd / windows

How to get a result of one command and use it in another command in Windows shell, like with backticks in Unix/Linux shell?

Here is a working example:

for /f "tokens=1* delims=" %%s in ('gnuwin32\date +"%%Y%%m%%d_%%H%%M"') do echo output_%%s.log

> output_20110215_1057.log


Works in Windows XP, Vista, Windows 7.

Remember:
1) The variable used in a for loop must be a one letter (at least in XP, Vista)
2) gnuwin32 is a set of GNU tools like grep/sed/awk/date which can be run natively in Windows without cost of additonal Cygwin-like layers

Thursday, 10 February 2011

Mr Openoffice is happy with default A4 page

This is that kind of thing that you know you want to change, but there is always no time to do it. But it is just a minute, really!

http://selinap.com/2008/12/how-to-set-openoffice-default-page-format-to-a4/

Simply:
1) change page format
2) save it as a template (File -> template)
3) set this template as a default (File -> template -> organize -> commands -> set as a default tpl)

Do it, do it!