So today I was experimenting with various languages in order to make the GHTorrent MySQL "CSV" dumps behave like RFC 4180-compliant CSV files. This involved parsing multi-GB, UTF-8 encoded files and running a small state machine at the character level. I started with Ruby, but it was slow:
$ time ruby csvify.rb projects.csv >/dev/null
real 0m36.714s
user 0m35.689s
sys 0m0.371s
Surely Python should be way faster, right?
$ time python csvify.py projects.csv >/dev/null
real 0m56.429s
user 0m55.625s
sys 0m0.381s
Oops! Even worse. I have been reading lots of good things about Rust, and the language looked suspiciously similar to my beloved Scala (with the nice addition that it compiles down to the metal), so I gave it a try:
$ rustup run nightly rustc -O csvify.rs
$ time ./csvify projects.csv >/dev/null
real 0m16.289s
user 0m15.819s
sys 0m0.310s
Nice!! I am using the nightly version to enable the very convenient, still unstable
Read::chars() method. The Rust version is comparable in size to the Ruby one (Ruby
being my go-to language for everyday hacks) and is more than twice as fast. It did
not take me more than two hours to implement the Rust solution, even though I am a
complete n00b with the language. I will take Rust more seriously.
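For readers wondering what a "small state machine at the character level" looks like, here is a minimal sketch in Rust. It is not the csvify.rs timed above. It assumes the dumps quote fields with double quotes and escape embedded quotes and backslashes with a backslash, which is a guess at the format rather than something stated here. The function name csvify and the State enum are made up for illustration, and the sketch works on an in-memory string on stable Rust instead of streaming characters from a reader.

```rust
/// States of the character-level scanner (hypothetical names).
#[derive(Clone, Copy)]
enum State {
    Unquoted,     // outside a quoted field
    Quoted,       // inside a quoted field
    QuotedEscape, // just saw a backslash inside a quoted field
}

/// Rewrite backslash-escaped quoting into RFC 4180 quoting, where an
/// embedded quote is written as two quotes.
/// ASSUMPTION: inside quoted fields, `\"` means a literal quote and
/// `\\` means a literal backslash; any other escape is kept verbatim.
fn csvify(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut state = State::Unquoted;
    for c in input.chars() {
        state = match (state, c) {
            // An opening quote starts a quoted field.
            (State::Unquoted, '"') => { out.push('"'); State::Quoted }
            // Everything else outside quotes (including \N) is copied as-is.
            (State::Unquoted, _) => { out.push(c); State::Unquoted }
            // A backslash inside quotes: hold it until we see the next char.
            (State::Quoted, '\\') => State::QuotedEscape,
            // An unescaped quote closes the field.
            (State::Quoted, '"') => { out.push('"'); State::Unquoted }
            (State::Quoted, _) => { out.push(c); State::Quoted }
            // \" becomes "" in RFC 4180.
            (State::QuotedEscape, '"') => { out.push_str("\"\""); State::Quoted }
            // \\ becomes a single backslash.
            (State::QuotedEscape, '\\') => { out.push('\\'); State::Quoted }
            // Unknown escape: keep both characters untouched.
            (State::QuotedEscape, _) => { out.push('\\'); out.push(c); State::Quoted }
        };
    }
    // A dangling backslash at end of input is emitted rather than lost.
    if let State::QuotedEscape = state {
        out.push('\\');
    }
    out
}
```

Calling csvify on one line of a dump at a time would keep memory use flat on multi-GB files, as long as no quoted field contains a raw newline. If fields can span lines, the state has to be carried across reads, which is exactly where a character iterator over the reader comes in handy.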