@gousiosg
Last active September 23, 2016 14:55
Experiments with various languages on low level file parsing

So today I was experimenting with various languages in order to make the GHTorrent MySQL "CSV" dumps behave like RFC-compliant (RFC 4180) CSV files. This involved parsing multi-GB, UTF-8 encoded files and running a small state machine at the character level. I started with Ruby, but it was slow:

$ time ruby csvify.rb projects.csv >/dev/null

real	0m36.714s
user	0m35.689s
sys	0m0.371s

Surely Python should be way faster, right?

$ time python csvify.py projects.csv >/dev/null

real	0m56.429s
user	0m55.625s
sys	0m0.381s

Oops! Even worse. I have been reading lots of good stuff about Rust, and the language itself looked suspiciously similar to my beloved Scala (with the nice addition that it compiles down to the metal), so I gave it a try:

$ rustup run nightly rustc -O csvify.rs
$ time ./csvify projects.csv >/dev/null

real	0m16.289s
user	0m15.819s
sys	0m0.310s

Nice! I am using the nightly version to enable the very convenient chars() function on readers (Read::chars()). The Rust version is comparable in size to the Ruby one (Ruby being my go-to language for everyday hacks) and is dramatically faster on this file: about 2.3x faster than Ruby and about 3.5x faster than Python. It took me no more than two hours to implement the Rust solution, even though I am a complete n00b with the language. I will take Rust more seriously.
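The state machine itself is tiny, and it is easier to see when written as a pure function over a string instead of a file. The following is a minimal Python sketch that mirrors the logic of the Ruby script; the function name `csvify` is only for illustration, and it assumes the whole input fits in memory, which is not true for the real multi-GB dumps.

```python
def csvify(text):
    """Apply the backslash-handling state machine to a string and return the result."""
    out = []
    may_be_eol = False  # True right after an unpaired backslash
    for c in text:
        if c == "\\":
            if may_be_eol:
                out.append("\\\\")  # two backslashes in a row: emit both
                may_be_eol = False
            else:
                may_be_eol = True   # hold the backslash until the next character
        elif c == "\n":
            # backslash + newline is a line break inside a field: emit a space
            out.append(" " if may_be_eol else c)
            may_be_eol = False
        elif c == "\r":
            out.append(" ")         # carriage returns always become spaces
        else:
            if may_be_eol:
                out.append("\\")    # the held backslash escaped something else
            out.append(c)
            may_be_eol = False
    return "".join(out)


# A record whose second field contains an escaped line break
print(repr(csvify('1,"two\\\nlines"\n')))  # → '1,"two lines"\n'
```

Only the backslash-newline pair and stray carriage returns are rewritten; every other escape sequence passes through unchanged.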

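#!/usr/bin/env python3
# csvify.py
import sys


def chars_from_file(filename, chunksize=1024 * 1024):
    # Read 1 Mi characters at a time so multi-GB inputs never have to fit
    # in memory. newline="" disables newline translation, so "\r" reaches
    # the state machine unchanged.
    with open(filename, encoding="utf-8", newline="") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            for c in chunk:
                yield c


if len(sys.argv) < 2:
    sys.stderr.write("usage: csvify.py file > output\n")
    sys.exit(1)

may_be_eol = False  # True right after an unpaired backslash
for c in chars_from_file(sys.argv[1]):
    if c == '\\':
        if may_be_eol:
            # Two backslashes in a row: emit both
            sys.stdout.write('\\\\')
            may_be_eol = False
        else:
            may_be_eol = True
    elif c == "\n":
        if may_be_eol:
            # Backslash + newline is a line break inside a field
            sys.stdout.write(' ')
        else:
            sys.stdout.write(c)
        may_be_eol = False
    elif c == "\r":
        sys.stdout.write(' ')
    else:
        if may_be_eol:
            # The held backslash escaped some other character
            sys.stdout.write('\\')
        sys.stdout.write(c)
        may_be_eol = False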

#!/usr/bin/env ruby
# csvify.rb
if ARGV.size == 0
  STDERR.puts "usage: csvify.rb file > output"
  exit 1
end

may_be_eol = false # true right after an unpaired backslash
ARGF.each_char do |c|
  case c
  when '\\'
    if may_be_eol
      # Two backslashes in a row: emit both
      STDOUT.write '\\\\'
      may_be_eol = false
    else
      may_be_eol = true
    end
  when "\n"
    if may_be_eol
      # Backslash + newline is a line break inside a field
      STDOUT.write ' '
    else
      STDOUT.write c
    end
    may_be_eol = false
  when "\r"
    STDOUT.write ' '
  else
    if may_be_eol
      # The held backslash escaped some other character
      STDOUT.write '\\'
    end
    STDOUT.write c
    may_be_eol = false
  end
end
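
// csvify.rs
// `Read::chars()` was an unstable API behind the `io` feature gate, so this
// needs a nightly toolchain that still ships it.
#![feature(io)]

use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::BufReader;

fn main() {
    let path = match env::args().nth(1) {
        Some(p) => p,
        None => panic!("One argument required"),
    };
    let f = match File::open(path) {
        Err(why) => panic!("couldn't open file: {}", why),
        Ok(file) => file,
    };
    let stream = BufReader::new(f);
    // True right after an unpaired backslash; starts out false, as in the
    // Ruby and Python versions.
    let mut maybe_eol = false;
    for c in stream.chars() {
        let r = c.unwrap();
        match r {
            '\\' => {
                if maybe_eol {
                    // Two backslashes in a row: emit both
                    print!("\\\\");
                    maybe_eol = false;
                } else {
                    maybe_eol = true;
                }
            }
            '\n' => {
                if maybe_eol {
                    // Backslash + newline is a line break inside a field
                    print!(" ");
                } else {
                    print!("{}", r);
                }
                maybe_eol = false;
            }
            '\r' => print!(" "),
            _ => {
                if maybe_eol {
                    // The held backslash escaped some other character
                    print!("{}", '\\');
                }
                print!("{}", r);
                maybe_eol = false;
            }
        }
    }
}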
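
The Rust version leans on chars() to turn a buffered byte stream into decoded UTF-8 characters. A rough Python counterpart is sketched below, assuming the input is read as raw bytes in fixed-size chunks; the helper name `iter_chars` and its parameters are made up for illustration. It uses the standard library's incremental decoder so that a multi-byte character split across two chunks is still decoded correctly.

```python
import codecs
import io


def iter_chars(stream, chunksize=64 * 1024, encoding="utf-8"):
    """Yield decoded characters one at a time from a binary stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        # The decoder buffers an incomplete multi-byte sequence until the
        # bytes that complete it arrive in a later chunk.
        for c in decoder.decode(chunk):
            yield c
    # Flush the decoder; raises UnicodeDecodeError if input ended mid-character
    for c in decoder.decode(b"", final=True):
        yield c


# A 2-byte chunk size forces the 2-byte "é" to straddle a chunk boundary
data = io.BytesIO("h\u00e9llo".encode("utf-8"))
print(list(iter_chars(data, chunksize=2)))  # → ['h', 'é', 'l', 'l', 'o']
```

The same generator could feed the state machine directly, replacing the file-reading helper in the Python script.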