@fxn
Created May 22, 2011 16:06
Runtime comparison of a simple problem solved in Ruby, Python, and Perl
I wanted to process the Yahoo! GeoPlanet places CSV, which has about
5.7 million rows, to compute the ancestors of every place and cache
them in an additional column. That's related to a post I am writing
about importing this data into Postgres.
The CSV file has the WOEID of each place in the first field, and the
WOEID of its parent in the last field. I wanted to initialize a hash
mapping the former to an array containing the latter, in order to
build the ancestor chain of each place there later.
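For instance, a single row (made up here; the real file has more columns, but only the first and last fields matter) yields one hash entry, with the parent wrapped in an array so the rest of the chain can be appended later:

```ruby
# Hypothetical row: WOEID, name, place type, parent WOEID.
line = "2487956\tSan Francisco\tTown\t23424977\n"

fields = line.chomp.split("\t")
ancestors = { fields.first => [fields.last] }
# => { "2487956" => ["23424977"] }
```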
Since this is in the context of a Rails application run by Ruby 1.8.7,
I tried Ruby first, but performance was horrible:
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl
use strict;

my %ancestors = ();
while (my $line = <>) {
  next unless $line =~ /^\d/;
  chomp $line;
  my @fields = split /\t/, $line;
  $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough):
import fileinput
import re
ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter on a mid-2009 13'' MacBook Pro
with an OWC SSD that does about 280 MB/s. The input
file is geoplanet_places_7.6.0.tsv, included in the zip file
http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in
http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter | I/O | split | map | TOTAL |
+----------+----------------+--------+--------+---------+---------+
| Ruby | Rubinius HEAD | 1m 7s | 1m 42s | 20m 19s | 23m 8s |
| Ruby | MRI 1.8.7-p334 | 8s | 34s | 7m 8s | 7m 50s |
| Ruby | MRI 1.9.2-p180 | 8s | 12s | 1m 54s | 2m 14s |
| Ruby | JRuby 1.6.1 | 9s | 7s | 1m 8s | 1m 28s |
| Python | CPython 2.7.1 | 23s | 12s | 25s | 1m 0s |
| Perl | Perl 5.12.3 | 4s | 25s | 25s | 54s |
+----------+----------------+--------+--------+---------+---------+
The "I/O", "split", and "map" columns are simple time splits, measured
by commenting out the corresponding lines.
The "I/O" column measures the loop including the regexp match and the
chomp in the case of Ruby and Perl. The "split" column measures the
cost of splitting the lines itself, and the "map" column measures new
array instantiation and building the hash, the last line in the
loop body.
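The stage times themselves came from commenting lines out between runs. As an illustrative alternative (synthetic data, not the GeoPlanet file), Ruby's Benchmark module can time each stage in isolation:

```ruby
require "benchmark"

# Synthetic stand-in for the TSV rows.
lines = Array.new(100_000) { |i| "#{i}\tsome place\t#{i + 1}\n" }

# Roughly the "I/O" component minus reading: the regexp match per line.
match_time = Benchmark.realtime { lines.each { |l| l =~ /\A\d/ } }

# The "split" component: tokenizing each line on tabs.
split_time = Benchmark.realtime { lines.each { |l| l.split("\t") } }
```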
JRuby was run with the --server flag; see also Charles's comments below.
I suspect there's something going on in Rubinius HEAD that skews the
results, since it is a fast interpreter generally speaking.
I went with Perl in the end for this particular problem. Processing
the original CSV, building the complete ancestors chain for each place,
and printing the original CSV with the ancestry cache as an additional
column, takes about 2 minutes (https://gist.github.com/983977).
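The linked script is Perl, but the second pass can be sketched in a few lines of Ruby (WOEIDs are made up; this assumes the root has no parent entry in the map):

```ruby
# Map of WOEID => parent WOEID, as built in the first pass
# (values unwrapped here for brevity).
parents = { "3" => "2", "2" => "1" } # "1" is the root, no entry

ancestors = {}
parents.each_key do |woeid|
  chain = []
  current = woeid
  # Walk up until we hit a WOEID with no known parent (the root).
  while (parent = parents[current])
    chain << parent
    current = parent
  end
  ancestors[woeid] = chain
end
# => { "3" => ["2", "1"], "2" => ["1"] }
```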
The table above should not be extrapolated; it is a very concrete
comparison, but one that was meaningful for the task at hand.
@headius

headius commented May 23, 2011

There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).

The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here) which could help all impls.
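One way to act on that suggestion (a sketch, not code from the gist) is to slice out only the first and last fields, skipping the intermediate array that a full split allocates per line:

```ruby
line = "2487956\tSan Francisco\tTown\t23424977\n"

first_tab = line.index("\t")
last_tab  = line.rindex("\t")

woeid  = line[0, first_tab]            # first field
parent = line[last_tab + 1..-1].chomp  # last field, newline stripped
```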

@fxn

fxn commented May 23, 2011

@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the Zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in http://developer.yahoo.com/geo/geoplanet/data/.

@fxn

fxn commented May 23, 2011

@joaquinferrero I've finally moved the chomp down. Performance is unaffected, since it applies to all lines but one or two, but a reader of the code with no knowledge of the data would see a chomp that might be unnecessary. Chomping after the next line is more natural. Thanks!
