Runtime comparison of a simple problem solved in Ruby, Python, and Perl
I wanted to process the Yahoo! GeoPlanet places CSV, which has about 5.7 million rows, to compute the ancestors of every place and cache them in an additional column. That's related to a post I am writing about importing this data into Postgres.

The CSV file has the WOEID of each place in the first field, and the WOEID of its parent in the last field. I wanted to initialize a hash that mapped the former to an array containing the latter, to build the ancestor chain of each place there later.

Since this is in the context of a Rails application run by Ruby 1.8.7, I tried that first, but performance was horrible:

ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  # map the place's WOEID to an array holding its parent's WOEID
  ancestors[fields.first] = [fields.last]
end

Out of curiosity, I wrote solutions in Perl

use strict;

my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $ancestors{$fields[0]} = [$fields[-1]];
}

and Python (I hope it is idiomatic enough):

import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]

I did several runs with each interpreter. This is a MacBook Pro 13'' from mid-2009 with an OWC SSD that does about 280 MB/s. The input file is geoplanet_places_7.6.0.tsv, included in the zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:

+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+

The "I/O", "split", and "map" columns are simple time splits measured | |
commenting out their corresponding lines. | |
The "I/O" column measures the loop including the regexp match and the | |
chomp in the case of Ruby and Perl. The "split" column measures the | |
cost of splitting the lines itself, and the "map" column measures new | |
array instantiatiation and building the hash, the last line in the | |
loop body. | |
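
For example, the "I/O" number for the Perl version comes from timing a variant with the last two lines of the loop body commented out, roughly like this (a sketch of the idea, not the exact scripts that were timed):

use strict;

# "I/O" variant: only the read loop, the regexp match and the chomp are
# left in; the split and map lines are commented out. The other columns
# come from progressively uncommenting them and taking the differences.
my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    # my @fields = split /\t/, $line;
    # $ancestors{$fields[0]} = [$fields[-1]];
}
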
JRuby was run with the --server flag; see also Charles's comments below.

I suspect there's something going on in Rubinius HEAD that skews the results, since it is generally speaking a fast interpreter.
I went with Perl in the end for this particular problem. Processing the original CSV, building the complete ancestor chain for each place, and printing the original CSV with the ancestry cache as an additional column takes about 2 minutes (https://gist.github.com/983977).
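
That second pass is essentially a walk up the parent hash for every row, appending the chain as an extra column. Here is a minimal sketch of the idea, assuming a plain WOEID-to-parent map and a comma-separated cache column; the linked gist has the actual script:

use strict;

my $tsv = 'geoplanet_places_7.6.0.tsv';

# First pass: map each WOEID to its parent WOEID, as above.
my %parent;
open my $in, '<', $tsv or die $!;
while (my $line = <$in>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $parent{$fields[0]} = $fields[-1];
}
close $in;

# Walk the parent links upwards to build the ancestor chain of a WOEID.
sub chain_for {
    my $woeid = shift;
    my @chain;
    while (my $p = $parent{$woeid}) {
        last if $p eq $woeid;    # guard against a self-reference
        push @chain, $p;
        $woeid = $p;
    }
    return @chain;
}

# Second pass: print each row with the ancestor chain as an extra column.
open $in, '<', $tsv or die $!;
while (my $line = <$in>) {
    chomp $line;
    unless ($line =~ /^\d/) {
        print "$line\n";    # non-data line, pass it through unchanged
        next;
    }
    my ($woeid) = split /\t/, $line;
    print join("\t", $line, join(',', chain_for($woeid))), "\n";
}
close $in;
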
The table above should not be extrapolated; it is a very concrete comparison, but one that was meaningful for the task at hand.
@joaquinferrero I've finally moved the chomp down. Performance is unaffected, because the filter rejects only one or two lines, but a reader of the code with no knowledge of the data would otherwise see a chomp that might be unnecessary. Chomping after the next line is more normal. Thanks!
@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the Zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in http://developer.yahoo.com/geo/geoplanet/data/.