@fxn
Created May 22, 2011 16:06
Runtime comparison of a simple problem solved in Ruby, Python, and Perl
I wanted to process the Yahoo! GeoPlanet places CSV, which has about
5.7 million rows, to compute the ancestors of every place and cache
them in an additional column. That's related to a post I am writing
about importing this data into Postgres.
The CSV file has the WOEID of each place in the first field, and the
WOEID of its parent in the last field. I wanted to initialize a hash
that mapped the former to an array containing the latter, to later
build the ancestor chain of each place.
Since this is in the context of a Rails application running on Ruby
1.8.7 I tried that first, but performance was horrible:
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl
use strict;
my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough):
import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter on a 13'' MacBook Pro from
mid-2009 with an OWC SSD that does about 280 MB/s. The input file is
geoplanet_places_7.6.0.tsv, included in the zip file
http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at
http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+
The "I/O", "split", and "map" columns are simple time splits measured
commenting out their corresponding lines.
The "I/O" column measures the loop including the regexp match and the
chomp in the case of Ruby and Perl. The "split" column measures the
cost of splitting the lines itself, and the "map" column measures new
array instantiatiation and building the hash, the last line in the
loop body.
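To make that concrete, here is the Ruby loop again with the lines each
column times marked (a sketch of the methodology, not a separate script):

ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!                              # up to here: the "I/O" column
  fields = line.split("\t")                # plus this line: "split"
  ancestors[fields.first] = [fields.last]  # plus this line: "map"
end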
JRuby was run with the --server flag; see also Charles's comments below.
I suspect there's something going on in Rubinius HEAD that skews the
results, since it is a fast interpreter generally speaking.
I went with Perl in the end for this particular problem. Processing
the original CSV, building the complete ancestor chain for each place,
and printing the original CSV with the ancestry cache as an additional
column takes about 2 minutes (https://gist.github.com/983977).
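The chain-building step itself is conceptually simple once the hash is
in memory. A rough Ruby sketch of the idea (the actual solution linked
above is written in Perl, and the method name here is illustrative):

# Walks parent pointers upwards, assuming ancestors maps each WOEID to
# a one-element array holding its parent's WOEID, as in the loop above.
# The places form a tree, so the walk terminates at a WOEID that has
# no entry in the hash.
def ancestor_chain(woeid, ancestors)
  chain = []
  while (parents = ancestors[woeid])
    woeid = parents.first
    chain << woeid
  end
  chain
end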
The table above should not be extrapolated; it is a very specific
comparison, but one that was meaningful for the task at hand.
@vijaydev

@fxn Will the first and last methods contribute any significant time while building the hash?

@foca

foca commented May 22, 2011

@vijaydev Probably not, as those use Array#at rather than Array#[], which is usually slower (it has to check the number of arguments, whether you passed a range, etc., while #at just returns the element at the index).
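For what it's worth, that kind of hypothesis is easy to check with the standard Benchmark library; a quick sketch (labels and iteration count are made up, and the numbers only show relative cost per interpreter):

require 'benchmark'

fields = %w[a b c d e f]
n = 5_000_000
Benchmark.bm(10) do |bm|
  bm.report('first/last') { n.times { fields.first; fields.last } }
  bm.report('at')         { n.times { fields.at(0); fields.at(-1) } }
  bm.report('[]')         { n.times { fields[0]; fields[-1] } }
end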

@fxn

fxn commented May 22, 2011

@vijaydev I've tried it; it didn't make a difference. I also tried compiling the regexp in Python with re.compile (suggested by @njr on Twitter), but no measurable difference either; time seems to be dominated by other stuff.

@headius

headius commented May 22, 2011

Is JRuby run with --server here? We should generally beat 1.9.2 on execution and core class performance. I'm downloading the data to give it a try myself.

@headius

headius commented May 22, 2011

Is this the geoplanet_places_7.6.0.tsv file? Using the Ruby code above?

@headius

headius commented May 22, 2011

FWIW, I got some different numbers...

I modified the Ruby script above as follows:

  • Put the body in a method (JRuby optimizes methods better than top-level script...just haven't ever bothered to fix that)
  • Limited it to 3_000_000 iterations (I don't have enough memory to process the entire file)
  • Read the full file contents into memory first and iterated over lines using StringIO
  • Timed only the processing (though the read was only a couple seconds)

I also passed some flags to JRuby...

  • --server to use optimizing JVM
  • -J-XX:+UseConcMarkSweepGC because it seemed to be very GC-heavy (lots of objects!)
  • -J-Xmx2000M -J-Xms2000M to lock the heap max and min at 2000M (most I can afford on this machine)

With these settings, JRuby processed those 3_000_000 rows in 17s versus Ruby 1.9.2's 24s.
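A rough sketch of that harness (the method name and structure are illustrative, not the exact script I ran):

require 'stringio'
require 'benchmark'

# Slurp the file first, then time only the processing of the first
# 3_000_000 rows from an in-memory StringIO. Run with something like:
# jruby --server -J-XX:+UseConcMarkSweepGC -J-Xmx2000M -J-Xms2000M bench.rb FILE
def process(io, limit)
  ancestors = {}
  count = 0
  while (line = io.gets)
    next unless line =~ /\A\d/
    line.chomp!
    fields = line.split("\t")
    ancestors[fields.first] = [fields.last]
    break if (count += 1) == limit
  end
  ancestors
end

data = File.read(ARGV[0])  # the read itself is excluded from the timing
puts Benchmark.realtime { process(StringIO.new(data), 3_000_000) }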

@fxn

fxn commented May 22, 2011

@headius hey, I just tried the --server flag and it made a significant difference: it is down to 1m 28s, going to update the gist now. Yes, that's the file BTW. Will reply via Twitter as well so that people know about the update there.

@headius

headius commented May 22, 2011

FYI, this is a really massive amount of data, so make sure the heap is set large enough (the -J-Xmx and -J-Xms flags above) and use that ConcMarkSweep flag to take better advantage of multiple cores during GC. It should easily beat any other Ruby impl.

@headius

headius commented May 22, 2011

Assuming our machines aren't too far off in performance, 1m 28s is pretty close to my 17s times 5 (for 15_000_000 rows). I think we could probably beat Python and Perl with some tweaking to the algorithm and to JRuby itself, but at least we're in the same ballpark now.

@headius

headius commented May 22, 2011

FWIW, I tried to make some tweaks to the script to help improve perf more on JITing implementations like JRuby and Rubinius:

  • Run the data-processing method a few times with a smaller number of iterations to allow it to "warm up" in the VM
  • Only pay attention to the final "long" run

This appeared to help JRuby a bit, but not by a significant amount. The JVM generally can optimize running code, and running this at the command line causes it to be compiled into JVM bytecode immediately.
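Roughly, assuming the process method from the sketch above:

# Warm-up sketch: run the method a few times on a small slice so the
# VM compiles and optimizes it, then time only the final long run.
3.times { process(StringIO.new(data), 100_000) }
puts Benchmark.realtime { process(StringIO.new(data), 3_000_000) }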

I was unable to get Rubinius numbers to improve much even with these changes. After about 15 minutes I gave up on it.

EDIT: I guess it wasn't 15 minutes, and it completed right after I posted this comment. Rubinius clocked in at almost exactly 7 minutes for the 3_000_000 run. That's significantly better than the 23 minutes you have above.

This excludes IO time, so it's benchmarking core class and Ruby execution performance.

@skaes

skaes commented May 23, 2011

Xavier, could you provide a link to the data used for the test? I'd like to run some tests with https://github.com/skaes/rvm-patchsets.

My guess is that the performance of the benchmarks is dominated by GC performance, so I'd like to experiment with a few GC tweaks.

@headius

headius commented May 23, 2011

There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).

The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here), which could help all impls.
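For instance, since only the first and last fields are needed, the split could be avoided entirely. A rough, untested sketch of that idea:

# Grab only the first and last fields with index/rindex instead of
# splitting every column, cutting the per-line garbage from one array
# plus a string per field down to two strings.
ancestors = {}
while (line = gets)
  next unless line =~ /\A\d/
  line.chomp!
  first_tab = line.index("\t")
  last_tab  = line.rindex("\t")
  next unless first_tab && last_tab && first_tab < last_tab
  ancestors[line[0...first_tab]] = [line[last_tab + 1..-1]]
end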

@fxn

fxn commented May 23, 2011

@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the Zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in http://developer.yahoo.com/geo/geoplanet/data/.

@fxn

fxn commented May 23, 2011

@joaquinferrero I've finally moved the chomp down. Performance is unaffected either way, because it applies to all lines but one or two, but a reader of the code with no knowledge of the data would see a chomp that might be unnecessary. Chomping after the next line reads more naturally. Thanks!
