I wanted to process the Yahoo! GeoPlanet places CSV, which has about
5.7 million rows, to compute the ancestors of every place and cache
them in an additional column. That's related to a post I am writing
about importing this data into Postgres.

The CSV file has the WOEID of each place in the first field, and the
WOEID of its parent in the last field. I wanted to initialize a hash
that mapped the former to an array containing the latter, to later
build the ancestor chain of each place there.

Since this is in the context of a Rails application run by Ruby 1.8.7,
I tried that first, but performance was horrible:
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl | |
use strict;

my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough): | |
import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter. This is a MacBook Pro 13''
from mid-2009 with an OWC SSD that does about 280 MB/s. The input
file is geoplanet_places_7.6.0.tsv, included in the zip file
http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at
http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+
The "I/O", "split", and "map" columns are simple time splits measured | |
commenting out their corresponding lines. | |
The "I/O" column measures the loop including the regexp match and the | |
chomp in the case of Ruby and Perl. The "split" column measures the | |
cost of splitting the lines itself, and the "map" column measures new | |
array instantiatiation and building the hash, the last line in the | |
loop body. | |
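For reference, the "I/O" variant for Ruby is just the loop above
stripped down like this (a sketch of the methodology, not the exact
script used):

# "I/O" variant: keep the read, the regexp match, and the chomp, and
# comment out the split and the hash assignment, so only reading is timed.
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  # fields = line.split("\t")
  # ancestors[fields.first] = [fields.last]
end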
JRuby was run with the --server flag; see also Charles's comments below.
I suspect there's something going on in Rubinius HEAD that skews the | |
results, since it is a fast interpreter generally speaking. | |
I went with Perl in the end for this particular problem. Processing
the original CSV, building the complete ancestor chain for each place,
and printing the original CSV with the ancestry cache as an additional
column takes about 2 minutes (https://gist.github.com/983977).
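The chain-building step itself is straightforward once the hash is in
place. In Ruby it would look something like this (a hypothetical helper,
names are mine; the actual script linked above is Perl):

# Walk parent pointers upwards to collect a place's ancestors.
# Assumes every chain eventually reaches a WOEID that has no row of
# its own (the root's parent), so the loop terminates.
def ancestor_chain(woeid, ancestors)
  chain = []
  current = woeid
  while ancestors.key?(current)
    parent = ancestors[current].first
    chain << parent
    current = parent
  end
  chain
end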
The table above should not be extrapolated; it is a very concrete
comparison, but one that was meaningful for the task at hand.
Assuming our machines aren't too far off in performance, 1m 28s is pretty close to my 17s times 5 (for 15_000_000 rows). I think we could probably beat Python and Perl with some tweaking to the algorithm and to JRuby itself, but at least we're in the same ballpark now.
FWIW, I tried to make some tweaks to the script to help improve perf more on JITing implementations like JRuby and Rubinius:
- Run the data-processing method a few times with a smaller number of iterations to allow it to "warm up" in the VM
- Only pay attention to the final "long" run
This appeared to help JRuby a bit, but not by a significant amount. The JVM generally can optimize running code, and running this at the command line causes it to be compiled into JVM bytecode immediately.
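The harness had roughly this shape (a sketch of the idea, not the exact script; file names are placeholders):

# Extract the work into a method so the VM can JIT it, run it a few
# times over a smaller sample to warm up, then time only the long run.
def process(io)
  ancestors = {}
  io.each_line do |line|
    next unless line =~ /\A\d/
    line.chomp!
    fields = line.split("\t")
    ancestors[fields.first] = [fields.last]
  end
  ancestors
end

3.times { File.open("sample.tsv") { |f| process(f) } } # warm-up passes

started = Time.now
File.open("geoplanet_places_7.6.0.tsv") { |f| process(f) }
puts "long run: #{Time.now - started}s"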
I was unable to get Rubinius numbers to improve much even with these changes. After about 15 minutes I gave up on it.
EDIT: I guess it wasn't 15 minutes, and it completed right after I posted this comment. Rubinius clocked in at almost exactly 7 minutes for the 3_000_000 run. That's significantly better than the 23 minutes you have above.
This excludes IO time, so it's benchmarking core class and Ruby execution performance.
Xavier, could you provide a link to the data used for the test? I'd like to run some tests with https://github.com/skaes/rvm-patchsets.
My guess is that the performance of the benchmarks is dominated by GC performance, so I'd like to experiment with a few GC tweaks.
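For the record, the GC knobs in GC-patched MRIs like those are environment variables, so a tweaked run would look something like this (the values are only illustrative):

RUBY_HEAP_MIN_SLOTS=2500000 RUBY_GC_MALLOC_LIMIT=50000000 ruby bench.rb geoplanet_places_7.6.0.tsv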
There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).
The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here), which could help all impls.
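One shape such a tweak could take (my sketch, untested against these numbers): pull out only the first and last fields with index/rindex, so each line allocates two substrings instead of a full array of fields:

# Avoids split: only the two substrings we need are allocated per line.
# Assumes every data line contains at least one tab.
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  first_tab = line.index("\t")
  last_tab  = line.rindex("\t")
  ancestors[line[0, first_tab]] = [line[(last_tab + 1)..-1]]
end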
@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at http://developer.yahoo.com/geo/geoplanet/data/.
@joaquinferrero I've finally moved the chomp down. Performance is unaffected either way, because the guard lets through all lines but one or two, but with the chomp above the next a reader of the code with no knowledge of the data would see a chomp that might be unnecessary. Chomping after the next line is more natural. Thanks!
FYI, this is a really massive amount of data, so make sure the heap is set large enough (the -J-Xmx and -J-Xms flags above) and use that ConcMarkSweep flag to take better advantage of multiple cores during GC. It should easily beat any other Ruby impl.
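Concretely, an invocation along these lines (the 2 GB heap is just a placeholder, size it to the data):

jruby --server -J-Xms2g -J-Xmx2g -J-XX:+UseConcMarkSweepGC bench.rb geoplanet_places_7.6.0.tsv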