I wanted to process the Yahoo! GeoPlanet places CSV, which has about 5.7 million rows, to compute the ancestors of every place and cache them in an additional column. That's related to a post I am writing about importing this data into Postgres.

The CSV file has the WOEID of each place in the first field, and the WOEID of its parent in the last field. I wanted to initialize a hash that mapped the former to an array containing the latter, to build the ancestor chain of each place there later.

Since this is in the context of a Rails application run by Ruby 1.8.7, I tried that first, but performance was horrible:
ancestors = {}
while line = gets
  # skip any line that does not start with a digit (a WOEID)
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl
use strict;

my %ancestors = ();
while (my $line = <>) {
  next unless $line =~ /^\d/;
  chomp $line;
  my @fields = split /\t/, $line;
  $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough):
import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter. This is a MacBook Pro 13'' from mid-2009 with an OWC SSD that does about 280 MB/s. The input file is geoplanet_places_7.6.0.tsv, included in the zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+
| The "I/O", "split", and "map" columns are simple time splits measured | |
| commenting out their corresponding lines. | |
| The "I/O" column measures the loop including the regexp match and the | |
| chomp in the case of Ruby and Perl. The "split" column measures the | |
| cost of splitting the lines itself, and the "map" column measures new | |
| array instantiatiation and building the hash, the last line in the | |
| loop body. | |
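For example, the Ruby run behind the "I/O" column would be the script above with the last two lines of the loop body commented out (a sketch of the methodology, not the exact script used for the table):

# "I/O" variant: read, filter, and chomp only; the split and the hash
# assignment are commented out so that their cost is excluded.
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  # fields = line.split("\t")
  # ancestors[fields.first] = [fields.last]
end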
JRuby was run with the --server flag, see also Charles's comments below.

I suspect there's something going on in Rubinius HEAD that skews the results, since it is a fast interpreter generally speaking.
I went with Perl in the end for this particular problem. Processing the original CSV, building the complete ancestor chain for each place, and printing the original CSV with the ancestry cache as an additional column takes about 2 minutes (https://gist.github.com/983977).
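In Ruby, the second pass that builds each chain from the hash could look roughly like this (a sketch only, assuming a place's chain is obtained by walking parent WOEIDs until one is not a key of the hash; the script actually used is the Perl one linked above):

# Given the hash built above (WOEID => [parent WOEID]), walk up the
# parents of a place. Illustrative only, not the script linked above.
def ancestor_chain(woeid, ancestors)
  chain = []
  current = woeid
  while ancestors.key?(current)
    parent = ancestors[current].first
    break if chain.include?(parent) # guard against accidental cycles
    chain << parent
    current = parent
  end
  chain
end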
The table above should not be extrapolated. It is a very concrete comparison, but one that was meaningful for the task at hand.
Assuming our machines aren't too far off in performance, 1m 28s is pretty close to my 17s times 5 (for 15_000_000 rows). I think we could probably beat Python and Perl with some tweaking to the algorithm and to JRuby itself, but at least we're in the same ballpark now.
FWIW, I tried to make some tweaks to the script to help improve perf more on JITing implementations like JRuby and Rubinius:
- Run the data-processing method a few times with a smaller number of iterations to allow it to "warm up" in the VM
- Only pay attention to the final "long" run
This appeared to help JRuby a bit, but not by a significant amount. The JVM generally can optimize running code, and running this at the command line causes it to be compiled into JVM bytecode immediately.
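A minimal harness along those lines might look like this (a sketch, assuming the per-row work has been extracted into a process_lines method; these are not the exact tweaks described above):

# Hypothetical harness: warm the JIT up with a few short runs over a
# slice of the data, then time only the final full-length run.
def process_lines(lines)
  ancestors = {}
  lines.each do |line|
    next unless line =~ /\A\d/
    fields = line.chomp.split("\t")
    ancestors[fields.first] = [fields.last]
  end
  ancestors
end

lines = File.readlines(ARGV[0])

3.times { process_lines(lines[0, 100_000]) } # warm-up runs, discarded

started = Time.now
process_lines(lines)                         # the "long" run that gets reported
puts "#{Time.now - started}s"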
I was unable to get Rubinius numbers to improve much even with these changes. After about 15 minutes I gave up on it.
EDIT: I guess it wasn't 15 minutes, and it completed right after I posted this comment. Rubinius clocked in at almost exactly 7 minutes for the 3_000_000 run. That's significantly better than the 23 minutes you have above.
This excludes IO time, so it's benchmarking core class and Ruby execution performance.
Xavier, could you provide a link to the data used for the test? I'd like to run some tests with https://github.com/skaes/rvm-patchsets.
My guess is that the performance of the benchmarks is dominated by GC performance, so I'd like to experiment with a few GC tweaks.
There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).
The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here), which could help all impls.
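For example (a sketch of that idea, not something benchmarked here), the full split can be avoided by slicing out just the first and last fields, which cuts the number of intermediate strings and arrays created per row:

# Allocation-lighter variant of the loop body: extract only the first
# and last tab-separated fields instead of splitting the whole line.
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  first_tab = line.index("\t")
  last_tab  = line.rindex("\t")
  next unless first_tab && last_tab
  ancestors[line[0, first_tab]] = [line[last_tab + 1..-1]]
end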
@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the Zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in http://developer.yahoo.com/geo/geoplanet/data/.
@joaquinferrero I've finally moved the chomp down. Performance is unaffected, since all lines but one or two pass the filter anyway, but a reader of the code with no knowledge of the data would otherwise see a chomp that might be unnecessary. Chomping after the next line reads more naturally. Thanks!
FYI, this is a really massive amount of data, so make sure the heap is set large enough (the -J-Xmx and -J-Xms flags above) and use that ConcMarkSweep flag to take better advantage of multiple cores during GC. It should easily beat any other Ruby impl.
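An invocation along those lines might look like the following (heap sizes are illustrative guesses, and ancestors.rb is a placeholder name for the benchmark script):

# Example only: adjust the heap sizes to the machine at hand.
jruby --server -J-Xms2g -J-Xmx4g -J-XX:+UseConcMarkSweepGC ancestors.rb geoplanet_places_7.6.0.tsv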