I wanted to process the Yahoo! GeoPlanet places CSV, which has about
5.7 million rows, to compute the ancestors of every place and cache
them in an additional column. That's related to a post I am writing
about importing this data into Postgres.
The CSV file has the WOEID of each place in the first field, and the
WOEID of its parent in the last field. I wanted to initialize a hash
that mapped the former to an array containing the latter, to build the
ancestor chain of each place there later.
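For illustration (with made-up WOEIDs, and the other columns of each row elided), two input rows like these:

  222  ...  111
  333  ...  222

would end up in the hash as:

  ancestors == { "222" => ["111"], "333" => ["222"] }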
Since this is in the context of a Rails application running on Ruby 1.8.7,
I tried that first, but performance was horrible:
# Map each WOEID to an array holding its parent's WOEID.
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl
use strict;
my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough):
import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter on a 13'' MacBook Pro from
mid-2009 with an OWC SSD that does about 280 MB/s. The input file is
geoplanet_places_7.6.0.tsv, included in the zip file
http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at
http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+
The "I/O", "split", and "map" columns are simple time splits measured | |
commenting out their corresponding lines. | |
The "I/O" column measures the loop including the regexp match and the | |
chomp in the case of Ruby and Perl. The "split" column measures the | |
cost of splitting the lines itself, and the "map" column measures new | |
array instantiatiation and building the hash, the last line in the | |
loop body. | |
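For example, the "I/O" number for the Ruby version comes from a run with the last two lines of the loop body commented out, something like this (a sketch of the idea rather than the exact scripts that were timed):

ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  # fields = line.split("\t")                 # commented out: this run measures "I/O" only
  # ancestors[fields.first] = [fields.last]   # re-enable just the split to measure "split"
end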
JRuby was run with the --server flag; see also Charles's comments below.
I suspect there's something going on in Rubinius HEAD that skews the
results, since it is generally speaking a fast interpreter.
I went with Perl in the end for this particular problem. Processing
the original CSV, building the complete ancestor chain for each place,
and printing the original CSV with the ancestry cache as an additional
column, takes about 2 minutes (https://gist.github.com/983977).
The table above should not be extrapolated; it is a very specific
comparison, but one that was meaningful for the task at hand.
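For reference, the second pass over the parent map looks roughly like this in Ruby, although the linked gist does the actual work in Perl (the method name and the root handling here are illustrative):

# Walk the parent map upwards, collecting ancestors until there is no
# entry left for the current WOEID. A real implementation also has to
# deal with the root sentinel (Earth's parent) present in the data.
def ancestor_chain(woeid, ancestors)
  chain = []
  while (entry = ancestors[woeid])
    woeid = entry.first
    chain << woeid
  end
  chain
end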
@vijaydev probably not, as those use Array#at, and not Array#[], which is usually worse (it has to check the number of arguments, whether you pass ranges or not, etc., while #at just returns the element at the index).
Is JRuby run with --server here? We should generally beat 1.9.2 on execution and core class performance. I'm downloading the data to give it a try myself.
Is this the geoplanet_places_7.6.0.tsv file? Using the Ruby code above?
FWIW, I got some different numbers...
I modified the Ruby script above as follows:
- Put the body in a method (JRuby optimizes methods better than top-level scripts... just haven't ever bothered to fix that)
- Limited it to 3_000_000 iterations (I don't have enough memory to process the entire file)
- Read the full file contents into memory first and iterated over lines using StringIO
- Timed only the processing (though the read was only a couple seconds)
I also passed some flags to JRuby...
- --server to use optimizing JVM
- -J-XX:+UseConcMarkSweepGC because it seemed to be very GC-heavy (lots of objects!)
- -J-Xmx2000M -J-Xms2000M to lock the heap max and min at 2000M (most I can afford on this machine)
With these settings, JRuby processed those 3_000_000 rows in 17s versus Ruby 1.9.2's 24s.
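A rough sketch of what a harness along those lines might look like (method and variable names are illustrative, not taken from the actual script):

require 'stringio'

# Same per-line work as the original script, wrapped in a method so the
# JVM can optimize it, and capped at a fixed number of rows.
def build_ancestors(io, limit)
  ancestors = {}
  count = 0
  while line = io.gets
    next unless line =~ /\A\d/
    line.chomp!
    fields = line.split("\t")
    ancestors[fields.first] = [fields.last]
    break if (count += 1) == limit
  end
  ancestors
end

data = File.read("geoplanet_places_7.6.0.tsv")  # slurp the whole file up front
io   = StringIO.new(data)
t0   = Time.now
build_ancestors(io, 3_000_000)                  # time only the processing
puts Time.now - t0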
@headius hey, I just tried the --server flag and it made a significant difference: it is down to 1m 28s, going to update the gist now. Yes, that's the file BTW. Will reply via Twitter also so that people know about the update there.
FYI, this is a really massive amount of data, so make sure the heap is set large enough (-J-Xmx and -J-Xms flags above) and use that ConcMarkSweep flag to take better advantage of multiple cores during GC. It should easily beat any other Ruby impl.
Assuming our machines aren't too far off in performance, 1m 28s is pretty close to my 17s times 5 (for 15_000_000 rows). I think we could probably beat Python and Perl with some tweaking to the algorithm and to JRuby itself, but at least we're in the same ballpark now.
FWIW, I tried to make some tweaks to the script to help improve perf more on JITing implementations like JRuby and Rubinius:
- Run the data-processing method a few times with a smaller number of iterations to allow it to "warm up" in the VM
- Only pay attention to the final "long" run
This appeared to help JRuby a bit, but not by a significant amount. The JVM generally can optimize running code, and running this at the command line causes it to be compiled into JVM bytecode immediately.
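A minimal sketch of that warm-up pattern, assuming a build_ancestors method like the one sketched above:

# Warm-up passes over a small slice give the JIT a chance to compile and
# optimize the hot method; only the final full-size run gets reported.
3.times { build_ancestors(StringIO.new(data), 100_000) }

t0 = Time.now
build_ancestors(StringIO.new(data), 3_000_000)
puts Time.now - t0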
I was unable to get Rubinius numbers to improve much even with these changes. After about 15 minutes I gave up on it.
EDIT: I guess it wasn't 15 minutes, and it completed right after I posted this comment. Rubinius clocked in at almost exactly 7 minutes for the 3_000_000 run. That's significantly better than the 23 minutes you have above.
This excludes IO time, so it's benchmarking core class and Ruby execution performance.
Xavier, could you provide a link to the data used for the test? I'd like to run some tests with https://github.com/skaes/rvm-patchsets.
My guess is that the performance of the benchmarks is dominated by GC performance, so I'd like to experiment with a few GC tweaks.
There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).
The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here) which could help all impls.
@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at http://developer.yahoo.com/geo/geoplanet/data/.
@joaquinferrero I've finally moved the chomp down. Performance is unaffected, since the guard only skips one or two lines, but a reader of the code with no knowledge of the data would otherwise see a chomp that might be unnecessary. Chomping after the next line reads more naturally. Thanks!
@fxn Will the first and last methods contribute any significant time while building the hash?