@fxn
Created May 22, 2011 16:06
Runtime comparison of a simple problem solved in Ruby, Python, and Perl
I wanted to process the Yahoo! GeoPlanet places CSV, which has about
5.7 million rows, to compute the ancestors of every place and cache
them in an additional column. That's related to a post I am writing
about importing this data into Postgres.
The CSV file has the WOEID of each place in the first field, and the
WOEID of its parent in the last field. I wanted to initialize a hash
that mapped the former to an array containing the latter, to later
build the ancestor chain of each place.
Since this is in the context of a Rails application running on Ruby
1.8.7 I tried that first, but performance was horrible:
ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!
  fields = line.split("\t")
  ancestors[fields.first] = [fields.last]
end
Out of curiosity I wrote solutions in Perl
use strict;
my %ancestors = ();
while (my $line = <>) {
    next unless $line =~ /^\d/;
    chomp $line;
    my @fields = split /\t/, $line;
    $ancestors{$fields[0]} = [$fields[-1]];
}
and Python (hope it is idiomatic enough):
import fileinput
import re

ancestors = {}
for line in fileinput.input():
    if re.match(r'\d', line):
        fields = line.rstrip("\n").split("\t")
        ancestors[fields[0]] = [fields[-1]]
I did several runs with each interpreter on a 13'' MacBook Pro from
mid-2009 with an OWC SSD that does about 280 MB/s. The input file is
geoplanet_places_7.6.0.tsv, included in the zip file
http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available at
http://developer.yahoo.com/geo/geoplanet/data/. These are the numbers:
+----------+----------------+--------+--------+---------+---------+
| Language | Interpreter    | I/O    | split  | map     | TOTAL   |
+----------+----------------+--------+--------+---------+---------+
| Ruby     | Rubinius HEAD  | 1m 7s  | 1m 42s | 20m 19s | 23m 8s  |
| Ruby     | MRI 1.8.7-p334 | 8s     | 34s    | 7m 8s   | 7m 50s  |
| Ruby     | MRI 1.9.2-p180 | 8s     | 12s    | 1m 54s  | 2m 14s  |
| Ruby     | JRuby 1.6.1    | 9s     | 7s     | 1m 8s   | 1m 28s  |
| Python   | CPython 2.7.1  | 23s    | 12s    | 25s     | 1m 0s   |
| Perl     | Perl 5.12.3    | 4s     | 25s    | 25s     | 54s     |
+----------+----------------+--------+--------+---------+---------+
The "I/O", "split", and "map" columns are simple time splits measured
commenting out their corresponding lines.
The "I/O" column measures the loop including the regexp match and the
chomp in the case of Ruby and Perl. The "split" column measures the
cost of splitting the lines itself, and the "map" column measures new
array instantiatiation and building the hash, the last line in the
loop body.
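To make that concrete, here is the Ruby loop again with the lines each
column times marked (a sketch of the methodology, not a separate script):

ancestors = {}
while line = gets
  next unless line =~ /\A\d/
  line.chomp!                              # up to here: the "I/O" column
  fields = line.split("\t")                # plus this line: "split"
  ancestors[fields.first] = [fields.last]  # plus this line: "map"
end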
JRuby was run with the --server flag; see also Charles's comments below.
I suspect there's something going on in Rubinius HEAD that skews the
results, since it is a fast interpreter generally speaking.
I went with Perl in the end for this particular problem. Processing
the original CSV, building the complete ancestor chain for each place,
and printing the original CSV with the ancestry cache as an additional
column takes about 2 minutes (https://gist.github.com/983977).
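The chain-building step itself is conceptually simple once the hash is
in memory. A rough Ruby sketch of the idea (the actual solution linked
above is written in Perl, and the method name here is illustrative):

# Walks parent pointers upwards, assuming ancestors maps each WOEID to
# a one-element array holding its parent's WOEID, as in the loop above.
# The places form a tree, so the walk terminates at a WOEID that has
# no entry in the hash.
def ancestor_chain(woeid, ancestors)
  chain = []
  while (parents = ancestors[woeid])
    woeid = parents.first
    chain << woeid
  end
  chain
end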
The table above should not be extrapolated; it is a very specific
comparison, but one that was meaningful for the task at hand.
@vijaydev

@fxn Will the first and last methods contribute any significant time while building the hash?

@foca

foca commented May 22, 2011

@vijaydev Probably not, as those use Array#at rather than Array#[], which is usually slower (it has to check the number of arguments, whether you passed a range, etc., while #at just returns the element at the index).
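For what it's worth, that kind of hypothesis is easy to check with the standard Benchmark library; a quick sketch (labels and iteration count are made up, and the numbers only show relative cost per interpreter):

require 'benchmark'

fields = %w[a b c d e f]
n = 5_000_000
Benchmark.bm(10) do |bm|
  bm.report('first/last') { n.times { fields.first; fields.last } }
  bm.report('at')         { n.times { fields.at(0); fields.at(-1) } }
  bm.report('[]')         { n.times { fields[0]; fields[-1] } }
end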

@fxn

fxn commented May 22, 2011

@vijaydev I've tried it; it didn't make a difference. I also tried compiling the regexp in Python with re.compile (suggested by @njr on Twitter), but no measurable difference either; time seems to be dominated by other stuff.

@headius

headius commented May 22, 2011

Is JRuby run with --server here? We should generally beat 1.9.2 on execution and core class performance. I'm downloading the data to give it a try myself.

@headius

headius commented May 22, 2011

Is this the geoplanet_places_7.6.0.tsv file? Using the Ruby code above?

@headius

headius commented May 22, 2011

FWIW, I got some different numbers...

I modified the Ruby script above as follows:

  • Put the body in a method (JRuby optimizes methods better than top-level script...just haven't ever bothered to fix that)
  • Limited it to 3_000_000 iterations (I don't have enough memory to process the entire file)
  • Read the full file contents into memory first and iterated over lines using StringIO
  • Timed only the processing (though the read was only a couple seconds)

I also passed some flags to JRuby...

  • --server to use optimizing JVM
  • -J-XX:+UseConcMarkSweepGC because it seemed to be very GC-heavy (lots of objects!)
  • -J-Xmx2000M -J-Xms2000M to lock the heap max and min at 2000M (most I can afford on this machine)

With these settings, JRuby processed those 3_000_000 rows in 17s versus Ruby 1.9.2's 24s.
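A rough sketch of that harness (the method name and structure are illustrative, not the exact script I ran):

require 'stringio'
require 'benchmark'

# Slurp the file first, then time only the processing of the first
# 3_000_000 rows from an in-memory StringIO. Run with something like:
# jruby --server -J-XX:+UseConcMarkSweepGC -J-Xmx2000M -J-Xms2000M bench.rb FILE
def process(io, limit)
  ancestors = {}
  count = 0
  while (line = io.gets)
    next unless line =~ /\A\d/
    line.chomp!
    fields = line.split("\t")
    ancestors[fields.first] = [fields.last]
    break if (count += 1) == limit
  end
  ancestors
end

data = File.read(ARGV[0])  # the read itself is excluded from the timing
puts Benchmark.realtime { process(StringIO.new(data), 3_000_000) }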

@fxn

fxn commented May 22, 2011

@headius hey, I just tried the --server flag and it made a significant difference: it is down to 1m 28s, going to update the gist now. Yes, that's the file BTW. Will reply via Twitter as well so that people know about the update there.

@headius

headius commented May 22, 2011

FYI, this is a really massive amount of data, so make sure the heap is set large enough (the -J-Xmx and -J-Xms flags above) and use that ConcMarkSweep flag to take better advantage of multiple cores during GC. It should easily beat any other Ruby impl.

@headius

headius commented May 22, 2011

Assuming our machines aren't too far off in performance, 1m 28s is pretty close to my 17s times 5 (for 15_000_000 rows). I think we could probably beat Python and Perl with some tweaking to the algorithm and to JRuby itself, but at least we're in the same ballpark now.

@headius

headius commented May 22, 2011

FWIW, I tried to make some tweaks to the script to help improve perf more on JITing implementations like JRuby and Rubinius:

  • Run the data-processing method a few times with a smaller number of iterations to allow it to "warm up" in the VM
  • Only pay attention to the final "long" run

This appeared to help JRuby a bit, but not by a significant amount. The JVM generally can optimize running code, and running this at the command line causes it to be compiled into JVM bytecode immediately.
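Roughly, assuming the process method from the sketch above:

# Warm-up sketch: run the method a few times on a small slice so the
# VM compiles and optimizes it, then time only the final long run.
3.times { process(StringIO.new(data), 100_000) }
puts Benchmark.realtime { process(StringIO.new(data), 3_000_000) }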

I was unable to get Rubinius numbers to improve much even with these changes. After about 15 minutes I gave up on it.

EDIT: I guess it wasn't 15 minutes, and it completed right after I posted this comment. Rubinius clocked in at almost exactly 7 minutes for the 3_000_000 run. That's significantly better than the 23 minutes you have above.

This excludes IO time, so it's benchmarking core class and Ruby execution performance.

@skaes

skaes commented May 23, 2011

Xavier, could you provide a link to the data used for the test? I'd like to run some tests with https://github.com/skaes/rvm-patchsets.

My guess is that the performance of the benchmarks is dominated by GC performance, so I'd like to experiment with a few GC tweaks.

@headius

headius commented May 23, 2011

There's definitely a heavy GC component here, so any help in that area could improve results for both 1.8.7 and 1.9. Even on JRuby a substantial heap and some GC tweaking were necessary to get solid perf (for me).

The benchmark/algo could probably also be improved to avoid such high object counts (split is a big part of it here), which could help all impls.
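For instance, since only the first and last fields are needed, the split could be avoided entirely. A rough, untested sketch of that idea:

# Grab only the first and last fields with index/rindex instead of
# splitting every column, cutting the per-line garbage from one array
# plus a string per field down to two strings.
ancestors = {}
while (line = gets)
  next unless line =~ /\A\d/
  line.chomp!
  first_tab = line.index("\t")
  last_tab  = line.rindex("\t")
  next unless first_tab && last_tab && first_tab < last_tab
  ancestors[line[0...first_tab]] = [line[last_tab + 1..-1]]
end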

@fxn

fxn commented May 23, 2011

@skaes hey :), the input file is geoplanet_places_7.6.0.tsv, included in the Zip file http://ydn.zenfs.com/site/geo/geoplanet_data_7.6.0.zip available in http://developer.yahoo.com/geo/geoplanet/data/.

@fxn

fxn commented May 23, 2011

@joaquinferrero I've finally moved the chomp down. Performance is unaffected either way, because it applies to all lines but one or two, but a reader of the code with no knowledge of the data would see a chomp that might be unnecessary. Chomping after the next line reads more naturally. Thanks!
