@billdueber · Created May 5, 2010 13:16

Timing JRuby-based MARC indexing vs solrmarc v2.0

The data and processing

My MARC indexing does the following processing (working with marc4j records):

  • normal fields. 29 fields that can be described using nothing but the normal solrmarc syntax. This might be extensive (15-20 tags, lookups in either hashes or sets of regexp matches for transformation) but doesn't require custom code. This generic processor is written in Ruby.
  • custom fields. 10 (or 14 if it's a serial) fields that require custom code. These are also all Ruby.
  • all fields. The single "get all the text in all the fields numbered 10 to 999" method. Ruby.
  • xml. Turn the record into a MARC-XML string. This uses JRuby to get a java.io.StringWriter and a javax.xml.transform.stream.StreamResult, then calls marc4j's MarcXmlWriter to write to it (see the sketch after this list). Which, I just looked, isn't exactly how solrmarc does it; I'll have to benchmark it both ways. Both Ruby and Java.
  • HLB. Take a record, find all the callnumbers, normalize them, do a lookup against a set of callnumber ranges, and return all the ranges in which each callnumber falls. All in Java (the exact same .jar file used in solrmarc, called from JRuby).
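
Here's roughly what that xml step looks like from JRuby. A minimal sketch, assuming marc4j is already on the classpath and that MarcXmlWriter will take a javax.xml.transform.Result (it also has OutputStream-based constructors); the method name marc_to_xml is mine, not marc4j's:

```ruby
require 'java'
java_import 'java.io.StringWriter'
java_import 'javax.xml.transform.stream.StreamResult'
java_import 'org.marc4j.MarcXmlWriter'

# Turn a marc4j Record into a MARC-XML string by writing through
# a StreamResult wrapped around a StringWriter
def marc_to_xml(record)
  sw     = StringWriter.new
  writer = MarcXmlWriter.new(StreamResult.new(sw))
  writer.write(record)
  writer.close
  sw.to_string
end
```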

The test case is 150,000 records that I just pulled out of a recent dump.

A word on "Single"-threading

I can't actually run my code with only one thread, due to the way solrj.StreamingUpdateSolrServer (SUSS) works. These numbers represent a single thread doing the read-MARC-from-file (Aleph Sequential MARC) and process-into-a-solr-doc work, and a second thread whose job is to send stuff from the SUSS queue to solr itself.

Because of this, the cost of sending the documents to solr will be masked in my two-thread base case (the total time it adds will likely drop to near zero).

Indexing all my fields

solrmarc is running with the direct-to-disk indexing (not via http). JRuby is using StreamingUpdateSolrServer over HTTP. The indexing process runs on the same machine the solr process is running on.

The 8-thread run is 6 for processing and 2 for sending stuff to Solr (i.e., I passed the number "2" in the number-of-threads slot of the SUSS constructor).
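
Constructing the SUSS from JRuby looks something like this. A sketch only; the URL and queue size here are stand-ins, not my actual config:

```ruby
require 'java'
java_import 'org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer'

# args: solr url, size of the internal document queue, and the number
# of background threads draining that queue to solr
suss = StreamingUpdateSolrServer.new('http://localhost:8983/solr', 100, 2)
```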

Original code

| Threads  |   1 |   2 |   8 |
|----------|----:|----:|----:|
| solrmarc | 310 |   - |   - |
| jruby    |   - | 240 | 617 |

Results are in records/second. Higher is better.

JRuby with 2 threads runs at about 75% of the speed of single-threaded solrmarc (with, again, the caveat that the send-stuff-to-solr cost is probably almost completely masked).

For the record, when doing a full run, solrmarc generally reports more in the 275 records/second range.

After optimizing Array methods

I had an extra call to .flatten.compact that was running on every field as it was derived. I removed it and changed the basic code to use #uniq!, #flatten!, and #compact! instead of their non-bang counterparts.
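
The shape of the change, with hypothetical field-derivation code standing in for mine. One gotcha worth noting: the bang versions return nil when they make no change, so you keep using the receiver rather than chaining them:

```ruby
# Before: every call allocates and returns a brand-new array
vals = marc_fields.map { |f| derive(f) }.flatten.compact.uniq

# After: mutate one array in place. Don't chain the bang versions --
# they return nil when nothing changed.
vals = marc_fields.map { |f| derive(f) }
vals.flatten!
vals.compact!
vals.uniq!
```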

| Threads  |   1 |   2 |   8 |
|----------|----:|----:|----:|
| solrmarc | 310 |   - |   - |
| jruby    |   - | 312 | 803 |

Now the JRuby code with two threads is on par with the single-threaded solrmarc.

Removing HLB from the indexing

Because the HLB code is (a) all Java, and (b) very expensive, it will tend to mask the differences between the two systems (and, because I'm the only one doing HLB, will make the numbers less valuable to non-me people).

Here's the same run, but without any HLB processing:

Original code

| Threads  |   1 |   2 |   8 |
|----------|----:|----:|----:|
| solrmarc | 384 |   - |   - |
| jruby    |   - | 254 | 684 |

Here JRuby is running at 66% of the speed of solrmarc when using just two threads.

It looks like I'm maxing out the jruby speed in some way -- either running out of threads (we have several solr processes going on that machine), maxing out how fast the two threads can push stuff to solr, or maybe even hitting the limit of how fast solr can ingest the stuff (since I'm making Solr do a fair bit of processing during the indexing phase via pattern filters and such).

How about for a full run?

I indexed a recent full dump of 6,917,324 records and got an overall pace of 838 records/second, with the run taking just under 2.5 hours. That's about 50K records/minute, or a little over 3 million records/hour.

Where does the time go?

I ran the 2-thread JRuby version and benchmarked how long things took (see above for what each type of processing step entails):
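
The instrumentation is nothing fancy; roughly this, where Benchmark.realtime is stock Ruby and the per-step methods are hypothetical stand-ins for my actual processing code:

```ruby
require 'benchmark'

# accumulate wall-clock seconds for each processing step
timings = Hash.new(0.0)
reader.each do |rec|
  timings['Normal fields'] += Benchmark.realtime { normal_fields(rec) }
  timings['Custom fields'] += Benchmark.realtime { custom_fields(rec) }
  timings['allfields']     += Benchmark.realtime { allfields(rec) }
  timings['HLB']           += Benchmark.realtime { hlb_ranges(rec) }
  timings['to_xml']        += Benchmark.realtime { marc_to_xml(rec) }
end
```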

Original code

| Processing               | Seconds | % total |
|--------------------------|--------:|--------:|
| Normal fields            |     302 |   56.2% |
| Custom fields            |      73 |   13.6% |
| Single "allfields" field |      36 |    6.7% |
| HLB                      |      73 |   13.6% |
| to_xml                   |      53 |    9.9% |

After optimizing Array methods

| Processing               | Seconds | % total |
|--------------------------|--------:|--------:|
| Normal fields            |     166 |   41.2% |
| Custom fields            |      72 |   18.1% |
| Single "allfields" field |      34 |    8.5% |
| HLB                      |      72 |   18.1% |
| to_xml                   |      54 |   13.6% |

HLB is already pretty damn efficient; I'm not sure how much I could gain there. But the to_xml and the allfields calls are probably ripe for a little optimization.
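
For context, allfields is conceptually just the following (a hypothetical sketch against the marc4j API from JRuby; my real version may differ in the details):

```ruby
# Concatenate the text of every data field whose tag is 010-999
# (control fields 001-009 aren't data fields anyway, so the tag
# check is mostly belt-and-suspenders)
def allfields(record)
  record.get_data_fields.map do |df|
    tag = df.get_tag.to_i
    next unless tag >= 10 && tag <= 999
    df.get_subfields.map { |sf| sf.get_data }.join(' ')
  end.compact.join(' ')
end
```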
