My MARC indexing does the following processing (working with marc4j records):
- normal fields. 29 fields that can be described using nothing but the normal solrmarc syntax. This might be extensive (15-20 tags, lookups in either hashes or sets of regexp matches for transformation) but doesn't require custom code. This generic processor is written in Ruby.
- custom fields. 10 (or 14 if it's a serial) fields that require custom code. These are also all Ruby.
- all fields. The single "get all the text in all the fields numbered 10 to 999" method. Ruby.
- xml. Turn the record into a MARC-XML string. This uses JRuby to get a java.io.StringWriter and a javax.xml.transform.stream result, then calls marc4j's MarcXmlWriter to write to it. (Which, I just looked, isn't exactly how solrmarc does it; I'll have to benchmark it both ways.) Both Ruby and Java; a sketch of this and the allfields method follows this list.
- HLB. Take a record, find all the callnumbers, normalize them, do a lookup against a set of callnumber ranges, and return all the ranges in which each callnumber falls. All in Java (the exact same .jar file used in solrmarc, called from JRuby).
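To make the xml and allfields steps concrete, here's roughly what they look like from JRuby. The marc4j and JDK class names (StringWriter, StreamResult, MarcXmlWriter, getDataFields) are real; the wrapper methods themselves are just a sketch of the approach, not my production code.

```ruby
require 'java'
# Assumes marc4j.jar is already on the classpath

java_import 'java.io.StringWriter'
java_import 'javax.xml.transform.stream.StreamResult'
java_import 'org.marc4j.MarcXmlWriter'

# Serialize a single marc4j Record to a MARC-XML string
def to_xml(record)
  buffer = StringWriter.new
  writer = MarcXmlWriter.new(StreamResult.new(buffer))
  writer.write(record)
  writer.close # flushes and closes out the XML document
  buffer.to_s
end

# allfields: grab the text of every subfield in every data field
# (i.e., everything tagged 010-999; control fields are excluded)
def allfields(record)
  record.get_data_fields.map { |field|
    field.subfields.map(&:data)
  }.flatten.join(' ')
end
```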
The test case is 150,000 records that I just pulled out of a recent dump.
I can't actually run my code with only one thread, due to the way solrj.StreamingUpdateSolrServer (SUSS) works. These numbers represent a single thread reading MARC from a file (Aleph Sequential MARC) and processing it into a Solr doc, plus a second thread whose job is to send stuff from the SUSS queue to Solr itself.
Because of this, the cost of sending the documents to Solr will be masked (the total time it adds will likely drop to near zero) in my base case with two threads.
solrmarc is running with direct-to-disk indexing (not via HTTP). JRuby is using StreamingUpdateSolrServer over HTTP. The indexing process runs on the same machine as the Solr process.
The 8-thread run is 6 for processing and 2 for sending stuff to Solr (i.e., I passed "2" in the number-of-threads slot of the SUSS constructor).
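For reference, constructing a SUSS from JRuby looks something like this. The (url, queue size, thread count) constructor is solrj's actual signature; the URL and queue size here are made-up illustrative values:

```ruby
require 'java'
# Assumes the solrj jars are on the classpath
java_import 'org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer'

url  = 'http://localhost:8983/solr'               # hypothetical; Solr runs on the same box
suss = StreamingUpdateSolrServer.new(url, 100, 2) # queue up to 100 docs; 2 sender threads
```

Anyway, here are the numbers: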
Threads | 1 | 2 | 8 |
---|---|---|---|
solrmarc | 310 | - | - |
jruby | - | 240 | 617 |

Results are in records/second; higher is better.
JRuby with two threads runs here at about 75% of the speed of single-threaded solrmarc (with, again, the caveat that the send-stuff-to-Solr cost is probably almost completely masked).
For the record, when doing a full run, solrmarc generally reports speeds more in the 275 records/second range.
I had an extra call to .flatten.compact that was running on every field as it was derived. I removed it and changed the basic code to use #uniq!, #flatten!, and #compact! instead of their non-bang counterparts.
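In other words, something like this (illustrative, not the actual code):

```ruby
# Before: every call in the chain allocates a new intermediate array
values = values.flatten.compact.uniq

# After: the bang variants mutate the array in place, skipping those
# allocations. Note that they return nil when nothing changed, so
# they can't be chained the way the non-bang versions can.
values.flatten!
values.compact!
values.uniq!
```

With that change: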
Threads | 1 | 2 | 8 |
---|---|---|---|
solrmarc | 310 | - | - |
jruby | - | 312 | 803 |
Now the JRuby code with two threads is on par with the single-threaded solrmarc.
Because the HLB code is (a) all Java, and (b) very expensive, it will tend to mask the differences between the two systems (and, because I'm the only one doing HLB, will make the numbers less valuable to non-me people).
Here's the same run, but without any HLB processing:
Threads | 1 | 2 | 8 |
---|---|---|---|
solrmarc | 384 | - | - |
jruby | - | 254 | 684 |
Here JRuby runs at about 66% of the speed of solrmarc when using just two threads.
It looks like I'm maxing out the JRuby speed in some way -- either running out of threads (we have several Solr processes going on that machine), maxing out how fast the two threads can push stuff to Solr, or maybe even hitting the limit of how fast Solr can ingest the stuff (since I'm making Solr do a fair bit of processing during the indexing phase via pattern filters and such).
I indexed a recent full dump of 6,917,324 records and got an overall pace of 838 records/second, with the run taking just under 2.5 hours. That's about 50K records/minute, or a little over 3 million records/hour.
I ran the 2-thread JRuby version and benchmarked how long each processing step took (see above for what each step entails). The first table is from before the bang-method change described above; the second is from after:
Processing | Seconds | %total |
---|---|---|
Normal fields | 302 | 56.2% |
Custom fields | 73 | 13.6% |
Single "allfields" field | 36 | 6.7% |
HLB | 73 | 13.6% |
to_xml | 53 | 9.9% |
And the same breakdown after the change:

Processing | Seconds | %total |
---|---|---|
Normal fields | 166 | 41.7% |
Custom fields | 72 | 18.1% |
Single "allfields" field | 34 | 8.5% |
HLB | 72 | 18.1% |
to_xml | 54 | 13.6% |
HLB is already pretty damn efficient; I'm not sure how much I could gain there. But the to_xml and the allfields calls are probably ripe for a little optimization.