Bill Dueber billdueber

Timing JRuby-based MARC indexing vs solrmarc v2.0

My MARC indexing does the following processing (working with marc4j records):

normal fields. 29 fields that can be described using nothing but the normal solrmarc syntax. This might be extensive (15-20 tags, lookups in either hashes or sets of regexp matches for transformation) but doesn't require custom code. This generic processor is written in Ruby.
custom fields. 10 (or 14 if it's a serial) fields that require custom code. These are also all Ruby.
all fields. The single "get all the text in all the fields numbered 10 to 999" method. Ruby.
xml. Turn the record into a MARC-XML string. This uses JRuby to get a java.io.StringWriter and javax.xml.transform.stream and then calls marc4j's MarcXmlWriter to write to it. That, I just looked, isn't exactly how solrmarc does it, so I'll have to benchmark it both ways. Both Ruby and Java.
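
As a rough illustration of the "all fields" step, here is a minimal plain-Ruby sketch. It does not use marc4j: `Subfield`, `Field`, and `all_text` are hypothetical stand-ins invented for this example, and the real method works on marc4j records.

```ruby
# Hypothetical stand-ins for MARC data fields; this is not the marc4j API.
Subfield = Struct.new(:code, :value)
Field    = Struct.new(:tag, :subfields)

# Join the subfield values of every field whose numeric tag falls in 10..999.
def all_text(fields, range = 10..999)
  fields
    .select { |f| f.tag =~ /\A\d{3}\z/ && range.cover?(f.tag.to_i) }
    .flat_map { |f| f.subfields.map(&:value) }
    .join(' ')
end

fields = [
  Field.new('008', [Subfield.new('a', 'fixed data')]),
  Field.new('245', [Subfield.new('a', 'Title'), Subfield.new('b', 'subtitle')]),
  Field.new('650', [Subfield.new('a', 'Topic')])
]

puts all_text(fields)
```

The 008 field is skipped because its tag is below 10; the 245 and 650 values are concatenated in record order.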

	use strict;
	use JSON;
	use Data::Dumper;

	my $file = 'gistfile1.txt';
	open(INFILE, $file) || die "Can't open '$file': $!";
	my $i = 0;
	my %elements;

	# Example of pushing stuff to solr with solrj in jruby

	require 'rubygems'

	# Load any .jar files you want with "require '../path/to/jarfile.jar'"
	# For both marc4j4r and jruby_streaming_update_solr_server, if you load the
	# appropriate jar first that's the version that will be used. If not,
	# we fall back on the one shipped with the gem

	# require '../jars/myjavacode.jar'

	Just a quick place to put this that's better than IRC:

	Suppose mm="<2 -1"

	As implemented now, the search
	dog cat =>
	dog AND cat

	Likewise
	dog cat -mouse =>

	$:.unshift 'lib'
	require 'marc'
	require 'benchmark'
	require 'profiler'

	tags = ['001','005', '100','110','111','240','243','245', /^6[0-9][0-9]$/, '700', '710', '711']


	rec = MARC::Reader.new('batch.dat').first

	# The current version. Using self.index(field) makes this O(n^2)!

	# Rebuild the HashWithChecksumAttribute with the current
	# values of the fields Array
	def reindex
	@tags = {}
	self.each do |field|
	@tags[field.tag] ||= []
	@tags[field.tag] << self.index(field) ##### AAAAAAAHHHHHHHH ####
	end
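
The quadratic cost comes from calling `index(field)`, a linear scan, once per field. A minimal sketch of a linear-time alternative follows, assuming a hypothetical `FieldList` array subclass and `Field` struct rather than ruby-marc's actual classes. It uses `each_with_index` so each position comes for free during the single pass.

```ruby
# Hypothetical stand-ins; these are not ruby-marc's FieldMap or field classes.
Field = Struct.new(:tag, :value)

class FieldList < Array
  attr_reader :tags

  # Build tag => [positions] in a single pass: O(n) instead of O(n^2).
  def reindex
    @tags = {}
    each_with_index do |field, i|
      (@tags[field.tag] ||= []) << i
    end
    @tags
  end
end

list = FieldList.new
list << Field.new('245', 'Title') << Field.new('650', 'A') << Field.new('650', 'B')
p list.reindex
```

Because `Array#index` returns the position of the first element that compares equal with `==`, the single-pass version also records the right positions when two fields compare equal.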

	module MARC

	# Simply what the class name says.
	# The checksum is used to see if the FieldMap's array has changed.
	class HashWithChecksumAttribute < Hash
	attr_accessor :checksum
	end

	# The FieldMap is an Array of DataFields and Controlfields.
	# It also contains a HashWithChecksumAttribute with a Hash-based

	# Code to benchmark various serializations of MARC records using ruby-marc
	# Not included is XML -- serialization using ruby-marc is ridiculously slow and the
	# filesizes are bigger than anything else. Even with the lib-xml reader,
	# deserialization is also relatively slow
	#
	# I didn't bother to benchmark json/pure in later runs because it's just so damn
	# slow that it would never be a good choice.
	#
	# My results can be found at http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/

	require 'marc'

	require 'rubygems'
	require 'rdf'
	require 'threach'


	(1..10).threach(3) do |c|
	u = RDF::URI.new("http://example.org/#{c}/"); puts u.to_s
	end

	require 'rubygems'
	require 'marc'
	require 'yajl'
	require 'benchmark'

	iterations = 5

	xmlsourcefile = 'topics.xml' # 18k records as a MARC-XML collection
	jsonsourcefile = 'topics.ndj' # Same records as newline-delimited marc-in-json