Bill Dueber billdueber

I upgraded my Ruby 1.8 to the latest patchlevel and all of a sudden ruby-marc was super-slow. I saw the same slowdown on 1.9 and in JRuby, so I investigated.

There's a marc.bytes.to_a call inside the loop in Reader#decode. All the fix does is move it outside the loop so it only happens once.
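To illustrate the kind of change involved, here is a minimal sketch. It is not the actual ruby-marc code: `RECORD`, `OFFSETS`, `slow_slices` and `fast_slices` are made-up names, and the sample data only stands in for a raw MARC record. It contrasts converting a string to a byte array on every iteration with doing the conversion once before the loop.

```ruby
require 'benchmark'

# Hypothetical stand-in for a raw record: a 20,000-byte string.
RECORD  = ('0123456789' * 2_000).freeze
# Hypothetical field offsets: every 100th byte position.
OFFSETS = (0...RECORD.bytesize).step(100).to_a

# Rebuilds the byte array on every pass through the loop.
def slow_slices(data, offsets)
  offsets.map do |off|
    bytes = data.bytes.to_a   # identical work repeated each iteration
    bytes[off, 10]
  end
end

# Builds the byte array once, outside the loop, and reuses it.
def fast_slices(data, offsets)
  bytes = data.bytes.to_a     # done a single time
  offsets.map { |off| bytes[off, 10] }
end

Benchmark.bm(8) do |bm|
  bm.report('in loop') { slow_slices(RECORD, OFFSETS) }
  bm.report('hoisted') { fast_slices(RECORD, OFFSETS) }
end
```

Both methods return the same slices; only the amount of repeated work differs. The size of the gap depends on record length and the number of iterations, so the timings from this toy example should not be read as a measurement of ruby-marc itself.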

You can see the patch in the "slowreadfix" branch at https://github.com/ruby-marc/ruby-marc/commit/beba83745ebe0848218496e967edd65d632fb01e

As you can see, the speedup is about a factor of five.

The test case reads a MARC21 file with about 18K records in it.

	require 'benchmark'
	bigfiles = %w[ddd1 ddd2 ddd3] # all copies of the same 6MB file

	puts RUBY_DESCRIPTION
	puts "Bigfile size is ", File.size('ddd1')
	Benchmark.bmbm do |bm|
	  bm.report("straight read") { bigfiles.each { |bigfile| File.read(bigfile) } }
	  bm.report("read w/ size") { bigfiles.each { |bigfile| File.read(bigfile, File.size(bigfile)) } }
	end

	# Packing a bunch of ids into an encrypted string for use in a URL -- just some experiments.
	# Driven by https://bibwild.wordpress.com/2013/01/07/crazy-use-of-encryption-to-protect-refworks-callback-urls/

	# A quick experiment to look at how many ids of various types we can cram into an
	# encrypted string. Obviously we could be more particular about how we pack them in
	# if we know the characteristics of the IDs ahead of time.
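As a rough sketch of the idea described in the comments above (not the code from this experiment), the following packs a list of ids by serializing to JSON, compressing with zlib, encrypting, and encoding the result for use in a URL. The helper names `pack_ids` and `unpack_ids`, the choice of AES-256-CBC, and the sample ids are all assumptions made for illustration. The sketch does not authenticate the ciphertext, which a real callback URL would probably want.

```ruby
require 'json'
require 'zlib'
require 'openssl'

# Hypothetical helper: ids -> JSON -> deflate -> AES-256-CBC -> URL-safe base64.
def pack_ids(ids, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv                     # fresh 16-byte IV per token
  compressed = Zlib::Deflate.deflate(JSON.generate(ids))
  encrypted  = cipher.update(compressed) + cipher.final
  # 'm0' is base64 without line breaks; swap '+' and '/' for URL-safe characters.
  [iv + encrypted].pack('m0').tr('+/', '-_')
end

# Hypothetical helper: reverses pack_ids.
def unpack_ids(token, key)
  raw = token.tr('-_', '+/').unpack('m0').first
  decipher = OpenSSL::Cipher.new('aes-256-cbc')
  decipher.decrypt
  decipher.key = key
  decipher.iv  = raw[0, 16]                 # the IV was prepended by pack_ids
  compressed = decipher.update(raw[16..-1]) + decipher.final
  JSON.parse(Zlib::Inflate.inflate(compressed))
end

key   = OpenSSL::Random.random_bytes(32)    # 256-bit key, generated here for the demo
ids   = (1..50).map { |i| format('%09d', 40 + i) }  # made-up nine-digit ids
token = pack_ids(ids, key)
puts "#{ids.size} ids -> #{token.length} characters"
```

Compressing before encrypting matters here because encrypted data does not compress. How many ids fit under a given URL length limit depends heavily on how regular the ids are.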


	require 'stringio'
	require 'zlib'

	require 'benchmark'
	require 'json'

	puts RUBY_DESCRIPTION

	# This mess is a json representation of a MARC record (format used in libraries and museums)

	m = %Q[{"leader":"01470nam^a22004451^^4500","fields":[{"001":"000000040"},{"005":"19880715000000.0"},{"006":"m^^^^^^^^d^^^^^^^^"},{"007":"cr^bn^---auaua"},{"008":"880715s1968^^^^nyuae^^^^b^^^|00100^eng^^"},{"010":{"ind1":" ","ind2":" ","subfields":[{"a":"68027371"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(RLIN)MIUG0001728-B"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(CaOTULAS)159818044"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(OCoLC)ocm00001728"}]}},{"040":{"ind1":" ","ind2":" ","subfields":[{"a":"DLC"},{"c":"DLC"},{"d":"MiU"},{"d":"CStRLIN"},{"d":"MiU"}]}},{"050":{"ind1":"0","ind2":" ","subfields":[{"a":"N6350"},{"b":".P4 1968b"}]}},{"082":{"ind1":" ","ind2":" ","subfields":[{"a":"709.03"}]}},{"100":{"ind1":"1","ind2":" ","subfields":[{"a":"Pevsner, Nikolaus,"},{"d":"1902-1983."}]}},{"245":{"ind1":"1","ind2":"0","subfields

	jruby 1.6.7.2 (ruby-1.9.2-p312) (2012-05-01 26e08ba) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
	      user     system      total        real
	  4.650000   0.000000   4.650000 (  4.650000)

	jruby 1.7.0.preview1 (ruby-1.9.3-p203) (2012-05-19 00c8c98) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
	      user     system      total        real
	 12.090000   0.280000  12.370000 (  5.778000)

	load '/path/to/itextpdf-5.2.0.jar'

	IText = Java::com.itextpdf.text # save some typing

	doc = IText::Document.new(IText::PageSize::LETTER, 50, 50, 50, 50)

	# addAuthor and addSubject seem to not work, at least for viewing in Preview
	doc.addAuthor "Bill Dueber"
	doc.addSubject "Why are we requiring PDF???"

	Testing on an 18K file in both marc21 and marc-xml. Loop looks like:

	reader = MARC::Reader.new(m21file) # or whatever appropriate reader
	reader.each do |r|
	  t = r['245']['a']
	end

	MARC version is the just-released 0.4.4

	The following numbers are for a run with just enough compatibility to run the above code.

	<fieldtype name="numericID" class="solr.TextField"
	           positionIncrementGap="1000" omitNorms="true">
	  <analyzer>
	    <tokenizer class="solr.KeywordTokenizerFactory"/>
	    <filter class="solr.PatternReplaceFilterFactory"
	            pattern="^.*?(\p{N}[\p{N}\-\.]{6,}[xX]?).*$"
	            replacement="***$1" />
	    <filter class="solr.PatternReplaceFilterFactory"
	            pattern="^[^\*].*$" replacement="" />
	    <filter class="solr.PatternReplaceFilterFactory"

	require 'parslet'
	require 'pp'

	class AdvParser < Parslet::Parser
	  rule(:space) { match['\\s\\t'].repeat(1) } # at least one space/tab
	  rule(:space?) { space.maybe } # zero or one occurrence of the 'space' rule

	  rule(:startexpr) { str('(') >> space? } # '(' followed by optional space
	  rule(:endexpr) { space? >> str(')') } # optional space followed by ')'