billdueber’s gists

billdueber / gist:1947347

Created March 1, 2012 04:46

numericID solr fieldtype

	<fieldtype name="numericID" class="solr.TextField"
	           positionIncrementGap="1000" omitNorms="true">
	  <analyzer>
	    <tokenizer class="solr.KeywordTokenizerFactory"/>
	    <filter class="solr.PatternReplaceFilterFactory"
	            pattern="^.*?(\p{N}[\p{N}\-\.]{6,}[xX]?).*$"
	            replacement="***$1" />
	    <filter class="solr.PatternReplaceFilterFactory"
	            pattern="^[^\*].*$" replacement="" />
	    <filter class="solr.PatternReplaceFilterFactory"

billdueber / gist:1979628

Created March 5, 2012 17:30

marc_marc4j reader producing ruby-marc records

	Testing on an 18K file in both marc21 and marc-xml. Loop looks like:

	reader = MARC::Reader.new(m21file) # or whatever appropriate reader
	reader.each do |r|
	  t = r['245']['a']
	end

	MARC version is the just-released 0.4.4

	The following numbers are for a run with just enough compatibility to run the above code.

billdueber / pdf_links.rb

Created March 6, 2012 19:28

pdf list of links

	load '/path/to/itextpdf-5.2.0.jar'

	IText = Java::com.itextpdf.text # save some typing

	doc = IText::Document.new(IText::PageSize::LETTER, 50, 50, 50, 50)

	# addAuthor and addSubject don't seem to work, at least when viewing in Preview
	doc.addAuthor "Bill Dueber"
	doc.addSubject "Why are we requiring PDF???"

billdueber / decode_bench_results.txt

Created June 1, 2012 16:23

jruby: weird Benchmark results?

	jruby 1.6.7.2 (ruby-1.9.2-p312) (2012-05-01 26e08ba) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
	      user     system      total        real
	  4.650000   0.000000   4.650000 (  4.650000)

	jruby 1.7.0.preview1 (ruby-1.9.3-p203) (2012-05-19 00c8c98) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
	      user     system      total        real
	 12.090000   0.280000  12.370000 (  5.778000)

billdueber / gist:3213716

Created July 31, 2012 04:44

JRuby JSON-generation slowdown benchmark

	require 'benchmark'
	require 'json'

	puts RUBY_DESCRIPTION

	# This mess is a json representation of a MARC record (format used in libraries and museums)

	m = %Q[{"leader":"01470nam^a22004451^^4500","fields":[{"001":"000000040"},{"005":"19880715000000.0"},{"006":"m^^^^^^^^d^^^^^^^^"},{"007":"cr^bn^---auaua"},{"008":"880715s1968^^^^nyuae^^^^b^^^|00100^eng^^"},{"010":{"ind1":" ","ind2":" ","subfields":[{"a":"68027371"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(RLIN)MIUG0001728-B"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(CaOTULAS)159818044"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(OCoLC)ocm00001728"}]}},{"040":{"ind1":" ","ind2":" ","subfields":[{"a":"DLC"},{"c":"DLC"},{"d":"MiU"},{"d":"CStRLIN"},{"d":"MiU"}]}},{"050":{"ind1":"0","ind2":" ","subfields":[{"a":"N6350"},{"b":".P4 1968b"}]}},{"082":{"ind1":" ","ind2":" ","subfields":[{"a":"709.03"}]}},{"100":{"ind1":"1","ind2":" ","subfields":[{"a":"Pevsner, Nikolaus,"},{"d":"1902-1983."}]}},{"245":{"ind1":"1","ind2":"0","subfields

billdueber / gist:4484735

Last active December 10, 2015 19:58

Packing a bunch of ids into an encrypted string for use in a URL -- just some experiments. Driven by https://bibwild.wordpress.com/2013/01/07/crazy-use-of-encryption-to-protect-refworks-callback-urls/

	# Packing a bunch of ids into an encrypted string for use in a URL -- just some experiments.
	# Driven by https://bibwild.wordpress.com/2013/01/07/crazy-use-of-encryption-to-protect-refworks-callback-urls/

	# A quick experiment to look at how many ids of various types we can crush into an
	# encrypted string. Obviously we could be more particular about how we pack them in
	# if we know the characteristics of the IDs ahead of time.


	require 'stringio'
	require 'zlib'

billdueber / gist:4596330

Last active December 11, 2015 11:48

File.read vs File.read with size

	require 'benchmark'
	bigfiles = %w[ddd1 ddd2 ddd3] # all copies of the same 6MB file

	puts RUBY_DESCRIPTION
	puts "Bigfile size is ", File.size('ddd1')
	Benchmark.bmbm do |bm|
	  bm.report("straight read") { bigfiles.each {|bigfile| File.read(bigfile) } }
	  bm.report("read w/ size") { bigfiles.each {|bigfile| File.read(bigfile,File.size(bigfile)) } }
	end

billdueber / solr_text_types.xml

Last active December 18, 2015 03:09

Umich experimental text types for solr

	<!--
	#########################
	TEXT FIELD TYPES
	#########################

	In all cases, we want to perform NFKC unicode normalization,
	case folding, and ASCII-folding (i.e., removal of accents so
	ü => u).

	ICUFoldingFilterFactory will give us all of those things.
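To make the three steps concrete, here is a small Ruby approximation of what the comment describes. It is only a sketch: `approx_fold` is a hypothetical name, and Ruby's built-in normalization and case folding cover less ground than ICU's folding rules, so the output will not match ICUFoldingFilterFactory in every case.

```ruby
# Rough stand-in for the folding described above, using only core Ruby.
# approx_fold is a hypothetical helper, not part of Solr or ICU.
def approx_fold(text)
  text
    .unicode_normalize(:nfkc) # compatibility normalization (e.g. ligatures)
    .downcase(:fold)          # Unicode case folding
    .unicode_normalize(:nfd)  # split base letters from combining accents
    .gsub(/\p{Mn}/, '')       # drop the combining marks
    .unicode_normalize(:nfc)  # recompose whatever is left
end

puts approx_fold("\u00DCber") # "Über" becomes "uber"
```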

billdueber / gist:5824804

Created June 20, 2013 17:30

Simple icu-tokenizer with symbol replacement

	<fieldtype name="text" class="solr.TextField" positionIncrementGap="1000">
	  <analyzer type="index">
	    <charFilter class="solr.PatternReplaceCharFilterFactory"
	                pattern="&amp;" replacement=" and " />
	    <charFilter class="solr.PatternReplaceCharFilterFactory"
	                pattern="\b([A-Ga-g])[\#♯](\s+|\Z)" replacement="$1 sharp$2" />
	    <charFilter class="solr.PatternReplaceCharFilterFactory"
	                pattern="\b([A-Ga-g])\s*[b♭](\s+|\Z)" replacement="$1 flat$2" />
	    <charFilter class="solr.PatternReplaceCharFilterFactory"
	                pattern="\b[Cc]\+\+" replacement="cplusplus" />
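The effect of these char filters can be tried outside Solr with the same four substitutions written as Ruby `gsub` calls. This is an illustrative sketch: `normalize_symbols` is a made-up name, and Ruby's regex engine is close to but not identical to the Java one Solr uses.

```ruby
# Applies the four replacements in the same order as the char filters.
# normalize_symbols is a hypothetical helper for experimenting only.
def normalize_symbols(text)
  text
    .gsub('&', ' and ')
    .gsub(/\b([A-Ga-g])[\#♯](\s+|\Z)/, '\1 sharp\2')
    .gsub(/\b([A-Ga-g])\s*[b♭](\s+|\Z)/, '\1 flat\2')
    .gsub(/\b[Cc]\+\+/, 'cplusplus')
end

puts normalize_symbols('Sonata in C# minor')
puts normalize_symbols('Bb major')
puts normalize_symbols('C++ primer')
```

Because the flat rule allows optional whitespace before the `b`, a standalone letter followed by a lone "b" (as in "a b c") is also rewritten, which may or may not be acceptable for a given collection.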

billdueber / marc_xml_test.rb

Created July 9, 2013 19:27

Compare nokogiri xml parsing vs. marc4j4r xml parsing with a roundtrip through marchash

	require 'marc'
	require 'marc4j4r'
	require 'benchmark'


	iterations = 1

	xmlsourcefile = 'topics.xml' # 18k records as a MARC-XML collection

	puts RUBY_DESCRIPTION

Bill Dueber billdueber