Bill Dueber billdueber

@billdueber
billdueber / gist:1947347
Created March 1, 2012 04:46
numericID solr fieldtype
<fieldtype name="numericID" class="solr.TextField"
positionIncrementGap="1000" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="^.*?(\p{N}[\p{N}\-\.]{6,}[xX]?).*$"
replacement="***$1" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="^[^\*].*$" replacement="" />
<filter class="solr.PatternReplaceFilterFactory"
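The first two filters above can be read as: find the first run that looks like a numeric identifier and mark it with `***`, then blank out any token that was not marked. A rough Ruby sketch of just those two steps follows. The method name `mark_numeric_id` is made up for illustration, Ruby's `\A` and `\z` stand in for Java's `^` and `$`, and the later filters are not reproduced because the preview cuts off.

```ruby
# Hypothetical helper mirroring only the first two PatternReplaceFilter steps.
def mark_numeric_id(token)
  # Step 1: keep the first digit-led run of 7+ characters, prefixed with ***.
  marked = token.sub(/\A.*?(\p{N}[\p{N}\-.]{6,}[xX]?).*\z/, '***\1')
  # Step 2: anything that did not get the *** marker is emptied out.
  marked.sub(/\A[^*].*\z/, '')
end

puts mark_numeric_id('ISBN 0-306-40615-2 (pbk.)')   # prints ***0-306-40615-2
puts mark_numeric_id('no identifier here').inspect  # prints ""
```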
billdueber / gist:1979628
Created March 5, 2012 17:30
marc_marc4j reader producing ruby-marc records
Testing on an 18K file in both marc21 and marc-xml. Loop looks like:
reader = MARC::Reader.new(m21file) # or whatever appropriate reader
reader.each do |r|
t = r['245']['a']
end
MARC version is the just-released 0.4.4
The following numbers are for a run with just enough compatibility to run the above code.
billdueber / pdf_links.rb
Created March 6, 2012 19:28
pdf list of links
billdueber / decode_bench_results.txt
Created June 1, 2012 16:23
jruby: weird Benchmark results?
jruby 1.6.7.2 (ruby-1.9.2-p312) (2012-05-01 26e08ba) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
      user     system      total        real
  4.650000   0.000000   4.650000 (  4.650000)
jruby 1.7.0.preview1 (ruby-1.9.3-p203) (2012-05-19 00c8c98) (OpenJDK 64-Bit Server VM 1.7.0-u4-b13) [darwin-amd64-java]
      user     system      total        real
 12.090000   0.280000  12.370000 (  5.778000)
billdueber / gist:3213716
Created July 31, 2012 04:44
JRuby JSON-generation slowdown benchmark
require 'benchmark'
require 'json'
puts RUBY_DESCRIPTION
# This mess is a json representation of a MARC record (format used in libraries and museums)
m = %Q[{"leader":"01470nam^a22004451^^4500","fields":[{"001":"000000040"},{"005":"19880715000000.0"},{"006":"m^^^^^^^^d^^^^^^^^"},{"007":"cr^bn^---auaua"},{"008":"880715s1968^^^^nyuae^^^^b^^^|00100^eng^^"},{"010":{"ind1":" ","ind2":" ","subfields":[{"a":"68027371"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(RLIN)MIUG0001728-B"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(CaOTULAS)159818044"}]}},{"035":{"ind1":" ","ind2":" ","subfields":[{"a":"(OCoLC)ocm00001728"}]}},{"040":{"ind1":" ","ind2":" ","subfields":[{"a":"DLC"},{"c":"DLC"},{"d":"MiU"},{"d":"CStRLIN"},{"d":"MiU"}]}},{"050":{"ind1":"0","ind2":" ","subfields":[{"a":"N6350"},{"b":".P4 1968b"}]}},{"082":{"ind1":" ","ind2":" ","subfields":[{"a":"709.03"}]}},{"100":{"ind1":"1","ind2":" ","subfields":[{"a":"Pevsner, Nikolaus,"},{"d":"1902-1983."}]}},{"245":{"ind1":"1","ind2":"0","subfields
billdueber / gist:4484735
Last active December 10, 2015 19:58
Packing a bunch of ids into an encrypted string for use in a URL -- just some experiments. Driven by https://bibwild.wordpress.com/2013/01/07/crazy-use-of-encryption-to-protect-refworks-callback-urls/
# Packing a bunch of ids into an encrypted string for use in a URL -- just some experiments.
# Driven by https://bibwild.wordpress.com/2013/01/07/crazy-use-of-encryption-to-protect-refworks-callback-urls/
# A quick experiment to look at how many ids of various types we can cram into an
# encrypted string. Obviously we could be more particular about how we pack them in
# if we know the characteristics of the IDs ahead of time.
require 'stringio'
require 'zlib'
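The preview stops at the requires, so the gist's actual packing scheme is not visible here. As a minimal sketch of the general idea (join the ids, compress, encrypt, encode for a URL), something like the following could work. The helper names `pack_ids` and `unpack_ids`, the comma separator, and the choice of AES-256-CBC are all assumptions made for illustration, and the sketch does no authentication of the ciphertext, which a real callback URL would probably want.

```ruby
require 'openssl'
require 'zlib'

# Hypothetical helper: ids -> compressed -> encrypted -> URL-safe base64.
def pack_ids(ids, key)
  compressed = Zlib::Deflate.deflate(ids.join(','))
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv                      # fresh 16-byte IV per token
  encrypted = cipher.update(compressed) + cipher.final
  # Prepend the IV so the receiver can decrypt; swap +/ for URL-safe -_.
  [iv + encrypted].pack('m0').tr('+/', '-_')
end

# Hypothetical inverse of pack_ids.
def unpack_ids(token, key)
  raw = token.tr('-_', '+/').unpack1('m0')
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.decrypt
  cipher.key = key
  cipher.iv = raw[0, 16]
  compressed = cipher.update(raw[16..-1]) + cipher.final
  Zlib::Inflate.inflate(compressed).split(',')
end

key = OpenSSL::Random.random_bytes(32)           # 256-bit key, made up for the demo
ids = (1..50).map { |i| format('%09d', i) }      # fifty zero-padded sample ids
token = pack_ids(ids, key)
puts "#{ids.size} ids -> #{token.length} URL characters"
```

How short the token ends up depends heavily on how compressible the ids are, which is the point the comment above makes about knowing their characteristics ahead of time.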
billdueber / gist:4596330
Last active December 11, 2015 11:48
File.read vs File.read with size
require 'benchmark'
bigfiles = %w[ddd1 ddd2 ddd3] # all copies of the same 6MB file
puts RUBY_DESCRIPTION
puts "Bigfile size is ", File.size('ddd1')
Benchmark.bmbm do |bm|
bm.report("straight read") { bigfiles.each {|bigfile| File.read(bigfile) } }
bm.report("read w/ size") { bigfiles.each {|bigfile| File.read(bigfile,File.size(bigfile)) } }
end
billdueber / solr_text_types.xml
Last active December 18, 2015 03:09
Umich experimental text types for solr
<!--
#########################
TEXT FIELD TYPES
#########################
In all cases, we want to perform NFKC unicode normalization,
case folding, and ASCII-folding (i.e., removal of accents so
ü => u).
ICUFoldingFilterFactory will give us *all* of those things.
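To make the three transformations named above concrete, here is a rough Ruby approximation. It is only an illustration: `rough_fold` is a made-up name, plain `downcase` stands in for full Unicode case folding, and stripping combining marks after NFD is a crude substitute for the much broader folding ICU performs.

```ruby
# Rough approximation only; ICU folding covers far more cases than this.
def rough_fold(text)
  text.unicode_normalize(:nfkc)  # compatibility normalization (ligatures, full-width forms)
      .downcase                  # simple lowercasing, standing in for case folding
      .unicode_normalize(:nfd)   # split base letters from combining accents
      .gsub(/\p{Mn}/, '')        # drop the combining marks
end

puts rough_fold("\u00DCber")  # U+00DC is a capital U with diaeresis; prints uber
puts rough_fold("\uFB01le")   # U+FB01 is the "fi" ligature; prints file
```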
billdueber / gist:5824804
Created June 20, 2013 17:30
Simple icu-tokenizer with symbol replacement
<fieldtype name="text" class="solr.TextField" positionIncrementGap="1000">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="&amp;" replacement=" and " />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b([A-Ga-g])[\#♯](\s+|\Z)" replacement="$1 sharp$2" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b([A-Ga-g])\s*[b♭](\s+|\Z)" replacement="$1 flat$2" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b[Cc]\+\+" replacement="cplusplus" />
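The four charFilters shown above rewrite symbols into words before tokenizing. Their combined effect can be tried outside Solr with a small Ruby sketch that applies the same patterns in the same order. The names `SYMBOL_RULES` and `replace_symbols` are invented for this illustration, and Java regex behaviour may differ from Ruby's in edge cases.

```ruby
# Hypothetical table mirroring the four charFilters shown above, in order.
SYMBOL_RULES = [
  [/&/, ' and '],                                   # &amp; in the XML is a literal &
  [/\b([A-Ga-g])[\#♯](\s+|\Z)/, '\1 sharp\2'],      # note name followed by a sharp sign
  [/\b([A-Ga-g])\s*[b♭](\s+|\Z)/, '\1 flat\2'],     # note name followed by b or a flat sign
  [/\b[Cc]\+\+/, 'cplusplus']                       # the language name C++
].freeze

# Apply each rule in turn to the whole input string.
def replace_symbols(text)
  SYMBOL_RULES.reduce(text) { |acc, (pattern, replacement)| acc.gsub(pattern, replacement) }
end

puts replace_symbols('F# minor')    # prints F sharp minor
puts replace_symbols('C++ primer')  # prints cplusplus primer
puts replace_symbols('Q&A')         # prints Q and A
```

Because the flat rule accepts a plain `b`, ordinary text such as a word-initial "Ab" followed by whitespace is rewritten too; that is a property of the pattern as written in the gist.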
billdueber / marc_xml_test.rb
Created July 9, 2013 19:27
Compare nokogiri xml parsing vs. marc4j4r xml parsing with a roundtrip through marchash
require 'marc'
require 'marc4j4r'
require 'benchmark'
iterations = 1
xmlsourcefile = 'topics.xml' # 18k records as a MARC-XML collection
puts RUBY_DESCRIPTION