Bill Dueber billdueber

@billdueber
billdueber / marcquery.rb
Created July 15, 2013 19:06
Simple parslet parser for same MARC field (not subfield) query string
require 'parslet'
# A complex field-selection syntax to get MARC fields. Something I'm messing around
# with for a marc indexing process I'm thinking of building to replace marcspec
#
#
# spec := <tag>
#         <tag>!<ind><ind>
# tag  := '245'   # literal string
#      := '6##'   # use hashes to mean "any character"
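One possible reading of the grammar above, sketched in plain Ruby rather than parslet so it runs without any gems. The names `SPEC_PATTERN`, `parse_spec`, `tag_regexp`, and `tag_matches?` are hypothetical and do not come from the gist, and the set of characters allowed in a tag is an assumption.

```ruby
# Hypothetical helpers, not from the gist: read "<tag>" or "<tag>!<ind><ind>".
# Assumption: a tag is exactly three characters drawn from digits, letters, or '#'.
SPEC_PATTERN = /\A(?<tag>[0-9A-Za-z#]{3})(?:!(?<ind1>.)(?<ind2>.))?\z/

# Split a spec string into its tag and (optional) indicator parts.
def parse_spec(spec)
  m = SPEC_PATTERN.match(spec)
  raise ArgumentError, "bad spec: #{spec.inspect}" unless m
  { tag: m[:tag], ind1: m[:ind1], ind2: m[:ind2] }
end

# Turn a tag pattern such as '6##' into a regexp in which
# each '#' matches any single character.
def tag_regexp(tag_pattern)
  body = tag_pattern.each_char.map { |c| c == '#' ? '.' : Regexp.escape(c) }.join
  Regexp.new("\\A#{body}\\z")
end

# True when a concrete tag (e.g. '650') satisfies a tag pattern (e.g. '6##').
def tag_matches?(tag_pattern, tag)
  tag_regexp(tag_pattern).match?(tag)
end
```

With these definitions, `tag_matches?('6##', '650')` is true and `tag_matches?('6##', '245')` is false, while `parse_spec('245!10')` separates the tag `245` from the indicators `1` and `0`.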
billdueber / marc2solr_lessons.adoc
Last active December 19, 2015 21:48
Lessons learned from my marc2solr stuff

Lessons learned from marc2solr

Things I did wrong

These are the things off the top of my head that drive me crazy and/or that I’ve had to work around. I’m sure there are more that I’ll come up with later.

The fundamental problem, it feels to me, is that the (equivalent of the) MARC::Reader.each loop is hidden. Pretty much all the rest of these problems flow from that. Basically, I want to give up on the idea of hiding the primary loop from the user, and just assume the user is both a programmer and a non-idiot.
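A rough sketch of what leaving the loop in the user's hands might look like. `FakeReader`, `to_solr_doc`, `index_records`, and the hash-shaped records are all stand-ins invented for illustration; a real script would iterate an actual MARC reader and send the documents to Solr.

```ruby
# Stand-in for a MARC reader: anything with #each that yields records.
class FakeReader
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end
end

# Hypothetical mapping step from one record to one output document.
def to_solr_doc(record)
  { 'id' => record[:id], 'title' => record[:title].strip }
end

# The primary loop is ordinary user code, so skipping, counting, and
# error handling are plain Ruby rather than framework hooks.
def index_records(reader)
  docs = []
  skipped = 0
  reader.each do |record|
    if record[:id].nil?
      skipped += 1
      next
    end
    docs << to_solr_doc(record)
  end
  [docs, skipped]
end

docs, skipped = index_records(
  FakeReader.new([
    { id: '1', title: ' First title ' },
    { id: nil, title: 'No id' }
  ])
)
```

Because the loop is visible, changing what happens per record (batching, logging every N records, bailing out early) means editing a few lines rather than working around the framework.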

billdueber / marc2solr_sample_log.txt
Created July 22, 2013 15:37
Sample log lines from my code
# Not saying this is optimal, just what I currently do.
# Start out by logging what the hell we're doing: what config files are loaded, where we're sending documents, etc.
INFO 08:48:20 1252 ROOT Loading files in /l/solr-vufind/apps/marc2solr_example/umich/lib
INFO 08:48:23 4113 MARC2Solr.Conf Set suss url to http://localhost:8024/solr/biblio
INFO 08:48:23 4114 MARC2Solr.Conf Using 3 threads for the suss
INFO 08:48:24 4334 ROOT Using 4 threads; activiating threach
INFO 08:48:24 4335 ROOT Indexing file /l/solr-vufind/data/vufind_full_20130715.seq.gz
INFO 08:48:24 4335 MARC2Solr.Conf Sniffed marc file type as seq
billdueber / marc4j_jruby.rb
Last active December 20, 2015 04:29
Using marc4j from jruby
# I just nabbed the source of marc4j and built it with "ant jar"
require 'marc4j-2.5.1-beta.jar'
# Conveniently add Enumerable to the reader interface so I can get #each, #each_with_index, etc.
# This would be automatic if MarcReader were specified as an iterable, as per a recent github issue
# on the marc4j repo (https://github.com/marc4j/marc4j/issues/11)
module org.marc4j::MarcReader
  include Enumerable
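The same idea in plain Ruby, as a sketch that runs without marc4j or JRuby: any class that defines `each` picks up `each_with_index`, `map`, `select`, and the rest by including `Enumerable`. `CountdownReader` and its `has_next?`/`next_record` methods are made-up stand-ins for a `hasNext`/`next` style reader and are not part of marc4j.

```ruby
# Made-up stand-in for a Java-style reader with has_next?/next_record.
class CountdownReader
  include Enumerable

  def initialize(items)
    @items = items.dup
  end

  def has_next?
    !@items.empty?
  end

  def next_record
    @items.shift
  end

  # Enumerable only needs #each; everything else is built on top of it.
  def each
    return enum_for(:each) unless block_given?
    yield next_record while has_next?
  end
end

labelled = CountdownReader.new(%w[rec1 rec2 rec3])
                          .each_with_index
                          .map { |rec, i| "#{i}:#{rec}" }
```

Note that a reader like this is consumed as it is iterated, so a second pass over the same object yields nothing.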
billdueber / _UPDATE: MY BAD_
Last active December 23, 2015 00:29
jruby +indy object creation much slower
indy slowdown was due to me having JRUBY_OPTS include -J-XX:+TieredCompilation and
-J-XX:TieredStopAtLevel=1, supposedly to make startup faster. Removing them
removes the performance issue. Jira ticket closed.
billdueber / extractor_spec.rb
Created September 27, 2013 15:15
partial implementation of MarcExtractor specset and spec objects
module Traject
  class MarcExtractor
    # A set of Spec objects, with knowledge about the collection as a whole
    class SpecSet
      attr_reader :interesting_tags, :options
      def initialize(opts = {})
        @specs = {}

Subject: ANNOUNCEMENT: Traject MARC->Solr indexer beta release

Jonathan Rochkind (Johns Hopkins), along with Bill Dueber (University of Michigan), is happy to announce a first beta release of "traject," a framework for indexing MARC data to Solr.

traject, in the vein of solrmarc, allows you to define your indexing rules using simple macro and translation files. However, traject runs under JRuby and is "ruby all the way down," so you can easily provide additional logic by simply requiring ruby files.

traject is currently in a beta release, but is already being used in production to generate the HathiTrust Catalog (http://www.hathitrust.org/). traject was developed under a test-first mentality and has undergone both continuous integration and an extensive benchmarking/profiling period to keep it fast.

You can view the code[1] on github, and easily install it as a (jruby) gem using "gem install traject".

billdueber / rjgit_snippets.rb
Last active December 28, 2015 19:09
Playing around with JGit in JRuby
require 'jgit-3.1.0.jar'
dirpath = '.'
# get the repo
dir = java.io::File.new(dirpath)
frb = org.eclipse.jgit.storage.file.FileRepositoryBuilder.new
billdueber / thread_pool_failure.rb
Last active August 29, 2015 14:02
Demonstration of MRI thread pools failing to do all their work
require 'concurrent'
require 'thread'
# A simple safe array implementation.
class SafeArray
  def initialize
    @write_mutex = Mutex.new
    @arr = []
  end
  def <<(x)
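For reference, a minimal self-contained version of a mutex-guarded array, exercised with plain `Thread` objects instead of the `concurrent` gem's pools. This is only a sketch of the general technique: `GuardedArray`, its `size` and `to_a` methods, and `fill_concurrently` are illustrative names, not the gist's code, which is cut off in the preview above.

```ruby
# Illustrative mutex-guarded array; every access goes through one lock.
class GuardedArray
  def initialize
    @mutex = Mutex.new
    @arr = []
  end

  # Append under the lock; return self so calls can be chained like Array#<<.
  def <<(x)
    @mutex.synchronize { @arr << x }
    self
  end

  def size
    @mutex.synchronize { @arr.size }
  end

  # Return a copy so callers never touch the guarded array directly.
  def to_a
    @mutex.synchronize { @arr.dup }
  end
end

# Start thread_count threads that each append per_thread items,
# then wait for all of them before handing back the results.
def fill_concurrently(thread_count, per_thread)
  results = GuardedArray.new
  threads = Array.new(thread_count) do |t|
    Thread.new do
      per_thread.times { |i| results << [t, i] }
    end
  end
  threads.each(&:join)
  results
end
```

Joining every thread before reading the size is the step that guarantees all the work has finished; with a pool, the equivalent is shutting the pool down and waiting for termination before counting.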
billdueber / _setup.sh
Last active August 29, 2015 14:06
Lotus::Utils JRuby-Head failures
$ git clone http://github.com/lotus/utils
Cloning into 'utils'...
remote: Counting objects: 830, done.
remote: Total 830 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (830/830), 137.72 KiB | 0 bytes/s, done.
Resolving deltas: 100% (441/441), done.
Checking connectivity... done.
$ cd utils
$ chruby jruby_head