@billdueber
Last active March 31, 2021 20:41
Basic path through slip

SLIP flow for normal (non-print-holdings or collection-builder) items

A basic run-through of how items move through SLIP.

DB Tables overview

  • slip_rights: (one row per item). A copy-ish of rights_current with additional information about when an item was last updated. Populated/updated from vufind solr.
  • slip_queue: (one row per item-to-update). A list of HTIDs along with slots to hold information about which (if any) process is trying to index the item right now. Populated from slip_rights based on timestamps.
  • slip_indexed: (one row per item). Every htid, its shard, when it was last indexed, and how many times it's been indexed.
  • slip_errors: Lines from slip_queue where indexing failed.
  • slip_*_control, slip_*_timestamp, etc. (one row, or one row per type of indexing run): On/off flags, when things last ran, etc.
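
As a rough orientation, the row shapes can be sketched as Ruby structs. The column names below are only the ones the pseudocode later in this document touches; the real tables almost certainly have more columns, and the key column of slip_rights is called htid here for consistency (an assumption). The HTID is made up.

```ruby
# Hypothetical row shapes, inferred from the columns the pseudocode uses.
SlipRights  = Struct.new(:htid, :update_time, :insert_time, keyword_init: true)
SlipQueue   = Struct.new(:htid, :shard, :pid, :host, :proc_status, keyword_init: true)
SlipIndexed = Struct.new(:htid, :shard, :time, :indexed_ct, keyword_init: true)
SlipError   = Struct.new(:htid, :shard, :pid, :host, :time, :status, keyword_init: true)

# A freshly queued item: no process has claimed it, so pid and host are nil.
queued = SlipQueue.new(htid: "test.0001", shard: 0)
puts queued.pid.nil?   # true means "available to be claimed"
```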

0. Update the catalog

For stupid historical reasons, the entire SLIP queuing process is driven by what's in the catalog, as determined by querying the catalog.

To facilitate this, the catalog also indexes, for each record, the last-updated dates of every htitem on that record, as reported by Zephir in a MARC 974 field.
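
Those per-item dates are what make a "changed since" query against the catalog possible. A minimal sketch of building such a query, assuming the Solr field is named ht_id_update (as in the pseudocode in step 1) and holds dates formatted like 20210201; the helper name is hypothetical:

```ruby
require "date"

# Hypothetical helper: Solr range query for "any item updated on or after `since`".
# Assumes ht_id_update stores dates as YYYYMMDD.
def updated_since_query(since)
  "ht_id_update:[#{since.strftime('%Y%m%d')} TO *]"
end

puts updated_since_query(Date.new(2021, 2, 1))
# prints: ht_id_update:[20210201 TO *]
```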

1. rights_j: Update the slip_rights table based on the catalog

slip_rights is a (kind of) copy of rights_current, with one row for each item, updated whenever an item is determined to need indexing.

In addition to rights data, it has two other fields:

  • update_time: (date like 20210201) the last time the item (or its record) was changed, according to Zephir.
  • insert_time: (timestamp) the time at which this line was last inserted/updated (i.e., DEFAULT = CURRENT_TIMESTAMP). This will be used later to determine which items actually need (re)indexing.

rights_j grabs likely IDs to (re)index from the catalog and updates slip_rights with the update date (from the catalog) and changes the insert_time to NOW.

Pseudocode

last_time_rights_ran = sql("select time from slip_vsolr_timestamp")

vufind_query("ht_id_update:[last_time_rights_ran TO *]").each do |rec|
  rec.hathi_items.each do |item|
    upsert_into_slip_rights(
      id=item.htid, 
      update_time=item.zephir_update_date,
      insert_time = NOW, 
      other_crap)
  end
end

# set this so we know how to query the catalog the next time around
sql("update slip_vsolr_timestamp set time = max_insert_time_in_slip_rights")
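
The same flow can be simulated in memory to see which rows get touched. This is a sketch under assumptions, not the real rights_j: the catalog is an array of records (each an array of items), slip_rights is a hash keyed on htid, and the HTIDs are made up. Note that every item on a matching record is upserted, not just the item that changed.

```ruby
require "date"

Item = Struct.new(:htid, :zephir_update_date)

def run_rights_j(catalog, slip_rights, last_run, now)
  cutoff = last_run.to_date
  # Stand-in for the Solr range query: records with at least one item
  # updated on or after the cutoff.
  changed = catalog.select { |rec| rec.any? { |item| item.zephir_update_date >= cutoff } }
  changed.each do |rec|
    rec.each do |item|
      # Upsert keyed on htid; insert_time is always bumped to "now".
      slip_rights[item.htid] = { update_time: item.zephir_update_date, insert_time: now }
    end
  end
  # Value to store for the next run: the max insert_time in slip_rights.
  slip_rights.values.map { |row| row[:insert_time] }.max || last_run
end

catalog = [
  [Item.new("test.a1", Date.new(2021, 2, 1)), Item.new("test.a2", Date.new(2020, 1, 1))],
  [Item.new("test.b1", Date.new(2019, 6, 1))]
]
slip_rights = {}
new_stamp = run_rights_j(catalog, slip_rights, Time.new(2021, 1, 15), Time.new(2021, 2, 2))
puts slip_rights.keys.inspect
```

Here test.a2 is picked up even though it has not changed since the last run, because it shares a record with test.a1; test.b1 is left alone.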

2. enqueuer-j: copy stuff from slip_rights to slip_queue

slip_queue holds only rows for things that need indexing. In addition to the htid and shard, it has slots where processes can put data that indicate the item is actually being worked on.

enqueuer-j keeps its last-run time in the extremely poorly named slip_rights_timestamp table:

last_time_enqueuer_ran = sql("select time from slip_rights_timestamp")

items = sql("select * from slip_rights where insert_time >= last_time_enqueuer_ran")
items.each do |item|
  shard = sql("select shard from slip_indexed where htid=item.htid") || 0
  upsert_into_slip_queue(htid=item.htid, shard=shard)
end

sql("update slip_rights_timestamp set time = NOW")
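
An in-memory sketch of the same step, with hashes keyed on htid standing in for the three tables (an assumption for illustration). How the real upsert treats a row that a process has already claimed is not covered here; this version leaves any existing claim untouched and only refreshes the shard.

```ruby
def run_enqueuer(slip_rights, slip_indexed, slip_queue, last_run)
  slip_rights.each do |htid, row|
    # Only rows inserted/updated since the enqueuer last ran.
    next if row[:insert_time] < last_run
    # Reuse the shard of a previously indexed item; 0 means "not assigned yet".
    shard = slip_indexed.dig(htid, :shard) || 0
    # Upsert keyed on htid: create an unclaimed row if none exists.
    queue_row = (slip_queue[htid] ||= { pid: nil, host: nil, proc_status: nil })
    queue_row[:shard] = shard
  end
  slip_queue
end

slip_rights = {
  "test.a1"  => { insert_time: Time.new(2021, 2, 2) },
  "test.n1"  => { insert_time: Time.new(2021, 2, 2) },
  "test.old" => { insert_time: Time.new(2020, 5, 1) }
}
slip_indexed = { "test.a1" => { shard: 7 }, "test.old" => { shard: 3 } }
slip_queue = run_enqueuer(slip_rights, slip_indexed, {}, Time.new(2021, 2, 1))
slip_queue.each { |htid, row| puts "#{htid} shard=#{row[:shard]}" }
```

The previously indexed test.a1 keeps shard 7, the never-indexed test.n1 is queued with shard 0, and test.old is skipped because its insert_time predates the last run.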

3. index-j: index the documents

index-j roughly/conceptually does the following (more detail to come):

# claim one unclaimed row so no other process picks it up
htid, shard = sql("select htid, shard from slip_queue where pid is NULL limit 1")
sql("update slip_queue set
       pid = $PID,
       host = $HOST,
       proc_status = 'indexing'
     where htid = htid")

# shard 0 means the item has never been indexed: pick a shard at random
if shard == 0
  shard = random(12)
end
       
mets = mets_file(htid)

metadata = make_http_call_to_vufind_solr_no_for_real(htid)
other_metadata = mets.metadata_we_want
text = metadata.pages.map {|page| get_page_text(htid, page)}.join(' ')

solr_document = make_solr_doc(metadata, other_metadata, text)

status = http.post(solr_url_for_shard, solr_document)

sql("delete from slip_queue where htid=htid and pid=$PID")

if status == 'ok'
  sql("update slip_indexed set
        shard = shard,
        time = NOW,
        indexed_ct = indexed_ct + 1
       where htid = htid")
else
  sql("insert into slip_errors values (htid, shard, $PID, $HOST, NOW, status)")
end
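
The claim/release bookkeeping around the actual indexing can be sketched in memory as follows. Everything here is an assumption for illustration: hashes stand in for the tables, the block passed by the caller stands in for "build the Solr document and post it", shards are taken to be numbered 1 to 12 (since 0 is used as "unassigned"), and a first-time item gets a fresh slip_indexed row.

```ruby
def index_one(slip_queue, slip_indexed, slip_errors, pid:, host:, now:)
  # Claim the first row that no process owns.
  htid, row = slip_queue.find { |_htid, r| r[:pid].nil? }
  return nil if htid.nil?
  row.merge!(pid: pid, host: host, proc_status: "indexing")

  # 0 means "no shard yet": pick one of 1..12 (assumed numbering).
  shard = row[:shard].zero? ? rand(1..12) : row[:shard]

  # The caller's block does the real work and returns a status string.
  status = yield(htid, shard)

  # Success or failure, the item leaves the queue.
  slip_queue.delete(htid)
  if status == "ok"
    rec = (slip_indexed[htid] ||= { indexed_ct: 0 })
    rec.merge!(shard: shard, time: now)
    rec[:indexed_ct] += 1
  else
    slip_errors << { htid: htid, shard: shard, pid: pid, host: host, time: now, status: status }
  end
  status
end

queue   = { "test.a1" => { shard: 7, pid: nil }, "test.n1" => { shard: 0, pid: nil } }
indexed = { "test.a1" => { shard: 7, time: nil, indexed_ct: 3 } }
errors  = []
now     = Time.new(2021, 2, 3)

index_one(queue, indexed, errors, pid: 101, host: "worker-1", now: now) { |_h, _s| "ok" }
index_one(queue, indexed, errors, pid: 101, host: "worker-1", now: now) { |_h, _s| "timeout" }
puts "indexed_ct=#{indexed['test.a1'][:indexed_ct]} errors=#{errors.size} queued=#{queue.size}"
```

After the two calls, test.a1 has been re-indexed on its existing shard, test.n1 has landed in the error list with a randomly chosen shard, and the queue is empty.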