A basic run-through of how items move through SLIP.
- slip_rights: (one row per item). A copy-ish of rights_current with additional information about when an item was last updated. Populated/updated from vufind solr.
- slip_queue: (one row per item-to-update). A list of HTIDs along with slots to hold information about which (if any) process is trying to index the item right now. Populated from slip_rights based on timestamps.
- slip_indexed: (one row per item). Every htid, its shard, when it was last indexed, and how many times it's been indexed.
- slip_errors: Rows from slip_queue where indexing failed.
- slip_*_control, slip_*_timestamp, etc. (one row, or one row per type of indexing run): On/off flags, when things last ran, etc.
For stupid historical reasons, the entire SLIP queuing process is driven by what's in the catalog, as determined by querying the catalog.
To facilitate this, the catalog also indexes, for each record, the last-updated dates of every htitem on that record, as reported by Zephir in a MARC 974 field.
slip_rights is a (kind of) copy of rights_current, with one row for each item, updated whenever an item is determined to need indexing.
In addition to rights data, it has two other fields:
- update_time: (date like 20210201) the last time the item (or its record) was changed, according to Zephir.
- insert_time: (timestamp) the time at which this row was last inserted/updated (i.e., DEFAULT = CURRENT_TIMESTAMP). This will be used later to determine which items actually need (re)indexing.
rights_j grabs likely IDs to (re)index from the catalog, updates slip_rights with the update date (from the catalog), and changes the insert_time to NOW.
Pseudocode:
last_time_rights_ran = sql("select time from slip_vsolr_timestamp")
vufind_query("ht_id_update:[last_time_rights_ran TO *]").each do |rec|
  rec.hathi_items.each do |item|
    upsert_into_slip_rights(
      id = item.htid,
      update_time = item.zephir_update_date,
      insert_time = NOW,
      other_crap)
  end
end
# set this so we know how to query the catalog the next time around
sql("update slip_vsolr_timestamp set time = max_insert_time_in_slip_rights")
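The pseudocode above can be tried out with in-memory stand-ins. This is only a sketch, not the real rights_j: slip_rights is modeled as a Hash keyed by htid, the catalog results as arrays of a hypothetical CatalogItem struct, and run_rights is an invented name. The rights columns themselves are left out.

```ruby
# Hypothetical stand-in for one htitem on a catalog record.
CatalogItem = Struct.new(:htid, :zephir_update_date)

# Upsert every item on every returned record into slip_rights (a Hash keyed
# by htid), stamping insert_time with `now`. Returns the max insert_time,
# which is what the pseudocode stores in slip_vsolr_timestamp.
def run_rights(slip_rights, records, now)
  records.each do |items|
    items.each do |item|
      row = (slip_rights[item.htid] ||= {})      # insert if missing
      row[:update_time] = item.zephir_update_date
      row[:insert_time] = now                    # marks the row as "touched"
    end
  end
  slip_rights.values.map { |row| row[:insert_time] }.max
end

# One existing row, plus a catalog record carrying two items.
slip_rights = { "test.001" => { update_time: 20210101, insert_time: Time.at(100) } }
records = [[CatalogItem.new("test.001", 20210201),
            CatalogItem.new("test.002", 20210201)]]

new_vsolr_timestamp = run_rights(slip_rights, records, Time.at(200))
```

Note that every item on a changed record gets a fresh insert_time, whether or not that particular item changed; the enqueuer only looks at insert_time.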
slip_queue holds only rows for things that need indexing. In addition to the htid and shard, it has slots where processes can put data that indicate the item is actually being worked on.
The enqueuer that fills slip_queue keeps its last-run time in the extremely poorly named slip_rights_timestamp table:
last_time_enqueuer_ran = sql("select time from slip_rights_timestamp")
items = sql("select * from slip_rights where insert_time >= last_time_enqueuer_ran")
items.each do |item|
  # shard 0 means "never indexed, no shard assigned yet"
  shard = sql("select shard from slip_indexed where htid=item.htid") || 0
  upsert_into_slip_queue(htid=item.htid, shard=shard)
end
sql("update slip_rights_timestamp set time = NOW")
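The same enqueuing step, as a runnable sketch over in-memory Hashes. The table shapes, the run_enqueuer name, and the sample htids are assumptions made for illustration; only the selection rule (insert_time >= last run) and the shard lookup with a 0 fallback come from the pseudocode above.

```ruby
# Queue every slip_rights row touched since the enqueuer last ran.
# All three "tables" are Hashes keyed by htid.
def run_enqueuer(slip_rights, slip_indexed, slip_queue, last_run)
  slip_rights.each do |htid, row|
    next if row[:insert_time] < last_run
    # Reuse the shard of a previously indexed item; 0 means "unassigned".
    shard = slip_indexed.dig(htid, :shard) || 0
    slip_queue[htid] ||= { pid: nil, host: nil, proc_status: nil }
    slip_queue[htid][:shard] = shard
  end
end

slip_rights = {
  "test.001" => { insert_time: Time.at(200) },  # touched, already indexed
  "test.002" => { insert_time: Time.at(200) },  # touched, never indexed
  "test.003" => { insert_time: Time.at(50) }    # untouched since last run
}
slip_indexed = { "test.001" => { shard: 7 } }
slip_queue = {}

run_enqueuer(slip_rights, slip_indexed, slip_queue, Time.at(100))
```

After the call, slip_queue holds test.001 on shard 7 and test.002 on shard 0; test.003 is skipped because its insert_time predates the last run.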
index-j roughly/conceptually does the following (more detail to come):
htid, shard = sql("select htid, shard from slip_queue where pid is NULL limit 1")
# claim the row so no other process picks it up
sql("update slip_queue set
       pid = $PID,
       host = $HOST,
       proc_status = 'indexing'
     where htid = htid")
if shard == 0
  shard = random(12)
end
mets = mets_file(htid)
metadata = make_http_call_to_vufind_solr_no_for_real(htid)
other_metadata = mets.metadata_we_want
text = metadata.pages.map {|page| get_page_text(htid, page)}.join(' ')
solr_document = make_solr_doc(metadata, other_metadata, text)
status = http.post(solr_url_for_shard, solr_document)
sql("delete from slip_queue where htid=htid and pid=$PID")
if status == 'ok'
  sql("update slip_indexed set
         shard = shard,
         time = NOW,
         indexed_ct = indexed_ct + 1
       where htid = htid")
else
  sql("insert into slip_errors values (htid, shard, $PID, $HOST, NOW, status)")
end
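The claim / index / record-outcome flow above can be exercised without a database or Solr. The sketch below is an assumption-heavy illustration, not index-j itself: the tables are in-memory collections, index_one is an invented name, the actual document building and HTTP post are replaced by a caller-supplied block that returns a status string, shards are assumed to be numbered 1 through 12, and a missing slip_indexed row is created on first success (the pseudocode only shows an update).

```ruby
# Claim one unclaimed queue entry, hand it to the block for indexing,
# then record success in slip_indexed or failure in slip_errors.
# Returns the htid that was processed, or nil if the queue had nothing free.
def index_one(slip_queue, slip_indexed, slip_errors, pid:, host:, shard_count: 12)
  htid, entry = slip_queue.find { |_id, e| e[:pid].nil? }
  return nil if htid.nil?

  # Claim the entry so other workers skip it.
  entry.merge!(pid: pid, host: host, proc_status: "indexing")

  # Shard 0 means "never indexed": pick one at random (assumed range 1..12).
  shard = entry[:shard].zero? ? rand(1..shard_count) : entry[:shard]

  status = yield(htid, shard)   # stand-in for building and posting the Solr doc
  slip_queue.delete(htid)

  if status == "ok"
    row = (slip_indexed[htid] ||= { indexed_ct: 0 })
    row.merge!(shard: shard, time: Time.now, indexed_ct: row[:indexed_ct] + 1)
  else
    slip_errors << { htid: htid, shard: shard, pid: pid, host: host,
                     time: Time.now, status: status }
  end
  htid
end

slip_queue = {
  "test.001" => { shard: 0, pid: nil, host: nil, proc_status: nil },
  "test.002" => { shard: 3, pid: nil, host: nil, proc_status: nil }
}
slip_indexed = {}
slip_errors = []

first  = index_one(slip_queue, slip_indexed, slip_errors, pid: 101, host: "worker-a") { |_h, _s| "ok" }
second = index_one(slip_queue, slip_indexed, slip_errors, pid: 101, host: "worker-a") { |_h, _s| "timeout" }
third  = index_one(slip_queue, slip_indexed, slip_errors, pid: 101, host: "worker-a") { |_h, _s| "ok" }
```

In a real database the select-then-update claim is not atomic, so two workers could grab the same row; a single conditional update (set pid where pid is NULL) or a locking read would be one way to close that gap.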