For each of these two JSON file formats, which encode the same OCR word-coordinate data for a single newspaper page:
- Open-ONI (58.9 kB)
- IIIF Annotation List (826.4 kB)
I ran a test in the Rails console, benchmarking the time needed to:
- load the source file into memory
- parse as JSON
- find nodes matching a search term ("October")
- format the coordinate data as a hash matching an oa:Annotation
- return an array of hashes that could be used for an IIIF Content Search response
# Benchmark.ms is provided by ActiveSupport, so it is available in the Rails console.
Benchmark.ms do
  # Load and parse the Open-ONI file: "coords" maps each word to a list of coordinate arrays.
  file = File.read("open-oni_OCR-word-coordinates.json")
  ocr = JSON.parse(file)

  # Case-insensitive search over the word keys.
  term = "October"
  term_for_match = term.downcase
  matches = ocr["coords"].select { |k, v| k.downcase =~ /#{term_for_match}/ }

  # Build one oa:Annotation hash per matching coordinate array.
  resources = []
  matches.each do |k, v|
    v.each do |coords_array|
      resource_hash = {}
      resource_hash["@id"] = "http://example.org/identifier/annotation/anno-line"
      resource_hash["@type"] = "oa:Annotation"
      resource_hash["motivation"] = "sc:painting"
      resource_hash["resource"] = { "@type" => "cnt:ContentAsText", "chars" => k }
      resource_hash["on"] = "http://example.org/identifier/canvas1#xywh=#{coords_array.join(',')}"
      resources << resource_hash
    end
  end
  resources
end
Total time (ms): 3.73
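Benchmark.ms do
  # Load and parse the IIIF annotation list.
  file = File.read("open-oni_OCR-word-coordinates_iiif-anno-list.json")
  ocr = JSON.parse(file)

  # Case-insensitive search over each annotation's text.
  term = "October"
  term_for_match = term.downcase
  matches = ocr["resources"].select { |anno| anno["resource"]["chars"].downcase =~ /#{term_for_match}/ }

  # The matches are already oa:Annotation hashes, so only the missing @id is filled in.
  matches.each do |resource|
    resource["@id"] = "http://example.org/identifier/annotation/anno-line"
  end
end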
Total time (ms): 10.08
I dug a little deeper into the time needed for each part of the processing. The biggest differences are in loading/parsing the file (Open-ONI: ~1.72ms; IIIF: ~6.33ms) and in searching for terms (Open-ONI: ~1.76ms; IIIF: ~4.73ms).
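A per-stage breakdown like the one above can be collected by timing each step separately. The following is only a minimal sketch, not the exact code I ran: the `time_ms` helper, the tiny in-memory sample data, and the temp file are all assumptions made so the snippet runs on its own, and a plain substring match stands in for the regex.

```ruby
require "benchmark"
require "json"
require "tempfile"

# Hypothetical helper: run a block and return its wall-clock time in ms.
def time_ms(&block)
  Benchmark.realtime(&block) * 1000
end

# Made-up stand-in for the Open-ONI file: a "coords" hash mapping each word
# to a list of [x, y, w, h] boxes.
sample = {
  "coords" => {
    "October" => [[10, 20, 30, 40]],
    "news"    => [[50, 60, 70, 80]]
  }
}

timings = {}
resources = nil

Tempfile.create(["ocr-sample", ".json"]) do |f|
  f.write(JSON.generate(sample))
  f.flush

  # Declared here so the values assigned inside the timed blocks stay visible.
  raw = ocr = matches = nil

  timings[:load]   = time_ms { raw = File.read(f.path) }
  timings[:parse]  = time_ms { ocr = JSON.parse(raw) }
  timings[:search] = time_ms do
    matches = ocr["coords"].select { |word, _boxes| word.downcase.include?("october") }
  end
  timings[:format] = time_ms do
    resources = matches.flat_map do |word, boxes|
      boxes.map do |box|
        {
          "@type" => "oa:Annotation",
          "resource" => { "@type" => "cnt:ContentAsText", "chars" => word },
          "on" => "http://example.org/identifier/canvas1#xywh=#{box.join(',')}"
        }
      end
    end
  end
end

timings.each { |stage, ms| puts "#{stage}: #{ms.round(3)} ms" }
```

With files this small, single runs are noisy, so it is probably worth repeating each stage several times and comparing averages rather than reading too much into one measurement.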
It's worth noting that the source IIIF file (created using NLW's ALTO-to-IIIF XSLT) was missing @id values for each resource in the annotation list. Adding these values would make the file a bit larger, and would probably increase the time needed to load/parse it.
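One rough way to gauge that size increase is to serialize an annotation list before and after adding @id values and compare the byte counts. This is a sketch under assumptions: the two annotations below are made up, and the `anno-#{i}` URI pattern is hypothetical, so the real growth depends on the identifiers actually used.

```ruby
require "json"

# Made-up annotation list shaped like the IIIF file, without "@id" values.
anno_list = {
  "@type" => "sc:AnnotationList",
  "resources" => [
    {
      "@type" => "oa:Annotation",
      "motivation" => "sc:painting",
      "resource" => { "@type" => "cnt:ContentAsText", "chars" => "October" },
      "on" => "http://example.org/identifier/canvas1#xywh=10,20,30,40"
    },
    {
      "@type" => "oa:Annotation",
      "motivation" => "sc:painting",
      "resource" => { "@type" => "cnt:ContentAsText", "chars" => "news" },
      "on" => "http://example.org/identifier/canvas1#xywh=50,60,70,80"
    }
  ]
}

before = JSON.generate(anno_list).bytesize

# Assumed URI pattern: one unique identifier per annotation.
anno_list["resources"].each_with_index do |anno, i|
  anno["@id"] = "http://example.org/identifier/annotation/anno-#{i}"
end

after = JSON.generate(anno_list).bytesize
per_anno = (after - before) / anno_list["resources"].length.to_f

puts "before: #{before} bytes, after: #{after} bytes"
puts "added per annotation: #{per_anno.round(1)} bytes"
```

Multiplying the per-annotation figure by the number of words on a page gives a ballpark for how much the full file would grow.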
Overall, while the Open-ONI format is about 2.7x faster (3.73ms vs. 10.08ms), the absolute difference of 6.35ms is not especially alarming.