@ebenenglish
Created March 7, 2018 21:21
Compares the performance of parsing OCR word-coordinate data in different JSON source formats

Overview

For each of these two JSON file formats, both encoding the same OCR word-coordinate data for a single newspaper page:

  1. Open-ONI (58.9 kB)
  2. IIIF Annotation List (826.4 kB)

I ran a test in the Rails console, benchmarking the time needed to:

  1. load the source file into memory
  2. parse as JSON
  3. find nodes matching a search term ("October")
  4. format the coordinate data as a hash matching an oa:Annotation
  5. return an array of hashes that could be used for an IIIF Content Search response
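To make the difference between the two formats concrete, here is a minimal sketch of the data shapes, inferred from the benchmark code in the sections that follow rather than copied from the real source files. The sample words, coordinates, and the method names `open_oni_hits` and `iiif_hits` are all hypothetical, and the match uses `String#include?` instead of the regex match used in the benchmarks.

```ruby
require "json"

# Hypothetical miniature samples; the real files cover a whole newspaper page.
# Open-ONI: one key per distinct word, each mapped to a list of [x, y, w, h] boxes.
OPEN_ONI_SAMPLE = JSON.parse(<<~JSON)
  {
    "coords": {
      "October": [[100, 200, 80, 20], [400, 900, 82, 21]],
      "Boston": [[50, 60, 70, 18]]
    }
  }
JSON

# IIIF Annotation List: one full annotation per word occurrence.
IIIF_SAMPLE = JSON.parse(<<~JSON)
  {
    "@type": "sc:AnnotationList",
    "resources": [
      { "@type": "oa:Annotation", "motivation": "sc:painting",
        "resource": { "@type": "cnt:ContentAsText", "chars": "October" },
        "on": "http://example.org/identifier/canvas1#xywh=100,200,80,20" },
      { "@type": "oa:Annotation", "motivation": "sc:painting",
        "resource": { "@type": "cnt:ContentAsText", "chars": "October" },
        "on": "http://example.org/identifier/canvas1#xywh=400,900,82,21" },
      { "@type": "oa:Annotation", "motivation": "sc:painting",
        "resource": { "@type": "cnt:ContentAsText", "chars": "Boston" },
        "on": "http://example.org/identifier/canvas1#xywh=50,60,70,18" }
    ]
  }
JSON

# Open-ONI search: walks the word keys, so repeated words are visited once.
def open_oni_hits(ocr, term)
  needle = term.downcase
  ocr["coords"].select { |word, _boxes| word.downcase.include?(needle) }
end

# IIIF search: walks every annotation, one per word occurrence.
def iiif_hits(anno_list, term)
  needle = term.downcase
  anno_list["resources"].select do |anno|
    anno["resource"]["chars"].downcase.include?(needle)
  end
end
```

Calling `open_oni_hits(OPEN_ONI_SAMPLE, "October")` returns a hash with a single key holding both boxes, while `iiif_hits(IIIF_SAMPLE, "October")` returns two separate annotation hashes. The per-occurrence repetition of `@type`, `motivation`, and the canvas URI is a plausible reason for the IIIF file being so much larger.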

1. Open-ONI

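# Benchmark.ms is provided by ActiveSupport and returns elapsed milliseconds
Benchmark.ms do
  # load the file and parse it as JSON
  file = File.read("open-oni_OCR-word-coordinates.json")
  ocr = JSON.parse(file)
  term = "October"
  term_for_match = term.downcase
  # ocr["coords"] maps each word (k) to an array of coordinate arrays (v)
  matches = ocr["coords"].select { |k,v| k.downcase =~ /#{term_for_match}/ }
  resources = []
  # build one oa:Annotation hash per occurrence of each matching word
  matches.each do |k,v|
    v.each do |coords_array|
      resource_hash = {}
      # placeholder @id; every annotation gets the same value in this test
      resource_hash["@id"] = "http://example.org/identifier/annotation/anno-line"
      resource_hash["@type"] = "oa:Annotation"
      resource_hash["motivation"] = "sc:painting"
      resource_hash["resource"] = { "@type" => "cnt:ContentAsText", "chars" => k}
      # the coordinates become the xywh fragment on the canvas URI
      resource_hash["on"] = "http://example.org/identifier/canvas1#xywh=#{coords_array.join(',')}"
      resources << resource_hash
    end
  end
  resources
end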

Total time (ms): 3.73

2. IIIF Annotation List

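# Benchmark.ms is provided by ActiveSupport and returns elapsed milliseconds
Benchmark.ms do
  # load the file and parse it as JSON
  file = File.read("open-oni_OCR-word-coordinates_iiif-anno-list.json")
  ocr = JSON.parse(file)
  term = "October"
  term_for_match = term.downcase
  # ocr["resources"] is an array with one annotation per word occurrence
  matches = ocr["resources"].select { |anno| anno["resource"]["chars"].downcase =~ /#{term_for_match}/ }
  # the entries are already annotation hashes, so only the missing @id is filled in
  matches.each do |resource|
    resource["@id"] = "http://example.org/identifier/annotation/anno-line"
  end
end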

Total time (ms): 10.08

Conclusion

I dug a little deeper into the time needed for each part of the processing. The biggest differences are in loading/parsing the file (Open-ONI: ~1.72ms; IIIF: ~6.33ms) and searching for terms (Open-ONI: ~1.76ms; IIIF: ~4.73ms).
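The exact method used for the per-stage numbers isn't shown here, but one way to get that kind of breakdown is to time each stage on its own. The following is a rough sketch using `Benchmark.realtime` from Ruby's standard library (so it also runs outside Rails); the helper names `timed_ms` and `stage_timings` are hypothetical.

```ruby
require "benchmark"
require "json"

# Runs the block and returns [block result, elapsed milliseconds].
def timed_ms
  result = nil
  seconds = Benchmark.realtime { result = yield }
  [result, seconds * 1000.0]
end

# Times reading, parsing and searching separately for one file.
# The block receives the parsed JSON and must return the collection of matches.
def stage_timings(path, &search)
  raw, read_ms = timed_ms { File.read(path) }
  ocr, parse_ms = timed_ms { JSON.parse(raw) }
  hits, search_ms = timed_ms { search.call(ocr) }
  { read_ms: read_ms, parse_ms: parse_ms, search_ms: search_ms, hits: hits.size }
end

# Example usage with the Open-ONI file from the benchmark above:
# stage_timings("open-oni_OCR-word-coordinates.json") do |ocr|
#   ocr["coords"].select { |word, _boxes| word.downcase =~ /october/ }
# end
```

A single run at this scale is noisy (disk caching alone can shift the read time), so repeating each measurement and comparing medians would give a more trustworthy picture than one pass.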

It's worth noting that the source IIIF file (created using NLW's ALTO-to-IIIF XSLT) was missing @id values for each resource in the annotation list. The addition of these values would make the file a bit larger in size, and probably increase the time needed to load/parse the file.
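As a rough way to estimate that size increase, the sketch below adds a distinct `@id` to every resource and reports how many bytes the serialized JSON grows by. The `anno-N` naming scheme, the `base_uri` parameter, and both method names are assumptions for illustration, not part of the NLW XSLT output.

```ruby
require "json"

# Returns a copy of the annotation list in which every resource has an "@id".
# An "@id" that is already present on a resource is kept as-is.
def with_annotation_ids(anno_list, base_uri)
  resources = anno_list["resources"].each_with_index.map do |anno, index|
    { "@id" => "#{base_uri}/anno-#{index + 1}" }.merge(anno)
  end
  anno_list.merge("resources" => resources)
end

# Bytes added to the compact serialized JSON by the new "@id" values.
def id_overhead_bytes(anno_list, base_uri)
  JSON.generate(with_annotation_ids(anno_list, base_uri)).bytesize -
    JSON.generate(anno_list).bytesize
end

# Example usage with the IIIF file from the benchmark above:
# list = JSON.parse(File.read("open-oni_OCR-word-coordinates_iiif-anno-list.json"))
# id_overhead_bytes(list, "http://example.org/identifier/annotation")
```

The overhead scales with the number of annotations and the length of the base URI, so a long URI prefix repeated for every word on the page adds up quickly.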

Overall, while Open-ONI is 2.7x faster, the total difference of 6.35ms is not especially alarming.
