- ImageMagick (C)
- kraken (Python)
- ocrobin (Python)
- ocrodeg (Python)
- ocropy (Python)
- ocropy2 (Python)
- ocrorot (Python)
- prima-image-lib (C++)
- unpaper (C)
convert -density 300 -depth 8 -alpha Off -limit area 1 foo.pdf foo_%04d.tif
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:my="foo.bar"
    exclude-result-prefixes="xs my saxon uuid"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    version="2.0"
    xmlns:uuid="java:java.util.UUID">
_jpageviewer_jar=~/opt/jpageviewer/JPageViewer.jar
if [ -e "$_jpageviewer_jar" ]; then
    jpageviewer() {
        # --resolve-dir defaults to the file's directory
        _jpageviewer_resolve_dir=$(dirname "$1")
        # ... unless a mets.xml file exists one directory up (OCR-D workspace)
        if [ -e "$_jpageviewer_resolve_dir"/../mets.xml ]; then
            _jpageviewer_resolve_dir="$_jpageviewer_resolve_dir"/..
        fi
# -*- coding:utf-8 -*-
from itertools import dropwhile
import json
import re
import sys
import matplotlib.pyplot as plt
FROM python:3.7
RUN apt-get update \
    && apt-get install -y --no-install-recommends libcairo2-dev libgtk-3-bin libgtk-3-dev libglib2.0-dev libgtksourceview-3.0-dev libgirepository1.0-dev gir1.2-webkit2-4.0 pkg-config cmake \
    && pip3 install -U setuptools --use-feature=2020-resolver \
    && pip3 install browse-ocrd --use-feature=2020-resolver
ENV GDK_BACKEND=broadway
ENV BROADWAY_DISPLAY=:5
To estimate the impact of some changes to dinglehopper, I used hyperfine to benchmark its behaviour.
The commands to run a Docker container that executes the benchmarks are listed in performance-docker.sh.
The commands needed to prepare, execute and analyse the benchmark are listed in performance.sh.
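The analysis step of such a benchmark could look roughly like the following. This is only a minimal sketch and not the content of performance.sh: the helper names `hyperfine_command`, `summarise` and `relative_speed` are hypothetical, and it assumes the benchmark was exported with hyperfine's `--export-json` option, whose output holds a `results` list with `command`, `mean` and `stddev` fields.

```python
import json


def hyperfine_command(commands, export_path, warmup=1, runs=5):
    """Build a hyperfine argument list (hypothetical helper).

    Uses the --warmup, --runs and --export-json options of hyperfine;
    each entry in `commands` is one shell command to benchmark.
    """
    args = ["hyperfine", "--warmup", str(warmup), "--runs", str(runs),
            "--export-json", export_path]
    args.extend(commands)
    return args


def summarise(export):
    """Return (command, mean, stddev) tuples, fastest command first."""
    rows = [(r["command"], r["mean"], r["stddev"]) for r in export["results"]]
    return sorted(rows, key=lambda row: row[1])


def relative_speed(export):
    """Map each command to its mean runtime relative to the fastest one."""
    rows = summarise(export)
    fastest = rows[0][1]
    return {command: mean / fastest for command, mean, _ in rows}


def load_export(path):
    """Read a JSON file written by `hyperfine --export-json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

To use it, one would run the list returned by `hyperfine_command` (for example with `subprocess.run(..., check=True)`), then pass the result of `load_export` to `relative_speed` to see how much slower each variant is than the fastest one.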
#!/usr/bin/env bash
# see https://ryanfb.github.io/etc/2015/03/16/automatic_evaluation_of_ocr_quality.html
# using https://github.com/saffsd/langid.py
# install with `pip install langid` and put the scorelines.sh & ocrquality.rb scripts from the blog entry in the same directory
# The PDF source files, whose names start with a DOI; adapt this for your case
FILE_SELECTOR=/path/to/source/dir/*.pdf
# The path to the directory to which the selected documents should be copied
TARGET=/path/to/target/dir
import {default as fetch} from 'node-fetch';
const { pdf } = require("pdf-to-img");
import {tmpdir} from "os";
import {createWriteStream, createReadStream} from 'fs';
import * as fsp from 'fs/promises';
import * as archiver from 'archiver';
import {ArchiverError} from "archiver";
import * as path from "path";
import {Parser, Builder} from "xml2js";
#!/usr/bin/perl
use strict;
use utf8;
use XML::LibXML;
use XML::Quote;
binmode STDOUT, ":utf8";
my $dom=XML::LibXML->load_xml(location=>$ARGV[0]);
my $root=$dom->documentElement;