DEVONthink 3 script to automatically OCR PDFs using a local install of OCRmyPDF (https://ocrmypdf.readthedocs.io/).
The easiest way to install it is with Homebrew:
brew install ocrmypdf
If you need languages other than English, install the additional language packs:
brew install tesseract-lang
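Before wiring this into DEVONthink, it can help to confirm that both binaries are found and that the language you need is actually installed. The snippet below is only a sketch: has_lang is a made-up helper name, and it assumes tesseract --list-langs prints one language code per line after a header line.

```shell
#!/bin/sh
# has_lang is a hypothetical helper (not part of OCRmyPDF or Tesseract).
# It reads the output of `tesseract --list-langs` on stdin and succeeds
# only if the given language code appears as a complete line.
has_lang() {
    grep -qxF -- "$1"
}

# Usage (requires the Homebrew install described above):
#   command -v ocrmypdf     # prints the path if ocrmypdf is installed
#   command -v tesseract    # prints the path if tesseract is installed
#   tesseract --list-langs 2>&1 | has_lang fra && echo "French is available"
```

The 2>&1 in the usage line is there because some Tesseract versions print the language list on stderr rather than stdout.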
- Set strExportPath to include the ocrmypdf binary path. Default value is valid for an install with Homebrew.
- Set the OCRmyPDF parameters you need in strCmd, specifically if you want to use other languages, e.g. for French: -l fra (you can get other language codes with tesseract --list-langs).
- Finally, copy the script file to ~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules (path for DEVONthink 3 beta).
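If you are not sure which directory to put in strExportPath, a small shell check can tell you where Homebrew placed the binary. This is a sketch, not part of the script: find_bin_dir is a made-up helper, /opt/homebrew/bin and /usr/local/bin are the default Homebrew locations on Apple Silicon and Intel Macs respectively, and input.pdf and output.pdf are placeholder file names.

```shell
#!/bin/sh
# find_bin_dir NAME DIR...
# Hypothetical helper: prints the first DIR that contains an executable NAME
# and returns non-zero if none of the directories do.
find_bin_dir() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Usage:
#   dir=$(find_bin_dir ocrmypdf /opt/homebrew/bin /usr/local/bin) && PATH="$dir:$PATH"
#   ocrmypdf -l fra input.pdf output.pdf    # same -l option you would put in strCmd
```

The directory it prints is the one to include in strExportPath. Trying the ocrmypdf line by hand first is a cheap way to validate the options before putting them in strCmd.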
Create a new smart rule by right-clicking in the sidebar > New Smart Rule...
- Name: PDFs without OCR
- Search in: Databases
- Search all:
  - Kind is PDF/PS
  - Extension is PDF document (required as some .AI files are recognized as PDF kind as well)
  - Word Count is 0
  - Tag is not ocr_error (this is how we automatically exclude files which couldn't be OCR'd for some reason)
  - Tag is not ocr_ignore (this is how we manually exclude files which we don't want to OCR)
- Perform the following actions: Daily (as I always keep DT open, this is how I make sure it's done automatically. But feel free to adapt it to your needs)
- Action: Execute Script - External - DT3 - Add OCR to PDF
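Read as a single condition, the rule selects a file only when every criterion holds at once. The sketch below restates that logic as a shell function purely for illustration: DEVONthink evaluates the smart rule itself, and needs_ocr and its arguments are made-up names.

```shell
#!/bin/sh
# needs_ocr EXT WORD_COUNT [TAG...]
# Hypothetical restatement of the smart rule: succeeds only for a .pdf file
# with a word count of 0 that carries neither exclusion tag.
needs_ocr() {
    ext=$1
    words=$2
    shift 2
    # The extension must be pdf (any case), which also rules out .ai files.
    case $ext in
        [Pp][Dd][Ff]) ;;
        *) return 1 ;;
    esac
    # A word count of 0 means no text was found in the document.
    [ "$words" -eq 0 ] || return 1
    # Either exclusion tag takes the file out of the queue.
    for tag in "$@"; do
        case $tag in
            ocr_error|ocr_ignore) return 1 ;;
        esac
    done
    return 0
}

# Usage:
#   needs_ocr pdf 0 invoices && echo "would be OCR'd"
#   needs_ocr pdf 0 ocr_ignore || echo "skipped"
```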
To list files which couldn't be OCR'd for some reason, create another smart rule:
- Name: OCR errors
- Search in: Databases
- Search all:
  - Tag is ocr_error
- Perform the following actions: Weekly
- Actions:
  - Bounce Dock Icon
  - Display Notification: Some PDFs cannot be OCR'd.
The first rule, PDFs without OCR, will show you all the PDF files which require OCR.
OCR will be triggered every day (early morning for me, when my laptop automatically wakes up to back up).
To bypass OCR for some files, add the tag ocr_ignore.
The second rule will show a weekly reminder when there are files waiting to be checked manually.
To get details about why OCR didn't succeed, try running the ocrmypdf command manually on the files.
One possible fix is to force OCR (try --redo-ocr first, before resorting to --force-ocr).
Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF) before OCR'ing it again.
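The escalation described above (plain run, then --redo-ocr, then --force-ocr) can be sketched as a small shell helper for manual troubleshooting. The function name try_ocr and the OCRMYPDF variable are assumptions made for this example; only the two options come from OCRmyPDF itself.

```shell
#!/bin/sh
# OCRMYPDF is a hypothetical override so the command can be swapped out;
# it defaults to the real ocrmypdf binary.
: "${OCRMYPDF:=ocrmypdf}"

# try_ocr IN OUT
# Tries a plain run first, then --redo-ocr, then --force-ocr.
# Prints the option that worked (or "default") and returns non-zero
# if every attempt fails.
try_ocr() {
    in=$1
    out=$2
    for opt in "" --redo-ocr --force-ocr; do
        # $opt is left unquoted on purpose so the empty case adds no argument.
        if "$OCRMYPDF" $opt "$in" "$out"; then
            printf '%s\n' "${opt:-default}"
            return 0
        fi
    done
    return 1
}

# Usage:
#   try_ocr broken.pdf fixed.pdf || echo "still failing, the PDF needs repair"
```

Writing to a separate output file keeps the original untouched while you experiment. The error messages ocrmypdf prints on each failed attempt are usually the quickest hint about what is wrong with the file.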
Another topic: here's my setup to automatically import files scanned from my phone (I use Scanbot on iOS).
DEVONthink - Import & Delete.scpt from the list. I believe this script is installed by DT through their Install Add-Ons > Additional Scripts option.