@julienma
Last active September 24, 2024 16:49
OCR PDFs in DEVONthink 3 with jbarlow83/OCRmyPDF

DEVONthink 3 script to automatically OCR PDFs using a local install of https://ocrmypdf.readthedocs.io/

Install OCRmyPDF

The easiest way is to use Homebrew:

brew install ocrmypdf

If you need languages other than English, install the additional language packs:

brew install tesseract-lang
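
Before pointing the script at a language, it can help to confirm that Tesseract actually has it. A minimal sketch, assuming a hypothetical helper named `has_tesseract_lang` (not part of Tesseract or OCRmyPDF) that reads the output of `tesseract --list-langs` on standard input:

```shell
# Hypothetical helper: succeeds if the language code in $1 appears
# as a whole line in the text read from standard input.
has_tesseract_lang() {
    grep -qxF -- "$1"
}

# Usage, assuming tesseract is on your PATH. The 2>&1 is there because
# some older Tesseract releases print the list on stderr:
#   tesseract --list-langs 2>&1 | has_tesseract_lang fra && echo "French is installed"
```

Matching whole lines keeps a code such as `fra` from being confused with a longer one that merely contains it.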

Customize the .scpt script

  • Set strExportPath so that PATH includes the directory containing the ocrmypdf binary. The default value, /opt/homebrew/bin, matches a Homebrew install on Apple Silicon; on Intel Macs Homebrew installs to /usr/local/bin instead.
  • Set the OCRmyPDF parameters you need in strCmd, in particular the language. The script as published uses -l fra (French); change it to match your documents (you can list the installed language codes with tesseract --list-langs).
  • Finally, copy the script file to ~/Library/Application Scripts/com.devon-technologies.think3/Smart Rules (path for DEVONthink 3 beta).
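
To decide which directory belongs in strExportPath, you can check where the ocrmypdf binary actually lives. A small sketch, assuming a hypothetical helper named `find_ocrmypdf_dir` and the two usual Homebrew prefixes (adjust the candidates if you installed OCRmyPDF another way):

```shell
# Hypothetical helper: print the first directory among the arguments
# that contains an executable named ocrmypdf; fail if none does.
find_ocrmypdf_dir() {
    for dir in "$@"; do
        if [ -x "$dir/ocrmypdf" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Usage (Apple Silicon prefix first, then Intel):
#   find_ocrmypdf_dir /opt/homebrew/bin /usr/local/bin
```

Whatever directory it prints is the one to put in front of `:$PATH` in strExportPath.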

Create smart rules in DEVONthink

Automatic OCR

Create a new smart rule by right-clicking in the sidebar > New Smart Rule...

  • Name: PDFs without OCR
  • Search in: Databases
  • Search all:
    • Kind is PDF/PS
    • Extension is PDF document (required as some .AI files are recognized as PDF kind as well)
    • Word Count is 0
    • Tag is not ocr_error (this is how we automatically exclude files which couldn't be OCR'd for some reason)
    • Tag is not ocr_ignore (this is how we manually exclude files which we don't want to OCR)
  • Perform the following actions: Daily (I keep DEVONthink open at all times, so a daily trigger is enough to run OCR automatically; adapt it to your needs)
  • Action: Execute Script - External - DT3 - Add OCR to PDF
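
Read together, the conditions above amount to a simple predicate. The sketch below restates them outside DEVONthink purely as an illustration; `needs_ocr` is a hypothetical name, the extension check is only an approximation of the "Extension is PDF document" condition, and the "Kind is PDF/PS" condition is DEVONthink-specific and not modelled:

```shell
# Hypothetical predicate mirroring the smart rule's conditions.
#   $1 = file name, $2 = word count, $3 = comma-separated tags
needs_ocr() {
    # Only files with a .pdf extension qualify (excludes e.g. .ai files).
    case $1 in
        *.pdf|*.PDF) ;;
        *) return 1 ;;
    esac
    # A word count of 0 means the PDF has no text layer yet.
    [ "$2" -eq 0 ] || return 1
    # Skip anything tagged as failed or manually excluded.
    case ",$3," in
        *,ocr_error,*|*,ocr_ignore,*) return 1 ;;
    esac
    return 0
}
```

For example, `needs_ocr scan.pdf 0 "invoices"` succeeds, while `needs_ocr scan.pdf 0 "invoices,ocr_ignore"` does not.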

OCR failures

To list files which couldn't be OCR'd for some reason, create another smart rule:

  • Name: OCR errors
  • Search in: Databases
  • Search all:
    • Tag is ocr_error
  • Perform the following actions: Weekly
  • Actions:
    • Bounce Dock Icon
    • Display Notification: Some PDFs cannot be OCR'd.

Usage

Automatic OCR

The first rule, PDFs without OCR, lists all the PDF files that still need OCR. OCR is triggered once a day (early morning for me, when my laptop wakes up automatically for its backup).

To skip OCR for specific files, add the tag ocr_ignore.

OCR failures

The second rule shows a weekly reminder whenever some files are waiting to be checked manually. To find out why OCR failed, run the ocrmypdf command manually on those files.

One possible fix is to try to force OCR (try first with --redo-ocr before doing --force-ocr). Otherwise you'll probably have to "fix" the PDF (e.g. extract each page individually and create a new PDF), before OCR'ing it again.
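
The escalation described above can be sketched as a small shell function. `retry_ocr` is a hypothetical name; `--redo-ocr` and `--force-ocr` are the OCRmyPDF options mentioned in the text. The sketch writes to a separate output file so the original stays untouched while you experiment, and it leaves out the other flags from the script, which may not all combine with `--redo-ocr`:

```shell
# Hypothetical helper: try --redo-ocr first, then --force-ocr.
#   $1 = input PDF, $2 = output PDF
retry_ocr() {
    input=$1
    output=$2
    for mode in --redo-ocr --force-ocr; do
        if ocrmypdf "$mode" "$input" "$output"; then
            printf 'OCR succeeded with %s\n' "$mode"
            return 0
        fi
        printf 'OCR failed with %s\n' "$mode" >&2
    done
    return 1
}

# Usage:
#   retry_ocr broken.pdf broken.ocr.pdf
```

`--force-ocr` rasterizes every page before running OCR, which is why it comes last.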

-- Script for DEVONthink 3
-- Run OCRmyPDF on PDFs without OCR
-- Requires https://github.com/jbarlow83/OCRmyPDF to be installed e.g. with brew
on performSmartRule(theRecords)
    tell application id "DNtp"
        -- Prepend the Homebrew bin directory so the shell can find ocrmypdf
        set strExportPath to "PATH=/opt/homebrew/bin:$PATH "
        set intRecordsCount to count of theRecords
        show progress indicator "Adding OCR to PDF..." steps intRecordsCount
        repeat with theRecord in theRecords
            try
                step progress indicator filename of theRecord as string
                -- quoted form protects paths that contain spaces or quotes
                set strRecordPath to quoted form of (path of theRecord as string)
                -- Input and output are the same path, so the PDF is OCR'd in place
                set strCmd to strExportPath & "ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean" & space & strRecordPath & space & strRecordPath
                do shell script strCmd
            on error error_message number error_number
                -- Tag the record so the "PDFs without OCR" rule skips it next time
                set tags of theRecord to (tags of theRecord) & "ocr_error"
                -- -128 is "user canceled"; only notify for other errors
                if the error_number is not -128 then display notification error_message with title "Error with OCR" subtitle (filename of theRecord as string)
            end try
        end repeat
        hide progress indicator
    end tell
end performSmartRule
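
To debug outside DEVONthink, the command that the script assembles for each record can be reproduced in a terminal. A sketch, assuming a hypothetical wrapper named `ocr_in_place`; the PATH prefix and the flags are copied from strExportPath and strCmd above, and the double quotes around `$1` play the role of AppleScript's `quoted form of`:

```shell
# Hypothetical wrapper reproducing the script's per-record command.
#   $1 = path to the PDF, used as both input and output (in-place OCR)
ocr_in_place() {
    PATH=/opt/homebrew/bin:$PATH \
        ocrmypdf --skip-text -l fra --rotate-pages --deskew --clean "$1" "$1"
}

# Usage:
#   ocr_in_place "/path/to/My Scan.pdf"
```

Running it by hand shows the full ocrmypdf error output, which the notification in DEVONthink may truncate. Note that `--clean` relies on unpaper being available and `-l fra` on the French language data being installed.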
@julienma (author) commented:

Another topic: here's my setup to automatically import files scanned from my phone (I use Scanbot on iOS).

  • Create an "inbox" folder on some cloud service (I use Boxcryptor on iCloud, so files are always kept private)
  • On the computer which runs DEVONthink, open this folder in Finder. Right-click > Services > Folder Actions Setup...
  • Enable Folder Actions, make sure your "inbox" folder is enabled, and add the script DEVONthink - Import & Delete.scpt from the list. I believe this script is installed by DT through their Install Add-Ons > Additional Scripts option.
