@Iksas
Last active November 7, 2024 14:27
Full-text search for PDFs

Fast PDF full-text search

The following setup can search the contents of 7000 .pdf files in 0.08 seconds on an i7-1260P (less than 12 microseconds per PDF).

To do so, each .pdf file is first converted to a .txt file, which is stored next to the .pdf (for example, report.pdf gets a report.pdf.txt). If a PDF yields no text, OCR is performed. For a large collection, this one-time conversion takes hours.

As soon as all .txt files are created, they can be quickly searched with ripgrep.

The solution is a bit hacky, but it's what I use at the moment. I'll probably benchmark it against ripgrep-all in the future, and maybe switch to that.

Install required utilities

E.g. on Debian:

sudo apt update
sudo apt install ripgrep poppler-utils ocrmypdf

The packages are also available on macOS through Homebrew (pdftotext is part of the poppler formula there). WSL can be used for Windows support.

Note: When using WSL2, access to the Windows file system is very slow (~100 times slower). Move the PDF files to the WSL2 file system to fix this.

Set up the indexing commands

Add the following aliases to the .zshrc or .bash_aliases file:

  • This abomination indexes the PDFs:

    alias indexpdfs="rg -0 --files . | rg -0 --null-data -i '\.pdf$' | sed -e \"s/'/\\'\\\\\'\\'/g\" | xargs -0 -I @@ bash -c \"test -e '@@.txt' || { pdftotext -nopgbrk '@@' '@@.txt'; test -s '@@.txt' || { ocrmypdf --force-ocr --output-type=none --sidecar '@@.txt' '@@' /dev/null && touch '@@'; }; }\""

  • This command can be used to list PDFs that are not indexed yet (for debugging):

    alias unindexedpdfs="rg -0 --files . | rg -0 --null-data -i '\.pdf$' | sed -e \"s/'/\\'\\\\\'\\'/g\" | xargs -0 -I @@ bash -c \"test -e '@@.txt' || echo '@@'\""

Note: The macOS version of xargs does not work with long file paths by default. Replace xargs -0 with xargs -0 -S 1000000 to fix this.
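
The indexing one-liner is dense, so here is a minimal sketch of the logic it applies to each file, written as a shell function. The name index_one_pdf is hypothetical and not part of the original setup; the pdftotext and ocrmypdf invocations are copied from the alias.

```shell
# Hypothetical helper: index a single PDF the same way the indexpdfs alias does.
# Usage: index_one_pdf path/to/file.pdf
index_one_pdf() {
    # Skip PDFs that already have a text file next to them.
    if [ -e "$1.txt" ]; then
        return 0
    fi

    # First try to extract an embedded text layer.
    pdftotext -nopgbrk "$1" "$1.txt"

    # An empty result suggests a scanned PDF, so fall back to OCR and
    # let ocrmypdf write the recognized text to the sidecar file.
    if [ ! -s "$1.txt" ]; then
        # As in the alias, touch the PDF after a successful OCR run.
        ocrmypdf --force-ocr --output-type=none --sidecar "$1.txt" "$1" /dev/null \
            && touch "$1"
    fi
}
```

This is meant for understanding the alias or for debugging a single problematic file, not as a replacement for the alias.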

Usage

Navigate to the folder containing the PDFs, and run the indexing command:

indexpdfs

This will create a .txt file for every .pdf in the current folder and its subfolders. All PDFs that are already indexed will be skipped, but the first pass will take a long time.

After indexing, ripgrep can be used to search as usual, e.g.:

rg -i -C3 "for example"

(-C3 displays the search results with three lines of context.)
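
Since the matches are reported for the .txt files, it can be handy to map them back to the PDFs they came from. The following is a sketch under the assumption that every index file is named like its PDF plus a .txt suffix, as produced by indexpdfs; the function name pdfs_matching is hypothetical.

```shell
# Hypothetical helper: print the PDFs whose extracted text matches a pattern.
# Usage: pdfs_matching "for example"
pdfs_matching() {
    # -l prints only the names of matching files; --iglob restricts the
    # search to the generated index files, ignoring the case of the extension.
    # sed then strips the trailing .txt to recover the PDF path.
    rg -i -l --iglob '*.pdf.txt' -- "$1" . | sed 's/\.txt$//'
}
```

The output is one PDF path per line, so it breaks on file names that contain newlines.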

Immediately after running indexpdfs, the command unindexedpdfs should display no files.
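
If the quoting in the alias ever gets in the way, the same check can be sketched with find alone. This assumes a find that supports -iname (GNU and BSD find both do); the function name list_unindexed_pdfs is hypothetical.

```shell
# Hypothetical helper: list PDFs below a folder that have no .txt file yet.
# Usage: list_unindexed_pdfs [folder]   (defaults to the current folder)
list_unindexed_pdfs() {
    # find passes the file names to the inner shell as arguments, so paths
    # containing spaces or quotes need no extra escaping.
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            [ -e "$pdf.txt" ] || printf "%s\n" "$pdf"
        done
    ' sh {} +
}
```

An empty result means every PDF has an index file. Unlike the ripgrep-based alias, find does not honor .gitignore files or skip hidden folders, so the two can disagree in such directories.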
