The following setup can search the contents of 7000 .pdf files in 0.08 seconds on an i7-1260P (less than 12 microseconds per PDF).
To do so, each .pdf file is first converted to a .txt file, which is stored next to the .pdf. If necessary, OCR is performed. This one-time conversion takes hours.
As soon as all .txt files are created, they can be quickly searched with ripgrep.
The solution is a bit hacky, but it's what I use at the moment. I'll probably benchmark it against ripgrep-all in the future, and maybe switch to that.
First, install the required packages, e.g. on Debian:
sudo apt update
sudo apt install ripgrep poppler-utils ocrmypdf
The packages are also available on macOS through homebrew. WSL can be used for Windows support.
Note: When using WSL2, access to the Windows file system is very slow (~100 times slower). Move the PDF files to the WSL2 file system to fix this.
Add the following aliases to the .zshrc or .bash_aliases file:
- This abomination indexes the PDFs:
alias indexpdfs="rg -0 --files . | rg -0 --null-data -i '\.pdf$' | sed -e \"s/'/\\'\\\\\'\\'/g\" | xargs -0 -I @@ bash -c \"test -e '@@.txt' || { pdftotext -nopgbrk '@@' '@@.txt'; test -s '@@.txt' || { ocrmypdf --force-ocr --output-type=none --sidecar '@@.txt' '@@' /dev/null && touch '@@'; }; }\""
- This command can be used to list PDFs that are not indexed yet (for debugging):
alias unindexedpdfs="rg -0 --files . | rg -0 --null-data -i '\.pdf$' | sed -e \"s/'/\\'\\\\\'\\'/g\" | xargs -0 -I @@ bash -c \"test -e '@@.txt' || echo '@@'\""
Note: The macOS version of xargs does not work with long file paths by default. Replace xargs -0 with xargs -0 -S 1000000 to fix this.
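For readability, the indexing alias can also be written as a shell function. The following is a minimal sketch, not a drop-in replacement: index_pdfs is a hypothetical name, and it walks the tree with find instead of rg --files, so unlike the alias it does not skip hidden files or files excluded by .gitignore. The pdftotext and ocrmypdf invocations are copied unchanged from the alias.

```shell
# Sketch of the indexing step as a function (index_pdfs is a hypothetical name).
# Usage: index_pdfs [DIR]   (DIR defaults to the current folder)
index_pdfs() {
    # find passes the file names as arguments to the inner script, so quotes,
    # spaces and other unusual characters need no escaping.
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for pdf do
            txt="$pdf.txt"
            # Skip PDFs that already have a text file next to them.
            [ -e "$txt" ] && continue
            # First try to extract an embedded text layer.
            pdftotext -nopgbrk "$pdf" "$txt"
            # An empty or missing result means no usable text layer: fall back to OCR.
            if [ ! -s "$txt" ]; then
                ocrmypdf --force-ocr --output-type=none --sidecar "$txt" "$pdf" /dev/null \
                    && touch "$pdf"
            fi
        done
    ' sh {} +
}
```

Because the file names are never pasted into a command string, this variant should not need the sed quoting trick or the macOS xargs workaround.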
Navigate to the folder containing the PDFs, and run the indexing command:
indexpdfs
This will create a .txt file for every .pdf in the current folder and its subfolders. All PDFs that are already indexed will be skipped, but the first pass will take a long time.
After indexing, ripgrep can be used to search as usual, e.g.:
rg -i -C3 "for example"
(-C3 displays the search results with three lines of context.)
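Since the hits are reported for the .txt files, it can be handy to map them back to the PDFs they came from. A minimal sketch, assuming the .pdf.txt naming produced by the indexing step (pdf_hits is a hypothetical helper name; -l and --iglob are standard ripgrep options):

```shell
# List the PDFs whose extracted text matches a pattern.
# Usage: pdf_hits PATTERN [DIR]   (DIR defaults to the current folder)
pdf_hits() {
    # -l prints only the names of matching files; --iglob limits the search
    # to the generated text files, ignoring case (.pdf.txt and .PDF.txt).
    # sed then strips the trailing .txt to recover the PDF path.
    rg -i -l --iglob '*.pdf.txt' -- "$1" "${2:-.}" | sed 's/\.txt$//'
}
```

The output is line-based, so file names containing newlines would be split across lines.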
Immediately after running indexpdfs, the command unindexedpdfs should display no files.
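The same check can be sketched without the alias. This is only an illustration under the assumption that every PDF should have a .txt file next to it; unindexed_pdfs is a hypothetical name, and find is used instead of rg --files, so hidden and git-ignored files are included as well.

```shell
# Print every PDF that has no text file next to it yet.
# Usage: unindexed_pdfs [DIR]   (DIR defaults to the current folder)
unindexed_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for pdf do
            # A PDF counts as indexed once "<name>.pdf.txt" exists.
            [ -e "$pdf.txt" ] || printf "%s\n" "$pdf"
        done
    ' sh {} +
}
```

An empty result means every PDF under the folder has been indexed.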