Diffing PDFs

Recently, I wanted to find the textual differences between two PDFs, in the same way that you would compare plain text files with Git/GitHub. I wanted the nice side-by-side view too, not just the Git diff terminal output.

Edit: After writing this gist, I realized that https://www.diffchecker.com/pdf-compare/ pretty much does what I want. This was still a fun experiment though.

I tried a few ways of extracting the text (Tesseract OCR, and copying and pasting all the text by hand), but the best solution for me turned out to be a tool called textract, which uses pdftotext under the hood. It did the best job: unlike OCR it didn't misread symbols, and unlike copy and paste it didn't pick up extra nonsense.

For my two PDF files, I ran textract to output the extracted text to text files:

textract file1.pdf > file1.txt
textract file2.pdf > file2.txt
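
If you have more than two PDFs, the same extraction step can be wrapped in a small helper. This is only a sketch: the function name extract_all is made up for illustration, and it assumes textract is on your PATH and prints the extracted text to stdout, as in the commands above.

```shell
# extract_all (hypothetical helper): run textract on each PDF passed in
# and write the text to a .txt file with the same base name.
extract_all() {
  for pdf in "$@"; do
    out="${pdf%.pdf}.txt"   # file1.pdf -> file1.txt
    if ! textract "$pdf" > "$out"; then
      # Stop at the first failure; a partial .txt file may be left behind.
      echo "extraction failed: $pdf" >&2
      return 1
    fi
  done
}

# Usage: extract_all file1.pdf file2.pdf
```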

If these files were small, I could have just used GitHub to make a diff (committing one file to a repo/gist, then committing the other under the same file name, and comparing the commits). However, my files were pretty big, so I needed a different tool to generate a nice diff.

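First, I used git diff to make a diff file (the --no-index flag makes Git compare the two paths directly, even if they sit untracked inside a repository):

git diff --no-index file1.txt file2.txt > file1_file2.diff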
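
One thing to watch for if you script this step: when git diff compares two plain files with --no-index, it exits with status 1 if the files differ, which can look like a failure. Here is a rough sketch of handling that, using two tiny made-up files in a temporary directory in place of the real extracted text. The fallback to diff -u when git is missing is my own addition, not part of the workflow above.

```shell
# Demo files standing in for file1.txt and file2.txt (contents are made up).
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/file1.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/file2.txt"
diff_file="$workdir/file1_file2.diff"

# Capture the exit status instead of treating non-zero as fatal:
# 0 = identical, 1 = differences found, anything else = a real error.
status=0
if command -v git >/dev/null 2>&1; then
  git diff --no-index "$workdir/file1.txt" "$workdir/file2.txt" > "$diff_file" || status=$?
else
  # Assumption: plain diff -u is an acceptable stand-in; it uses the same
  # 0/1 exit status convention and also writes a unified diff.
  diff -u "$workdir/file1.txt" "$workdir/file2.txt" > "$diff_file" || status=$?
fi

case "$status" in
  0) echo "no differences" ;;
  1) echo "differences written to $diff_file" ;;
  *) echo "diff failed with status $status" >&2 ;;
esac
```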

Finally, I installed diff2html-cli using npm (npm install -g diff2html-cli) and generated a nice side-by-side diff viewable in the browser:

diff2html -s side -i file -- file1_file2.diff

The -s flag lets you choose a style, in this case side-by-side (side) instead of the default line-by-line unified view (line), and the -i file flag tells diff2html to read the provided .diff file instead of running git diff itself.
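
If you would rather keep the result as an HTML file than have it open in the browser preview, diff2html-cli also has a -F option that sends the output to a file. A minimal sketch, assuming diff2html is installed globally as above; the function name render_diff and the default output name are just for illustration.

```shell
# render_diff (hypothetical helper): turn a .diff file into a side-by-side
# HTML page. The second argument is optional; by default the output name
# is the diff's name with .html in place of .diff.
render_diff() {
  diff_file=$1
  html_file=${2:-"${diff_file%.diff}.html"}
  diff2html -s side -i file -F "$html_file" -- "$diff_file"
}

# Usage: render_diff file1_file2.diff
#        render_diff file1_file2.diff comparison.html
```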

That's all it took!

