@kevinlinxc
Last active September 10, 2023 04:37
How to diff 2 PDFs

Diffing PDFs

Recently, I wanted to find the textual differences between two PDFs, in the same way that you would compare plain-text files with Git/GitHub. I wanted the nice side-by-side view too, not just the Git diff terminal output.

Edit: After writing this gist, I realized that https://www.diffchecker.com/pdf-compare/ pretty much does what I want. This was still a fun experiment though.

I tried a few different ways of extracting the text (Tesseract OCR, copying and pasting all of the text), but the best solution for me turned out to be a tool called textract, which uses pdftotext under the hood. It did the best job: it didn't misread symbols the way OCR did, and it didn't pick up the extra nonsense that copy-pasting did.

For my two PDF files, I ran textract and redirected the extracted text to text files:

textract file1.pdf > file1.txt
textract file2.pdf > file2.txt

If these files were small, I could have just used GitHub to make a diff (committing one file to a repo/gist, then committing the other under the same file name, and comparing the commits). However, my files were pretty big, so I needed a different tool to generate a nice diff.

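First, I used git diff to make a diff file (if you run this inside an existing Git repository, add the --no-index flag so that Git compares the two files directly instead of treating them as paths in the repo):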

git diff file1.txt file2.txt > file1_file2.diff
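
If Git isn't available, a similar unified diff can be produced with Python's standard difflib module. This is only a minimal sketch, not what I actually ran: the function name unified_diff_text and its parameters are made up for illustration, and the output lacks the "diff --git" header that git diff writes, so I can't promise every downstream tool treats it identically.

```python
import difflib


def unified_diff_text(old_text: str, new_text: str,
                      old_name: str = "file1.txt",
                      new_name: str = "file2.txt") -> str:
    """Return a unified diff of two strings (hypothetical helper).

    The result resembles `git diff` output but has no `diff --git` header.
    """
    # keepends=True preserves newlines so the joined diff is well-formed.
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff_lines = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_name, tofile=new_name
    )
    # An empty string means the two texts are identical.
    return "".join(diff_lines)
```

To use it on the extracted text, read file1.txt and file2.txt into strings (for example with pathlib.Path.read_text), pass them in, and write the result to file1_file2.diff.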

Finally, I installed diff2html-cli using npm (npm install -g diff2html-cli) and generated a nice side-by-side diff viewable in the browser:

diff2html -s side -i file -- file1_file2.diff

The -s flag selects the output style (here side-by-side rather than the line-by-line unified view), and -i file tells diff2html to read its input from the provided .diff file.
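
For comparison, Python's standard library can also render a side-by-side HTML view without npm. The sketch below uses difflib.HtmlDiff; the helper name side_by_side_html and its defaults are assumptions for illustration, and the result is plainer than what diff2html produces.

```python
import difflib


def side_by_side_html(old_text: str, new_text: str,
                      old_name: str = "file1.txt",
                      new_name: str = "file2.txt",
                      context_lines: int = 3) -> str:
    """Return a standalone HTML page with a side-by-side diff (hypothetical helper)."""
    # wrapcolumn keeps long extracted lines from stretching the table.
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc=old_name,
        todesc=new_name,
        context=True,            # show only changed regions plus context
        numlines=context_lines,  # lines of context around each change
    )
```

Writing the returned string to an .html file and opening it in a browser gives a two-column table, with the old text on the left and the new text on the right.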

That's all it took!
