Someone who was in my PDF text extraction session at NICAR 2020 asked how to tell image PDFs from text PDFs when you have thousands of files in a mixture of formats, with the end goal of running OCR software only on the image PDFs.
This is how I would approach the problem using command-line tools.
- You’re working on a Mac or Linux machine where you have access to some common command-line utilities such as find and sed.
  - This should also work under the Windows Subsystem for Linux.
- You have pdftotext installed, which we used in the NICAR session.
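If you're not sure whether those tools are available, a quick check along these lines can tell you before you start. This is only a sketch: missing_tools is a helper name I made up for illustration, and it prints nothing when everything is found.

```shell
# Print the name of every listed tool that is not on the PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the three tools this walkthrough relies on.
missing_tools find sed pdftotext
```

On most systems pdftotext comes from the poppler or xpdf packages, so that's the one most likely to show up as missing.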
- Run pdftotext on all the PDFs (with the help of find) to try to extract text.
- Inspect the file sizes of a known image PDF to determine a good size threshold for the text files. Common sense tells us image PDFs have no text to extract (without OCR, that is), so the output of pdftotext should create text files that are very small, only a few bytes.
- Use find again to identify the text files that are really small.
- Replace .txt with .pdf to get the original filename.
For each PDF file, run pdftotext on it and save the output to a .txt file:
You can leverage the command find with its -exec option to do this.
find . -iname '*.pdf' -exec pdftotext {} \;
Let’s break this command down:
- find: Name of the command.
- .: Search starting in the current folder.
- -iname '*.pdf': Match files ending in .pdf or .PDF (or .pDF, etc., for that matter).
- -exec pdftotext {} \;: For each matching file, run pdftotext on that file. The {} is a placeholder that gets replaced with the matching filename, and \; marks the end of the command to run.
This post has more info on using find with -exec: https://linuxaria.com/howto/linux-shell-how-to-use-the-exec-option-in-find-with-examples.
This will create .txt files for all PDF files in the current directory (and subdirectories) with the text contents. The .txt files correspond to the name of the PDF file but ending in .txt instead of .pdf.
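If you'd like to see what find is going to run before it touches thousands of files, one option is to put echo in front of the command so it only prints each invocation. A minimal sketch; preview_extraction is a name I'm inventing here, and the optional argument is the folder to search (defaulting to the current one):

```shell
# Dry run: print the pdftotext command for each matching PDF
# instead of actually running it.
# (preview_extraction is a hypothetical helper name.)
preview_extraction() {
  find "${1:-.}" -iname '*.pdf' -exec echo pdftotext {} \;
}

preview_extraction .
```

Each printed line shows the filename that would be substituted for {}. The printed lines aren't quoted, so they're for eyeballing, not for pasting back into the shell.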
Then look at the text output for a file that you know to be a text PDF and one that you know to be an image PDF.
For example, this is a text PDF:
ls -lh Public\ Health\ Spending\ Brief_2019\ \(1\).txt
Let’s break down this command:
- ls: The command name. This just lists a file or files in a directory.
- -l: Show additional information like file size and timestamp.
- -h: Print numbers in human-readable forms. This is particularly important to be able to differentiate between file sizes that are bytes vs. kilobytes vs. gigabytes.
The output:
-rw-r--r-- 1 ghing 1248616752 11K Mar 23 12:39 Public Health Spending Brief_2019 (1).txt
This is an image PDF:
ls -lh Screen\ Shot\ 2020-03-23\ at\ 12.29.57\ PM.png.txt
The output:
-rw-r--r-- 1 ghing 1248616752 1B Mar 23 12:39 Screen Shot 2020-03-23 at 12.29.57 PM.png.txt
You’ll notice that the extracted text for the text PDF is much larger (11K) than the one for the image PDF (1B).
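Checking files one at a time with ls gets tedious, so here is a rough sketch of how you might list exact byte counts for every extracted text file at once, smallest first. It uses wc -c rather than ls so the sizes are plain numbers; txt_sizes is a made-up helper name.

```shell
# Print "<bytes><TAB><path>" for every .txt file under a folder,
# sorted so the smallest files come first.
# (txt_sizes is a hypothetical helper name.)
txt_sizes() {
  find "${1:-.}" -iname '*.txt' -exec sh -c '
    for f in "$@"; do
      # tr strips the padding some versions of wc add around the count
      printf "%s\t%s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
    done
  ' sh {} + | sort -n
}

txt_sizes .
```

Scanning the top of that list for a few PDFs you already know to be scans should give you a feel for where to put the size threshold.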
So, we can once again use find to identify all text files (extracted using pdftotext) that are smaller than a certain size. You might have to tweak the size parameter. I’m kind of arbitrarily searching for files smaller than two bytes:
find . -iname '*.txt' -size -2c
Let’s break down that command:
- find: The name of the command.
- .: Search starting in the current folder.
- -iname '*.txt': Find files that end in .txt or .TXT. -iname means case-insensitive. -name does the same thing but is case sensitive.
- -size -2c: In addition to the name matching, matching files must be smaller than 2 bytes. The c specifies that the unit is bytes, which is kind of counterintuitive. See https://www.ostechnix.com/find-files-bigger-smaller-x-size-linux/ for more on the unit codes.
The output is just the image PDF:
./Screen Shot 2020-03-23 at 12.29.57 PM.png.txt
So, imagining doing this for a whole directory, you’ll get a list of only the files that are likely to contain only scanned images.
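To get a feel for how the threshold behaves, you can try it on some throwaway files. This sketch rests on an assumption of mine: that the single byte in the image PDF's output is a form feed that pdftotext writes at the end of a page, in which case a three-page scan would produce about three bytes and slip past a 2-byte cutoff. The files below are fakes created with printf, and small_txt is a hypothetical helper.

```shell
# Build a scratch folder of fake "extracted text" files of known sizes.
demo=$(mktemp -d)
printf ''            > "$demo/empty.txt"        # 0 bytes
printf '\f'          > "$demo/one_page.txt"     # 1 byte (a lone form feed)
printf '\f\f\f'      > "$demo/three_pages.txt"  # 3 bytes
printf 'hello world' > "$demo/text.txt"         # 11 bytes

# List .txt files under folder $1 that are smaller than $2 bytes.
# (small_txt is a hypothetical helper name.)
small_txt() {
  find "$1" -iname '*.txt' -size "-${2}c"
}

small_txt "$demo" 2   # matches empty.txt and one_page.txt
small_txt "$demo" 4   # also matches three_pages.txt
```

If that assumption holds for your files, raising the cutoff to a handful of bytes is one way to catch multi-page scans. pdftotext also has a -nopgbrk option that leaves the page breaks out, though I'd check your version's man page before relying on it.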
Swap out .txt for .pdf and you’ll have a list of the PDF files.
We can actually pipe the previous find command through sed in order to replace .txt with .pdf:
find . -iname '*.txt' -size -2c | sed 's/\.txt$/.pdf/'
Note this will be a little wonky if some of your files end in .PDF instead of .pdf, because sed always writes a lowercase .pdf. There are a number of ways you can work around this, but that's beyond the scope right now.
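For the curious, here is one possible workaround, sketched under an assumption: that each text file sits next to its PDF with the same base name and a .txt extension. Instead of renaming .txt back to .pdf, it starts from the PDFs themselves and checks the size of the matching text file, so the original capitalization is preserved. list_image_pdfs and its two optional arguments (folder, byte threshold) are names I made up.

```shell
# List PDFs whose sibling .txt file is smaller than a byte threshold.
# Starts from the PDF names, so .pdf, .PDF, etc. are printed as they
# really are on disk. PDFs with no sibling .txt are skipped.
# (list_image_pdfs is a hypothetical helper name.)
list_image_pdfs() {
  dir=${1:-.}
  max_bytes=${2:-2}
  find "$dir" -iname '*.pdf' -exec sh -c '
    max_bytes=$1; shift
    for pdf in "$@"; do
      txt="${pdf%.*}.txt"            # swap the last extension for .txt
      [ -f "$txt" ] || continue      # no extracted text for this PDF
      size=$(wc -c < "$txt" | tr -d " ")
      if [ "$size" -lt "$max_bytes" ]; then
        printf "%s\n" "$pdf"
      fi
    done
  ' sh "$max_bytes" {} +
}

list_image_pdfs . 2
```

The output is a list of PDF paths, one per line, which you could then feed to whatever OCR tool you're using.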