Skip to content

Instantly share code, notes, and snippets.

@takahashilabo
Last active August 29, 2015 14:02
Show Gist options
  • Save takahashilabo/36029133c1a4c43fc376 to your computer and use it in GitHub Desktop.
Save takahashilabo/36029133c1a4c43fc376 to your computer and use it in GitHub Desktop.
This shell script file for finding the pdf files which probably include no OCRed text.
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Usage: findNoTextPdf.sh path"
exit 1
fi
find $1 -type f -name "*.pdf" | while read myfile
do
pdftotext -l 1 "$myfile" tmp
status=`hexdump -ve '1/1 "%.2x"' tmp | head -c 2`
if [ $status == "0c" ]; then
echo $myfile
fi
done
rm tmp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment