@djudd
Created August 4, 2014 05:19
Extract text from arbitrary files
#!/bin/bash
# Extract plain text from a file, choosing a converter by MIME type.
if [ -z "$2" ]; then
    echo "Usage: ./convert_to_text.sh in.pdf out.txt" >&2
    exit 1
fi

INPUT=$1
OUTPUT=$2
TYPE=$(file -b --mime-type "$INPUT")

# True if the named command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Bound each converter: send TERM after 4s, then KILL 4.5s later if it is still running.
# gtimeout is the name GNU timeout gets when coreutils is installed with a "g" prefix.
if have timeout; then
    TIMEOUT=(timeout -k 4.5 4)
elif have gtimeout; then
    TIMEOUT=(gtimeout -k 4.5 4)
else
    TIMEOUT=()
fi

# file occasionally misidentifies plain text as Pascal source when certain keywords are present :(
if [[ "$TYPE" == "text/plain" || "$TYPE" == "text/x-pascal" ]]; then
    cp "$INPUT" "$OUTPUT"
elif [[ "$TYPE" == "application/pdf" ]] && have pdftotext; then
    "${TIMEOUT[@]}" pdftotext -q -enc UTF-8 "$INPUT" "$OUTPUT"
elif [[ "$TYPE" == "application/postscript" ]] && have ps2ascii; then
    "${TIMEOUT[@]}" ps2ascii "$INPUT" "$OUTPUT"
else
    # Anything else (or a missing converter): try abiword first, then fall back
    # to unoconv if abiword is unavailable or left no output file.
    if have abiword; then
        "${TIMEOUT[@]}" abiword -t txt "$INPUT" -o "$OUTPUT"
    fi
    if [[ ! -f "$OUTPUT" ]] && have unoconv; then
        "${TIMEOUT[@]}" unoconv -f txt -o "$OUTPUT" "$INPUT"
    fi
fi
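
The dispatch in the script above can also be read as a lookup from MIME type to first-choice converter. The following is only a minimal sketch of that mapping: `converter_for` is a hypothetical helper that is not part of the original script, and it deliberately ignores whether the chosen tool is installed and the later fallback to unoconv.

```shell
# Hypothetical helper (not in the original script): print the name of the
# converter the script would try first for a given MIME type.
# Assumption: availability checks and the unoconv fallback are left out.
converter_for() {
    case "$1" in
        # file sometimes reports plain text as Pascal source, so treat both as text.
        text/plain|text/x-pascal) echo "cp" ;;
        application/pdf)          echo "pdftotext" ;;
        application/postscript)   echo "ps2ascii" ;;
        # Everything else goes to the general-purpose document converter.
        *)                        echo "abiword" ;;
    esac
}
```

It could be called as `converter_for "$(file -b --mime-type some_file)"` to preview which branch a given input would take before running the conversion itself.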