@dentarg
Created July 11, 2012 13:39
Convert a PDF to text, HTML and XML with various tools
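#!/usr/bin/env ruby
# Extract the text of every page in a PDF with the pdf-reader gem.
# Usage: ./pdf-reader.rb input.pdf output.txt
require 'pdf-reader'

infile  = ARGV[0]
outfile = ARGV[1]

reader = PDF::Reader.new(infile)
text = ""
reader.pages.each do |page|
  text << page.text
end

File.open(outfile, 'w') do |file|
  file.write(text)
end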
#!/bin/sh
tdocsplit="/usr/local/Cellar/poppler/0.20.0/bin/pdftotext"
txpdf="/usr/local/Cellar/xpdf/3.03/bin/pdftotext"
tpdfreader="./pdf-reader.rb"
hpdftohtml="/usr/local/Cellar/pdftohtml/0.40a/bin/pdftohtml"
hdocsplit="/usr/local/Cellar/poppler/0.20.0/bin/pdftohtml"
# Copy $1 to $2, dropping lines that are empty or contain only whitespace.
rm_blank_lines() {
    _infile=$1
    _outfile=$2
    sed '/^[[:space:]]*$/d' "$_infile" > "$_outfile"
}
# Run a PDF-to-text command on $INFILE and store the output without blank lines.
totext() {
    _command=$1
    # Field 5 of a Homebrew Cellar path is the package name (e.g. "poppler").
    _cmdname=`echo "$_command" | cut -f 5 -d "/"`
    if [ -z "$_cmdname" ]; then
        _cmdname="pdfreader"
    fi
    _infile=$INFILE
    _infilename=`echo "$_infile" | cut -f 1 -d .`
    _filename="${_infilename}_${_cmdname}.txt"
    _tmpfile="$TEMP_DIR/${_filename}"
    _outfile="$OUTPUT_DIR/${_filename}"
    $_command "$_infile" "$_tmpfile"
    rm_blank_lines "$_tmpfile" "$_outfile"
}
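# Run a pdftohtml binary on $INFILE in HTML mode and store the output without blank lines.
tohtml() {
    # $_command is expanded unquoted below so that the flags split into separate arguments.
    _command="$1 -noframes -i -c"
    _cmdname=`echo "$1" | cut -f 5 -d "/"`
    _infile=$INFILE
    _infilename=`echo "$_infile" | cut -f 1 -d .`
    _filename="${_infilename}_${_cmdname}.html"
    _tmpfile="$TEMP_DIR/${_filename}"
    _outfile="$OUTPUT_DIR/${_filename}"
    $_command "$_infile" "$_tmpfile"
    rm_blank_lines "$_tmpfile" "$_outfile"
}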
# Run a pdftohtml binary on $INFILE in XML mode and store the output without blank lines.
toxml() {
    _command="$1 -xml -i"
    _cmdname=`echo "$1" | cut -f 5 -d "/"`
    _infile=$INFILE
    _infilename=`echo "$_infile" | cut -f 1 -d .`
    _filename="${_infilename}_${_cmdname}.xml"
    # The standalone pdftohtml package gets the output name without ".xml";
    # the extension is added back when its result is read below.
    if [ "$_cmdname" = "pdftohtml" ]; then
        _filename="${_infilename}_${_cmdname}"
    fi
    _tmpfile="$TEMP_DIR/${_filename}"
    _outfile="$OUTPUT_DIR/${_filename}"
    $_command "$_infile" "$_tmpfile"
    if [ "$_cmdname" = "pdftohtml" ]; then
        rm_blank_lines "${_tmpfile}.xml" "${_outfile}.xml"
    else
        rm_blank_lines "$_tmpfile" "$_outfile"
    fi
}
INFILE=$1
OUTPUT_DIR="compare"
TEMP_DIR="${OUTPUT_DIR}/tmp"
if [ -z "$INFILE" ]; then
    echo "Usage: $0 file.pdf"
    exit 1
fi
# Create the output and temp directories if they do not exist yet.
if [ ! -d "$TEMP_DIR" ]; then
    mkdir -p "$TEMP_DIR"
fi
totext $tdocsplit
totext $txpdf
totext $tpdfreader
tohtml $hpdftohtml
tohtml $hdocsplit
toxml $hpdftohtml
toxml $hdocsplit
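
The output file names above depend on `cut -f 5 -d "/"` picking the Homebrew package name out of a Cellar path. Here is a minimal sketch of that step on its own. The `tool_label` helper name is hypothetical and not part of the script; it only mirrors what `totext` does inline, including the fallback to `pdfreader` for paths with fewer than five `/`-separated fields.

```shell
#!/bin/sh
# tool_label is a hypothetical helper used for illustration only.
# It prints the 5th "/"-separated field of a path, or "pdfreader" if that field is empty.
tool_label() {
    _label=$(printf '%s\n' "$1" | cut -f 5 -d "/")
    if [ -z "$_label" ]; then
        _label="pdfreader"
    fi
    printf '%s\n' "$_label"
}

# Field 1 is the empty string before the leading "/", so field 5 is the package name.
tool_label "/usr/local/Cellar/poppler/0.20.0/bin/pdftotext"   # prints "poppler"
tool_label "/usr/local/Cellar/xpdf/3.03/bin/pdftotext"        # prints "xpdf"
tool_label "./pdf-reader.rb"                                  # prints "pdfreader"
```

Because the label comes from a fixed field position, a binary installed outside `/usr/local/Cellar/<package>/...` would get a different or empty label, so the paths at the top of the script need to keep that layout.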