Extract data from pdf with poppler
=begin
[REFERENCE](https://linux.die.net/man/1/pdftotext)
$ brew install poppler
> `pdftotext -h`
pdftotext version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -r <fp>        : resolution, in DPI (default is 72)
  -x <int>       : x-coordinate of the crop area top left corner
  -y <int>       : y-coordinate of the crop area top left corner
  -W <int>       : width of crop area in pixels (default is 0)
  -H <int>       : height of crop area in pixels (default is 0)
  -layout        : maintain original physical layout
  -fixed <fp>    : assume fixed-pitch (or tabular) text
  -raw           : keep strings in content stream order
  -htmlmeta      : generate a simple HTML file, including the meta information
  -enc <string>  : output text encoding name
  -listenc       : list available encodings
  -eol <string>  : output end-of-line convention (unix, dos, or mac)
  -nopgbrk       : don't insert page breaks between pages
  -bbox          : output bounding box for each word and page size to html. Sets -htmlmeta
  -bbox-layout   : like -bbox but with extra layout bounding box data. Sets -htmlmeta
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -q             : don't print any messages or errors
  -v             : print copyright and version info
  -h             : print usage information
  -help          : print usage information
  --help         : print usage information
  -?             : print usage information
=end
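require 'shellwords' # provides Shellwords.escape, used by the helpers in this file

# Converts the PDF to text. With no <text-file> argument, pdftotext writes
# the result next to the PDF, replacing the .pdf extension with .txt.
# Add '-' as the last argument to print results inline.
def extract_to_text(pdf_path)
  command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end

# Converts the PDF to HTML with pdftohtml.
def extract_to_html(pdf_path)
  command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
  `#{command}`
end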
=begin
> `pdfimages -h`
pdfimages version 0.48.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -png           : change the default output format to PNG
  -tiff          : change the default output format to TIFF
  -j             : write JPEG images as JPEG files
  -jp2           : write JPEG2000 images as JP2 files
  -jbig2         : write JBIG2 images as JBIG2 files
  -ccitt         : write CCITT images as CCITT files
  -all           : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
  -list          : print list of images instead of saving
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -p             : include page numbers in output file names
  -q             : don't print any messages or errors
  -v             : print copyright and version info
  -h             : print usage information
  -help          : print usage information
  --help         : print usage information
  -?             : print usage information
=end
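# Extracts the images embedded in the PDF as PNG files.
# output_path is passed as the required <image-root> argument, so files are
# written as <output_path>-NNN.png.
def extract_to_img(pdf_path, output_path)
  command = ['pdfimages', '-png', Shellwords.escape(pdf_path), Shellwords.escape(output_path)].join(' ')
  `#{command}`
end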
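
The comment on `extract_to_text` mentions that a trailing `-` prints the text inline, and the usage text lists `-f`, `-l` and `-layout`. A minimal sketch of how those could be combined is below. The helper names `pdftotext_args` and `extract_text_inline` are hypothetical and not part of the original file. It also assumes `Open3.capture3` with an argument list is acceptable in place of backticks, which skips the shell and so needs no `Shellwords.escape`.

```ruby
require 'open3'

# Hypothetical helper: builds the pdftotext argument list from flags shown
# in the usage text (-f, -l, -layout). The trailing '-' sends the text to
# stdout instead of a .txt file.
def pdftotext_args(pdf_path, first_page: nil, last_page: nil, layout: false)
  args = ['pdftotext']
  args += ['-f', first_page.to_s] if first_page
  args += ['-l', last_page.to_s] if last_page
  args << '-layout' if layout
  args + [pdf_path, '-']
end

# Hypothetical helper: runs pdftotext without a shell and returns the
# extracted text, raising if the command exits with a non-zero status.
def extract_text_inline(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_args(pdf_path, **options))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

For example, `extract_text_inline('report.pdf', first_page: 1, last_page: 2, layout: true)` would return the text of the first two pages with the physical layout preserved, where `report.pdf` is a placeholder file name.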