Last active June 13, 2019 19:12
From http://stackoverflow.com/a/34116472/14420, in answer to "Extract images from PDF without resampling, in python?"
> pip install --upgrade https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
Collecting https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
    | 307kB 5.9MB/s
Installing collected packages: PyPDF2
  Found existing installation: PyPDF2 1.25.1
    Uninstalling PyPDF2-1.25.1:
      Successfully uninstalled PyPDF2-1.25.1
  Running setup.py install for PyPDF2
Successfully installed PyPDF2-1.25.1

[py27] E:\temp
> pip list
arcplus (0.1, d:\b\code\arcplus)
comtypes (1.1.2)
matplotlib (1.3.0)
numpy (1.7.1)
Pillow (3.0.0)
pip (7.1.2)
pyparsing (1.5.7)
PyPDF2 (1.25.1)
pywin32 (219)
setuptools (15.0)

[py27] E:\temp\pdf-image-extractor
> python pdf-image-extractor.py "Seige of Vicksburg Sample OCR.pdf"
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1722]
Traceback (most recent call last):
  File "pdf-image-extractor.py", line 21, in <module>
    if xObject[obj]['/Filter'] == '/FlateDecode':
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PyPDF2\generic.py", line 512, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Filter'
import PyPDF2
from PIL import Image

if __name__ == '__main__':
    ## pdf = r'e:\temp\dctdecode.pdf'
    pdf = r'e:\temp\Seige of Vicksburg Sample OCR.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    page0 = input1.getPage(0)
    # Image XObjects are listed in the page's resource dictionary
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            # getData() runs the stream through PyPDF2's filter decoders
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            # NOTE: this lookup raises KeyError when the stream has no
            # '/Filter' entry, as in the console log for the Vicksburg PDF
            if xObject[obj]['/Filter'] == '/FlateDecode':
                # Decoded pixel data: rebuild the image and save as PNG
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                # JPEG data: write the bytes out unchanged
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                # JPEG 2000 data: write the bytes out unchanged
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
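The `KeyError: '/Filter'` in the console log comes from indexing a key that the PDF format treats as optional; `/Filter` can also be an array of names rather than a single name. The following is only a rough sketch of how the branch selection could be made tolerant of both cases. `plan_image_output` and `EXTENSION_BY_FILTER` are hypothetical names, plain dicts stand in for PyPDF2's stream dictionaries, and treating an unfiltered stream like Flate-decoded pixel data is an assumption rather than something the script above does.

```python
# Sketch only: plain dicts stand in for PyPDF2 stream dictionaries.
# plan_image_output is a hypothetical helper, not part of PyPDF2.

EXTENSION_BY_FILTER = {
    '/FlateDecode': '.png',   # decoded pixels, re-saved through Pillow
    '/DCTDecode': '.jpg',     # stream is already JPEG data
    '/JPXDecode': '.jp2',     # stream is already JPEG 2000 data
}


def plan_image_output(stream):
    """Return (filters, extension) for an image XObject dictionary.

    extension is None when the last filter is not one the script handles.
    """
    # '/Filter' is optional, so test for the key instead of indexing blindly.
    entry = stream['/Filter'] if '/Filter' in stream else None
    if entry is None:
        filters = []
    elif isinstance(entry, (list, tuple)):
        filters = [str(name) for name in entry]
    else:
        filters = [str(entry)]

    if not filters:
        # Assumption: an unfiltered stream holds raw samples, the same
        # thing the script gets after Flate decoding, so use the PNG branch.
        return filters, '.png'
    # With a filter chain, the last filter describes the image encoding.
    return filters, EXTENSION_BY_FILTER.get(filters[-1])
```

In the loop above, the returned extension could replace the three `xObject[obj]['/Filter'] == ...` comparisons, with `None` signalling an image the script should skip or report.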
I removed the old result console reports, since the new version makes those errors obsolete.
Traceback (most recent call last):
  File "/xx/xx/extract.py", line 16, in <module>
    data = xObject[obj].getData()
  File "/Library/Python/2.7/site-packages/PyPDF2/generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/Library/Python/2.7/site-packages/PyPDF2/filters.py", line 361, in decodeStreamData
    raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /CCITTFaxDecode
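One workaround for the `/CCITTFaxDecode` error is to skip `getData()` entirely and prepend a minimal TIFF header to the undecoded stream, so that a TIFF reader does the fax decoding. The sketch below assumes the raw bytes are available (in PyPDF2 1.x they appear to sit in the private `_data` attribute that shows up in the traceback), that the whole image is one strip, and that the data is Group 4, which is what a negative `/K` in `/DecodeParms` indicates. `wrap_ccitt_as_tiff` is a hypothetical helper name.

```python
import struct

# Layout: 8-byte TIFF header, 2-byte tag count, eight 12-byte tags,
# then a 4-byte "next IFD" offset (0 = no further IFDs).
_HEADER_FORMAT = '<2sHIH' + 'HHII' * 8 + 'I'


def wrap_ccitt_as_tiff(raw, width, height, group=4):
    """Prefix raw CCITT fax data with a single-strip TIFF header."""
    header_size = struct.calcsize(_HEADER_FORMAT)
    header = struct.pack(
        _HEADER_FORMAT,
        b'II', 42, 8,            # little-endian, TIFF magic, first IFD offset
        8,                       # number of tags in the IFD
        256, 4, 1, width,        # ImageWidth
        257, 4, 1, height,       # ImageLength
        258, 3, 1, 1,            # BitsPerSample: 1 (bilevel)
        259, 3, 1, group,        # Compression: 3 = Group 3, 4 = Group 4
        262, 3, 1, 0,            # PhotometricInterpretation: 0 = WhiteIsZero
        273, 4, 1, header_size,  # StripOffsets: data follows the header
        278, 4, 1, height,       # RowsPerStrip: whole image in one strip
        279, 4, 1, len(raw),     # StripByteCounts
        0,                       # no further IFDs
    )
    return header + raw
```

Writing the result to a `.tiff` file in binary mode should give something Pillow or another TIFF reader can open. Depending on the stream's `/BlackIs1` setting the image may come out inverted, which this sketch does not handle.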
getpdfimage('ana.pdf')
Traceback (most recent call last):
  File "", line 1, in
    getpdfimage('ana.pdf')
  File "", line 9, in getpdfimage
    data = xObject[obj].getData()
  File "C:\Users\nmb31\Anaconda3\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Users\nmb31\Anaconda3\lib\site-packages\PyPDF2\filters.py", line 361, in decodeStreamData
    raise NotImplementedError("unsupported filter %s" % filterType)
NotImplementedError: unsupported filter /DCTDecode
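This traceback suggests the installed PyPDF2 has no `/DCTDecode` decoder, which is presumably why the console log at the top installs a fork first. A `/DCTDecode` stream is already JPEG data, so another option is to bypass `getData()` and write the undecoded bytes straight to disk. A minimal sketch, assuming the raw bytes can be read from the stream (in PyPDF2 1.x, apparently via the private `_data` attribute seen in the traceback); `save_raw_jpeg` is a hypothetical helper name.

```python
def save_raw_jpeg(raw_bytes, path):
    """Write an undecoded /DCTDecode stream to disk as a .jpg file."""
    # Every JPEG starts with the SOI marker FF D8; check it so that
    # non-JPEG data is reported instead of silently written out.
    if raw_bytes[:2] != b'\xff\xd8':
        raise ValueError("stream does not start with a JPEG SOI marker")
    with open(path, 'wb') as handle:
        handle.write(raw_bytes)
    return path
```

In the extraction loop this would be called as something like `save_raw_jpeg(xObject[obj]._data, obj[1:] + ".jpg")` before any `getData()` call, under the same assumption about `_data`.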
PDFs used: