maphew · June 13, 2019 19:12 · maphew · Dec 11, 2015
diff --git a/2015-Dec-11 - result.txt b/2015-Dec-11 - result.txt
 > pip install --upgrade https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
 Collecting https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
 C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
 C:\Python27\ArcGIS10.3\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading https://github.com/sylvainpelissier/PyPDF2/archive/master.zip
     | 307kB 5.9MB/s
 Installing collected packages: PyPDF2
  Found existing installation: PyPDF2 1.25.1
    Uninstalling PyPDF2-1.25.1:
      Successfully uninstalled PyPDF2-1.25.1
  Running setup.py install for PyPDF2
 Successfully installed PyPDF2-1.25.1

 [py27] E:\temp
 > pip list
 arcplus (0.1, d:\b\code\arcplus)
 comtypes (1.1.2)
 matplotlib (1.3.0)
 numpy (1.7.1)
 Pillow (3.0.0)
 pip (7.1.2)
 pyparsing (1.5.7)
 PyPDF2 (1.25.1)
 pywin32 (219)
 setuptools (15.0)


 [py27] E:\temp\pdf-image-extractor
 > python pdf-image-extractor.py  "Seige of Vicksburg Sample OCR.pdf"
 PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1722]
 Traceback (most recent call last):
  File "pdf-image-extractor.py", line 21, in <module>
    if xObject[obj]['/Filter'] == '/FlateDecode':
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PyPDF2\generic.py", line 512, in __getitem__
    return dict.__getitem__(self, key).getObject()
 KeyError: '/Filter'
diff --git a/2015-Dec-11_pdf-image-extractor.py b/2015-Dec-11_pdf-image-extractor.py
 import PyPDF2

 from PIL import Image

 if __name__ == '__main__':
 ##    pdf = r'e:\temp\dctdecode.pdf'
    pdf = r'e:\temp\Seige of Vicksburg Sample OCR.pdf'

    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
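
A note on the `KeyError: '/Filter'` in the traceback: `/Filter` is an optional entry in a PDF stream dictionary, so an image XObject may not carry it at all, and when present it may be either a single name or an array of names. The script indexes `xObject[obj]['/Filter']` unconditionally, which fails on the first image without the key. A minimal sketch of a guard, assuming the image object behaves like a dict (PyPDF2's dictionary objects support `in` and `[]`). The helper names `image_filters` and `output_extension` and the `PASSTHROUGH` table are hypothetical and not part of PyPDF2. Treating an image with no filter as raw samples to be saved as PNG is also an assumption.

```python
# Hypothetical helpers for illustration; these names are not part of PyPDF2.

# Filters whose stream bytes are already a complete image file format.
PASSTHROUGH = {'/DCTDecode': '.jpg', '/JPXDecode': '.jp2'}


def image_filters(image):
    """Return the /Filter entry as a list of names; empty list if absent."""
    if '/Filter' not in image:
        return []
    value = image['/Filter']
    # /Filter may be a single name or an array of names.
    if isinstance(value, (list, tuple)):
        return [str(name) for name in value]
    return [str(value)]


def output_extension(image):
    """Pick an output extension, or None when the filter is not handled."""
    filters = image_filters(image)
    for name in filters:
        if name in PASSTHROUGH:
            return PASSTHROUGH[name]
    # Assumption: no filter, or FlateDecode alone, means the decoded data
    # is raw pixel samples that can be handed to Image.frombytes().
    if not filters or filters == ['/FlateDecode']:
        return '.png'
    return None
```

In the loop above, the three `xObject[obj]['/Filter'] == ...` comparisons could then branch on `output_extension(xObject[obj])` instead, skipping the image when the result is `None`.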