@terencezl
Created April 20, 2017 04:39
use pdfminer to extract pdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO


def convert_pdf(path, format='text', codec='utf-8', password=''):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue().decode()
    fp.close()
    device.close()
    retstr.close()
    return text
@joelhsmith

joelhsmith commented Jun 14, 2019

Any idea how to add functionality to export/check for tags too?

Edit: I figured it out. Maybe this will help someone later: https://gist.github.com/joelhsmith/5e6ec7ee70ab4b89d7bc5700e9e07fde

@nitinsurya

Any idea how to do the reverse, XML back to PDF?

ghost commented Jun 10, 2020

Hi, thanks for the sample code.
In my case it works very well for conversion to text and HTML formats but I have a problem with XML.

When I write the conversion to an XML file via this :

open(path_xml, "w").close()
text_output = convert_pdf(path_pdf)
open(path_xml, "a", encoding="utf-8").write(text_output)

I get this message when I open the file:
[image not preserved]

Do you have any ideas? Thank you in advance.

@joelhsmith

joelhsmith commented Jun 10, 2020

Have you tried some other documents of different file sizes and page length? Issues I have run into in the past:

  • Documents over 20 MB (rarely).
  • Documents that are invalid in some way. They open in Acrobat but have fundamental underlying problems.
  • Documents with very obscure special characters.

Also try documents that are tagged and those that are not. I suggest trying a wide variety of files to see if you can find a common theme. I've literally run some of this on thousands of files, and a messed-up file still shows up sometimes. I just document it and move on.

Edit #2: Run the XML that it generates through a validator to get more information on exactly where it is failing. https://www.w3schools.com/xml/xml_validator.asp
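
A local check can complement the online validator. The following is a minimal sketch using only Python's standard library; `check_well_formed` is a hypothetical helper name, and it tests well-formedness only (for example, a missing end tag), not validity against a schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The parser's message includes the line and column where it stopped.
        return False, str(exc)
    return True, None


# Illustrative inputs, not real pdfminer output.
complete = '<pages><page id="1"></page></pages>'
truncated = '<pages><page id="1"></page>'  # closing </pages> is missing

print(check_well_formed(complete))
print(check_well_formed(truncated))
```

The error message for the truncated input points at the position where the parser ran out of input, which is usually enough to spot an unclosed root element.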

ghost commented Jun 10, 2020

Hi! Thanks for your advice.
The PDF file I want to convert is one of these to give you an idea (Part-B to be precise) : my PDF document

  • Its size is 2985 KB.
  • I don't think it contains obscure characters.
  • I have tested the code on different PDFs and the same error comes up each time.

On the other hand, I ran the xml output I get with this code in the validator of the Editix software and it detected an error that I had already identified via this forum : detected error
I added the end tag by hand, but when I ask Google Chrome to display the file, it crashes. Just before the crash, the display is similar to what I had before correcting this error:
[image not preserved]

In any case, thank you very much for your help! I didn't think it would be so hard to switch from a PDF document to an XML document. I'll keep working on it. If you have any other ideas, don't hesitate!
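
A missing end tag is consistent with the gist calling `retstr.getvalue()` before `device.close()`. The following is a minimal sketch of that ordering issue, assuming the converter writes its closing tag (such as `</pages>`) only inside `close()`. `StandInXMLDevice` and `read_output` are hypothetical names, used here so the snippet runs without pdfminer installed; they are not part of the library.

```python
from io import BytesIO


class StandInXMLDevice:
    """Hypothetical stand-in for a converter that writes its footer on close()."""

    def __init__(self, outfp):
        self.outfp = outfp
        self.outfp.write(b"<pages>\n")  # header written at construction

    def process(self, page_number):
        self.outfp.write(b'<page id="%d"></page>\n' % page_number)

    def close(self):
        self.outfp.write(b"</pages>\n")  # footer written only here


def read_output(device, buffer, codec="utf-8"):
    """Close the device first so the footer is in the buffer, then read it."""
    device.close()
    text = buffer.getvalue().decode(codec)
    buffer.close()
    return text


buffer = BytesIO()
device = StandInXMLDevice(buffer)
device.process(1)

too_early = buffer.getvalue().decode("utf-8")  # read before close(): no footer yet
text = read_output(device, buffer)

print(too_early.endswith("</pages>\n"))
print(text.endswith("</pages>\n"))
```

If the assumption holds for the installed pdfminer version, the equivalent change in `convert_pdf` would be to call `device.close()` before `retstr.getvalue()`.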

@marcello-telles

nice! thanks for that!

@DevilBoy007

DevilBoy007 commented Aug 12, 2021

__init__() got an unexpected keyword argument 'codec' ... ??

EDIT: class TextConverter(PDFConverter) does not take a codec argument

def __init__(self, rsrcmgr, outfp, pageno=1, laparams=None,
             showpageno=False, imagewriter=None):
    PDFConverter.__init__(self, rsrcmgr, outfp, pageno=pageno, laparams=laparams)
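
Since some installed versions accept `codec` and others (like the signature above) do not, one option is to inspect the constructor and pass `codec` only when it is accepted. This is a minimal sketch; `converter_kwargs` and the two stand-in classes are hypothetical and exist only so the snippet runs without pdfminer installed.

```python
import inspect


def converter_kwargs(converter_cls, codec="utf-8", laparams=None):
    """Build keyword arguments, including codec only if the class names it."""
    params = inspect.signature(converter_cls).parameters
    kwargs = {"laparams": laparams}
    if "codec" in params:
        kwargs["codec"] = codec
    return kwargs


class WithCodec:
    """Stand-in for a converter whose __init__ takes codec."""

    def __init__(self, rsrcmgr, outfp, codec="utf-8", pageno=1, laparams=None):
        self.codec = codec


class WithoutCodec:
    """Stand-in matching the signature quoted above (no codec parameter)."""

    def __init__(self, rsrcmgr, outfp, pageno=1, laparams=None,
                 showpageno=False, imagewriter=None):
        self.laparams = laparams


print(converter_kwargs(WithCodec))
print(converter_kwargs(WithoutCodec))

# Usage with either class: positional arguments stay the same.
device = WithoutCodec(None, None, **converter_kwargs(WithoutCodec))
```

With the real classes, the call would look like `TextConverter(rsrcmgr, retstr, **converter_kwargs(TextConverter, codec=codec, laparams=laparams))`.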

@NaveenTanguduTR

While running the above code, I get this error:

Traceback (most recent call last):
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 41, in <module>
    if __name__ == main():
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 38, in main
    out = convert_pdf(fileName, codec)
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 18, in convert_pdf
    device = XMLConverter(rsrcmgr, retstr, laparams=laparams)
  File "C:\Users*\PycharmProjects\gcs-authoring_authoring-service\venv\lib\site-packages\pdfminer\converter.py", line 407, in __init__
    self.write_header()
  File "C:\Users*****\PycharmProjects\gcs-authoring_authoring-service\venv\lib\site-packages\pdfminer\converter.py", line 411, in write_header
    self.outfp.write('\n')
TypeError: a bytes-like object is required, not 'str'
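
The traceback shows the converter writing a `str` (`'\n'`) into the `BytesIO` buffer, which only accepts bytes. One possible workaround is a buffer that encodes `str` writes on the way in. This is a minimal sketch using only the standard library; `TolerantBytesIO` is a hypothetical name. Switching the buffer to `io.StringIO` (and dropping the final `.decode()`) may also work if the installed converter writes only `str`, which is an assumption to verify.

```python
from io import BytesIO


class TolerantBytesIO(BytesIO):
    """BytesIO that also accepts str by encoding it before writing."""

    encoding = "utf-8"

    def write(self, data):
        if isinstance(data, str):
            data = data.encode(self.encoding)
        return super().write(data)


# A plain BytesIO rejects str, as in the traceback above.
try:
    BytesIO().write("\n")
    plain_error = None
except TypeError as exc:
    plain_error = str(exc)

# The tolerant buffer accepts both kinds of write.
buf = TolerantBytesIO()
buf.write("<pages>\n")
buf.write(b"</pages>\n")
result = buf.getvalue().decode("utf-8")

print(plain_error)
print(result)
```

In `convert_pdf`, this would replace `retstr = BytesIO()` with `retstr = TolerantBytesIO()`, leaving the rest of the function unchanged.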

@SLadovir

SLadovir commented Nov 12, 2021

[three images not preserved]
We need to add the last line:
[two images not preserved]
After which we get this:
[image not preserved]

With a workaround, we can fix it this way:
[image not preserved]