from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf(path, format='text', codec='utf-8', password=''):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    # Close the device before reading the buffer: the HTML and XML
    # converters write their closing markup (e.g. </pages>) in close().
    device.close()
    text = retstr.getvalue().decode(codec)
    retstr.close()
    return text
Hi, thanks for the sample code.
In my case it works very well for conversion to text and HTML formats but I have a problem with XML.
When I write the conversion to an XML file like this:
open(path_xml, "w").close()
text_output = convert_pdf(path_pdf)
open(path_xml, "a", encoding="utf-8").write(text_output)
I get this message when I open the file:
Do you have any ideas? Thank you in advance.
Have you tried some other documents of different file sizes and page length? Issues I have run into in the past:
- Documents over 20 MB (rarely).
- Documents that are invalid in some way. Such a file will open in Acrobat but has fundamental underlying problems.
- Documents with very obscure special characters.
Also try documents that are tagged and those that are not. I suggest trying a wide variety of files to see if you can find a common theme. I have run some of this on thousands of files, and a messed-up file still shows up sometimes. I just document it and move on.
Edit #2: Run the XML that it generates through a validator to get more information on exactly where it is failing. https://www.w3schools.com/xml/xml_validator.asp
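The same well-formedness check can also be run locally. Below is a minimal sketch using only the standard library; `check_well_formed` is a hypothetical helper name, not part of pdfminer, and it only tests whether the text parses, not whether it matches any schema.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, (line, column, message))."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple pointing at the failure.
        line, column = err.position
        return False, (line, column, str(err))
    return True, None

# Example: a document whose root element is never closed.
ok, problem = check_well_formed("<pages>\n<page id='1'></page>\n")
print(ok, problem)
```

The reported line and column show where the parser gave up, which is usually close to the missing or broken tag.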
Hi! Thanks for your advice.
To give you an idea, the PDF file I want to convert is one of these (Part-B, to be precise): my PDF document
- Its size is 2985 KB.
- I don't think it contains any obscure characters.
- I have tested the code on different PDFs and the same error comes up each time.
On the other hand, I ran the XML output I get from this code through the validator in the Editix software, and it detected an error that I had already identified via this forum: detected error
I added the end tag by hand, but when I ask Google Chrome to display the file, it crashes. Just before the crash, the display is similar to what I had before correcting this error:
In any case, thank you very much for your help! I didn't think it would be so hard to switch from a PDF document to an XML document. I'll keep working on it. If you have any other ideas, don't hesitate!
nice! thanks for that!
__init__() got an unexpected keyword argument 'codec' ... ??
EDIT:
class TextConverter(PDFConverter)
does not take a codec argument:
def __init__(self, rsrcmgr, outfp, pageno=1, laparams=None,
             showpageno=False, imagewriter=None):
    PDFConverter.__init__(self, rsrcmgr, outfp, pageno=pageno, laparams=laparams)
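Since the converter signature apparently differs between installed pdfminer variants, one way to cope is to pass `codec` only when the constructor declares it. The following is a rough sketch, not pdfminer API: `supported_kwargs` is a hypothetical helper, and the two converter classes are stand-ins that only mimic the two signatures discussed here.

```python
import inspect

def supported_kwargs(cls, **candidates):
    """Keep only the keyword arguments that cls.__init__ declares."""
    params = inspect.signature(cls.__init__).parameters
    # If the constructor takes **kwargs, anything may be passed through.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {name: value for name, value in candidates.items() if name in params}

# Stand-in classes for illustration only; they are not pdfminer classes.
class OldStyleConverter:
    def __init__(self, rsrcmgr, outfp, pageno=1, laparams=None):
        self.outfp = outfp

class NewStyleConverter:
    def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None):
        self.outfp = outfp

print(supported_kwargs(OldStyleConverter, codec='utf-8', laparams=None))
print(supported_kwargs(NewStyleConverter, codec='utf-8', laparams=None))
```

In the gist's function this would look like `TextConverter(rsrcmgr, retstr, **supported_kwargs(TextConverter, codec=codec, laparams=laparams))`, assuming the installed class exposes an inspectable `__init__`.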
While running the code above, I get this error:
Traceback (most recent call last):
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 41, in <module>
    if __name__ == main():
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 38, in main
    out = convert_pdf(fileName, codec)
  File "C:/Users//AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_10.py", line 18, in convert_pdf
    device = XMLConverter(rsrcmgr, retstr, laparams=laparams)
  File "C:\Users*\PycharmProjects\gcs-authoring_authoring-service\venv\lib\site-packages\pdfminer\converter.py", line 407, in __init__
    self.write_header()
  File "C:\Users*****\PycharmProjects\gcs-authoring_authoring-service\venv\lib\site-packages\pdfminer\converter.py", line 411, in write_header
    self.outfp.write('\n')
TypeError: a bytes-like object is required, not 'str'
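The traceback suggests that, without a codec, the converter writes `str` to the output stream, which a plain `BytesIO` rejects. Two possible workarounds are to use `io.StringIO` as the buffer (and then skip the `.decode()` call), or to use a buffer that tolerates both types. The sketch below shows the second option; `TolerantBytesIO` is a hypothetical helper, not part of pdfminer, and it has not been checked against every pdfminer variant.

```python
from io import BytesIO

class TolerantBytesIO(BytesIO):
    """BytesIO that also accepts str by encoding it first (hypothetical helper)."""

    def __init__(self, encoding='utf-8'):
        super().__init__()
        self.encoding = encoding

    def write(self, data):
        # Encode text so both str-writing and bytes-writing callers work.
        if isinstance(data, str):
            data = data.encode(self.encoding)
        return super().write(data)

buf = TolerantBytesIO()
buf.write('<pages>\n')    # str, as in the traceback above
buf.write(b'</pages>\n')  # bytes, as written when a codec is set
print(buf.getvalue().decode('utf-8'))
```

Replacing `retstr = BytesIO()` with `retstr = TolerantBytesIO()` would leave the rest of the function unchanged.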
Any idea how to do the reverse, XML back to PDF?