Skip to content

Instantly share code, notes, and snippets.

@tiarno
Last active February 2, 2024 22:15
Show Gist options
  • Select an option

  • Save tiarno/8a2995e70cee42f01e79 to your computer and use it in GitHub Desktop.

Select an option

Save tiarno/8a2995e70cee42f01e79 to your computer and use it in GitHub Desktop.
find PDF font info with PyPDF2, example code
from PyPDF2 import PdfFileReader
from pprint import pprint
def walk(obj, fnt, emb):
'''
If there is a key called 'BaseFont', that is a font that is used in the document.
If there is a key called 'FontName' and another key in the same dictionary object
that is called 'FontFilex' (where x is null, 2, or 3), then that fontname is
embedded.
We create and add to two sets, fnt = fonts used and emb = fonts embedded.
'''
if not hasattr(obj, 'keys'):
return None, None
fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])
if '/BaseFont' in obj:
fnt.add(obj['/BaseFont'])
if '/FontName' in obj:
if [x for x in fontkeys if x in obj]:# test to see if there is FontFile
emb.add(obj['/FontName'])
for k in obj.keys():
walk(obj[k], fnt, emb)
return fnt, emb# return the sets for each page
if __name__ == '__main__':
fname = 'myfile.pdf'
pdf = PdfFileReader(fname)
fonts = set()
embedded = set()
for page in pdf.pages:
obj = page.getObject()
# updated via this answer:
# https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs/60895334#60895334
# in order to handle lists inside objects. Thanks misingnoglic !
# untested code since I don't have such a PDF to play with.
if type(obj) == PyPDF2.generic.ArrayObject: # You can also do ducktyping here
for i in obj:
if hasattr(i, 'keys'):
f, e = walk(i, fonts, embedded_fonts)
fonts = fonts.union(f)
embedded = embedded.union(e)
else:
f, e = walk(obj['/Resources'], fonts, embedded)
fonts = fonts.union(f)
embedded = embedded.union(e)
unembedded = fonts - embedded
print 'Font List'
pprint(sorted(list(fonts)))
if unembedded:
print '\nUnembedded Fonts'
pprint(unembedded)
@zenssh
Copy link
Copy Markdown

zenssh commented Jan 19, 2016

Thank you very much for the script - used it to get the fonts of a pdf that I couldn't extract by other means!

@LizaKoz
Copy link
Copy Markdown

LizaKoz commented Dec 7, 2018

That's a very handy! Is it also possuble to get a text wich is written bold?

@Shohreh
Copy link
Copy Markdown

Shohreh commented Mar 17, 2020

First, install the two packages:

pip install PyPDF2
pip install pprint

Next, if using Python3, edit the following lines:

print('Font List')
pprint(sorted(list(fonts)))
…
print('\nUnembedded Fonts')
pprint(unembedded)

@misingnoglic
Copy link
Copy Markdown

Hi - FYI I tried to use this script to find unembedded fonts in PDFs. I downloaded a PDF from Google docs with just Arial, and Adobe Reader says the font is embedded, but this script says it's not.

@tiarno
Copy link
Copy Markdown
Author

tiarno commented Mar 27, 2020

thanks for the info. This works for the PDFs I've come across, but there are so many different structures possible inside a PDF. I would definitely believe Adobe. If you want further confirmation, pdffonts is a command line tool you might be interested in. https://www.xpdfreader.com/pdffonts-man.html

@misingnoglic
Copy link
Copy Markdown

Thanks - I'll try to reverse engineer what they have done. In the meantime I've asked on Stack Overflow: https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs

@misingnoglic
Copy link
Copy Markdown

Figured it out! You need to modify the script to handle lists as well. I put an example in the stackoverflow answer:
https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs/60895334#60895334

@tiarno
Copy link
Copy Markdown
Author

tiarno commented Mar 28, 2020

I updated the code as best I could. Untested though. Thanks for the info!

@pranav1698
Copy link
Copy Markdown

That's a very handy! Is it also possuble to get a text wich is written bold?

@tiarno any way we can do this as well

@ilcaa72
Copy link
Copy Markdown

ilcaa72 commented Apr 10, 2020

hi, can someone help a rookie out... this is a function that will return the names of the various fonts within a pdf... correct?
so i should feed it a pdf ( i assume this is the object param of the function) but what are the other 2 params? it seems like its asking me for 2 fonts..

quick explain... thanks

@vagnit
Copy link
Copy Markdown

vagnit commented Aug 21, 2020

That's a very handy! Is it also possuble to get a text wich is written bold?

@tiarno any way we can do this as well

Indeed this would be really useful!

@benthorner
Copy link
Copy Markdown

We've been using this code for quite a while to detect unembedded fonts.

I made a change recently to support composite (Type 0) fonts: alphagov/notifications-template-preview@83c137b

@tiarno
Copy link
Copy Markdown
Author

tiarno commented Jun 25, 2021

We've been using this code for quite a while to detect unembedded fonts.

I made a change recently to support composite (Type 0) fonts: alphagov/notifications-template-preview@83c137b

Very nice, I didn't even know about Type0 fonts. Thanks for the comment.

@medhadeeptimahanti
Copy link
Copy Markdown

Thank you for this! It helped me out. I've gotten outputs for PDFs I've tested as follows:

{'SymbolMT', 'ArialMT', 'BCDEEE', 'BCDGEE', 'BCDFEE'}

Is there a reason why it shows 'ArialMT' instead of just 'Arial' and what exactly is the 'BCDGEE'? Is there a way to get rid of the 'MT' in 'ArialMT' and the 'BCDGEE'?

@mteam88
Copy link
Copy Markdown

mteam88 commented Jan 10, 2022

I would love this in python 3.

@rojvv
Copy link
Copy Markdown

rojvv commented May 7, 2023

@mteam88 here:

from PyPDF2 import PdfReader
from pprint import pprint
import PyPDF2

def walk(obj, fnt, emb):
    '''
    If there is a key called 'BaseFont', that is a font that is used in the document.
    If there is a key called 'FontName' and another key in the same dictionary object
    that is called 'FontFilex' (where x is null, 2, or 3), then that fontname is 
    embedded.
    
    We create and add to two sets, fnt = fonts used and emb = fonts embedded.
    '''
    if not hasattr(obj, 'keys'):
        return None, None
    fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    if '/FontName' in obj:
        if [x for x in fontkeys if x in obj]:# test to see if there is FontFile
            emb.add(obj['/FontName'])

    for k in obj.keys():
        walk(obj[k], fnt, emb)

    return fnt, emb# return the sets for each page

if __name__ == '__main__':
    fname = 'myfile.pdf'
    pdf = PdfReader(fname)
    fonts = set()
    embedded = set()
    for page in pdf.pages:
        obj = page.get_object()
        # updated via this answer:
        # https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs/60895334#60895334 
        # in order to handle lists inside objects. Thanks misingnoglic !
        # untested code since I don't have such a PDF to play with.
        if type(obj) == PyPDF2.generic.ArrayObject:  # You can also do ducktyping here
            for i in obj:
                if hasattr(i, 'keys'):
                    f, e = walk(i, fonts, embedded_fonts)
                    fonts = fonts.union(f)
                    embedded = embedded.union(e)
        else:
            f, e = walk(obj['/Resources'], fonts, embedded)
            fonts = fonts.union(f)
            embedded = embedded.union(e)

    unembedded = fonts - embedded
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print ('\nUnembedded Fonts')
        pprint(unembedded)

@J0hn3ch
Copy link
Copy Markdown

J0hn3ch commented Aug 1, 2023

Hi! Is there a way to get bold words or bold phrases inside a page containing information about font used?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment