So you want to modify the text of a PDF by hand...

If you, like me, resent every dollar spent on commercial PDF tools, you might want to know how to change the text content of a PDF without having to pay for Adobe Acrobat or another PDF tool. I didn't see an obvious open-source tool that lets you dig into PDF internals, but I did discover a few useful facts about how PDFs are structured that I think may prove useful to others (or myself) in the future. They are recorded here. They are surely not universally applicable -- the PDF standard is truly Byzantine -- but they worked for my case.

This guide is Mac-oriented, but the tools are all available via most Linux distributions as well.

Viewing compressed text data

You can open a PDF in a text editor and see some stuff that looks kinda readable, in a vague way, but find that none of it is the actual text of the PDF. It turns out that many PDFs store the text data in a compressed form. To view the compressed data, you can use a command line tool called qpdf. For Macs, there's a Homebrew formula.

Here's a command that decompresses all compressed text streams in a given PDF (via this stackoverflow post):

qpdf --qdf --object-streams=disable in.pdf out.pdf

You can recompress the streams like so:

qpdf out-edited.pdf out-recompressed.pdf

This second command generated some errors for me, but the resulting PDF was readable using Preview.

Finding the text data

Once you've decompressed the compressed text streams, you can open the PDF in a text editor and view them! Except you have to find them. Here's what they look like in a basic form:

BT
  /Font_0 12 Tf
  288 720 Td
  <002a004800570003003600480057> Tj
ET

The PDF Reference (Third Edition, p.293) has this to say about the above:

The five lines of this example perform the following steps:

  1. Begin a text object.
  2. Set the font and font size to use, installing them as parameters in the text state...
  3. Specify a starting position on the page, setting parameters in the text object.
  4. Paint the glyphs for a string of characters there.
  5. End the text object.
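
If you don't feel like hunting for these blocks by eye, a few lines of Python can list them. This is just a sketch built on some assumptions: it reads the decompressed out.pdf produced by the qpdf command above, and it only looks for strings painted with the Tj operator inside BT/ET pairs, as in the example (PDFs can also paint text with other operators like TJ, which this ignores).

import re

# Rough sketch: list the hex strings painted inside BT ... ET text objects.
# "out.pdf" is the decompressed file from the qpdf command above.
with open("out.pdf", "rb") as f:
    data = f.read().decode("latin-1")  # latin-1 never fails, so binary bytes survive

for block in re.findall(r"BT(.*?)ET", data, flags=re.DOTALL):
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>\s*Tj", block):
        print(hex_string)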

Actually reading the text

As you can see from the above example, we still can't read the text. It is encoded. And if you thought to yourself "look at that hex string, I bet it's a bunch of Unicode code points" -- well, I wish we lived in a kinder world too. It seems there are a million ways to specify encodings in PDFs, including custom encodings that are embedded in the file itself. Those encodings do map to Unicode code points (most of the time?), so that's good. Let's assume that the file you're working with does have embedded encodings (because I have no idea how to handle other cases).

Identifying fonts associated with embedded encodings

Text encodings in PDFs are linked to specific fonts. Information about those encodings is embedded in the PDF in ways I don't understand, but there's an existing command line tool that extracts it: pdffonts. Here's an example of the output it generates:

$ pdffonts sample.pdf
name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------
CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0

Here, the relevant fields are "emb" (meaning the font, and with it the encoding, is embedded in the PDF) and "uni" (meaning there is a mapping to Unicode code points rather than just to raw glyphs). Assuming both are set to "yes," we're in luck.

In the text example above, you'll notice the /Font_0 descriptor. Not all fonts in all PDFs will work this way, but in my case, those labels lined up in a straightforward way with the listing of fonts above. (So /Font_0 is referring to the font named CLDQZB+TrebuchetMS,Bold in the above table.)

Finding the embedded encoding table for the given font

Once you have determined the full name of your text's font (like CLDQZB+TrebuchetMS,Bold), you can search for it. In my case it appeared several times, but in one particular spot it appeared in a short block of commands, including one that looked like this:

/ToUnicode 19 0 R

This appears to specify the object id of the encoding table. If you then search for 19 0 obj, you'll find the table. (Or at least that's how it worked in my case!)
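
If grepping around by hand gets tedious, the same lookup can be scripted. Again, this is a sketch with assumptions baked in: the decompressed out.pdf from earlier, the font name exactly as pdffonts printed it, and a font dictionary in which the /ToUnicode key shows up within a couple of thousand characters after the name (none of which is guaranteed by the PDF spec).

import re

# Rough sketch: find a /ToUnicode reference near the font's name and dump the
# object it points to. The 2000-character window is an arbitrary guess, and in
# some files the /ToUnicode key comes before the name, so adjust as needed.
with open("out.pdf", "rb") as f:
    data = f.read().decode("latin-1")

font_name = "CLDQZB+TrebuchetMS,Bold"  # the name reported by pdffonts
for hit in re.finditer(re.escape(font_name), data):
    window = data[hit.start():hit.start() + 2000]
    ref = re.search(r"/ToUnicode\s+(\d+)\s+0\s+R", window)
    if ref:
        obj_num = ref.group(1)
        body = re.search(r"\b" + obj_num + r"\s+0\s+obj(.*?)endobj", data, re.DOTALL)
        print("ToUnicode table is object", obj_num)
        print(body.group(1)[:500])  # a peek at the start of the encoding table
        break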

The encoding table format

The salient part of the encoding table looks like this:

38 beginbfrange^M
<0036><0036><0053>^M
<0057><0057><0074>^M
<0044><0044><0061>^M
<0048><0048><0065>^M
<0050><0050><006D>^M
...

If yours looks different, check out the ToUnicode mapping file tutorial, which describes a bunch of possible variations. In this case, the table is mapping ranges of custom encoding points to Unicode code points -- except these are ranges of just one character. So here, the custom point 0036 maps to the Unicode code point 0053 -- that is, the capital letter S.

To perform this translation in an automated way, I used Python to convert the table into a dictionary, and wrote some simple encoding and decoding functions. This isn't a Python tutorial, sadly, but if you know Python or any other scripting language, you can probably work out a few different ways to solve this part of the problem.
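
For what it's worth, here is the general shape of that code, as a sketch rather than a drop-in tool. It assumes the table contains only the single-character beginbfrange entries shown above; real tables can also contain genuine ranges and bfchar sections (see the ToUnicode tutorial), which this ignores.

import re

# Rough sketch: turn a beginbfrange table of one-character "ranges" into a pair
# of lookup dictionaries, then use them to decode and encode Tj hex strings.
table_text = """
<0036><0036><0053>
<0057><0057><0074>
<0044><0044><0061>
<0048><0048><0065>
<0050><0050><006D>
"""

to_unicode = {}    # custom code point -> Unicode character
from_unicode = {}  # Unicode character -> custom code point
for src, _hi, dst in re.findall(r"<(\w{4})><(\w{4})><(\w{4})>", table_text):
    to_unicode[src] = chr(int(dst, 16))       # the ranges are one wide, so the
    from_unicode[chr(int(dst, 16))] = src     # high end can be ignored here

def decode(hex_string):
    """Turn a <...> hex string from a Tj operator into readable text."""
    codes = [hex_string[i:i + 4] for i in range(0, len(hex_string), 4)]
    return "".join(to_unicode.get(code, "?") for code in codes)

def encode(text):
    """Turn readable text back into the custom hex encoding."""
    return "".join(from_unicode[ch] for ch in text)

print(decode("003600480057"))  # prints "Set" with the table above

Decoding one of the hex strings you found earlier is a quick sanity check that you grabbed the right table for the right font.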

Equipped with my encoder and decoder, I determined the custom-encoded version of the text I wanted to replace, wrote the replacement text and custom-encoded it, and used find-and-replace to swap them out. The end!
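
If you'd rather script that last step too, something like the sketch below works on the decompressed file. The hex strings here are hypothetical stand-ins for whatever your encoder produced. Two caveats worth stating: the replacement will only display properly if every character of the new text already has an entry in the encoding table (and a glyph in the embedded, subsetted font), and keeping the new string the same length as the old one sidesteps any questions about the stream's recorded /Length.

# Rough sketch of the final swap, using made-up hex strings; in practice they
# would come from the encode() helper above.
old_hex = b"003600480057"  # "Set" in the custom encoding
new_hex = b"003600440057"  # "Sat" in the custom encoding

with open("out.pdf", "rb") as f:
    data = f.read()

with open("out-edited.pdf", "wb") as f:
    f.write(data.replace(old_hex, new_hex))

From there, the recompression command near the top of this guide turns out-edited.pdf back into an ordinary PDF.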
