If you, like me, resent every dollar spent on commercial PDF tools,
you might want to know how to change the text content of a PDF without
having to pay for Adobe Acrobat or another PDF tool. I didn't see an
obvious open-source tool that lets you dig into PDF internals, but I
did discover a few useful facts about how PDFs are structured that
I think may prove useful to others (or myself) in the future. They
are recorded here. They are surely not universally applicable --
the PDF standard is truly Byzantine -- but they worked for my case.
This guide is Mac-oriented, but the tools are all available via most Linux distributions as well.
You can open a PDF in a text editor and see some stuff that looks kinda
readable, in a vague way, but find that none of it is the actual text
of the PDF. It turns out that many PDFs store the text data in a
compressed form. To view the compressed data, you can use a command line
tool called qpdf. For Macs, there's a Homebrew formula.
Here's a command that decompresses all compressed text streams in a given PDF (via this Stack Overflow post):
qpdf --qdf --object-streams=disable in.pdf out.pdf
You can recompress the streams like so:
qpdf out-edited.pdf out-recompressed.pdf
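This second command generated some errors for me, but the resulting PDF was readable using Preview. (Hand-editing a decompressed file can leave stream lengths and cross-reference offsets stale, which is a likely cause; qpdf includes a fix-qdf utility meant to repair such files.)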
Once you've decompressed the compressed text streams, you can open the PDF in a text editor and view them! Except you have to find them. Here's what they look like in a basic form:
BT
/Font_0 12 Tf
288 720 Td
<002a004800570003003600480057> Tj
ET
The PDF Reference (Third Edition, p.293) has this to say about the above:
The five lines of this example perform the following steps:
- Begin a text object.
- Set the font and font size to use, installing them as parameters in the text state...
- Specify a starting position on the page, setting parameters in the text object.
- Paint the glyphs for a string of characters there.
- End the text object.
As you can see from the above example, we still can't read the text. It is encoded. And if you thought to yourself "look at that hex string, I bet it's a bunch of Unicode code points" -- well, I wish we lived in a kinder world too. It seems there are a million ways to specify encodings in PDFs, including custom encodings that are embedded in the file itself. Those encodings do map to Unicode code points (most of the time?), so that's good. Let's assume that the file you're working with does have embedded encodings (because I have no idea how to handle other cases).
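Finding the text objects in a decompressed file can be scripted. The following is a minimal sketch, assuming the text objects look like the basic form shown above (a BT line, a Tf font selection, hex strings painted with Tj, an ET line). The names `find_text_objects` and `SAMPLE` are made up for illustration, and other string forms (literal strings in parentheses, TJ arrays) are not handled.

```python
import re

# The example text object from above, used as sample input.
SAMPLE = """BT
/Font_0 12 Tf
288 720 Td
<002a004800570003003600480057> Tj
ET
"""

# A text object runs from a line containing only BT to a line containing only ET.
TEXT_OBJECT = re.compile(r"^BT\s*$(.*?)^ET\s*$", re.DOTALL | re.MULTILINE)
# Font selection: /Label size Tf
FONT = re.compile(r"/(\S+)\s+[\d.]+\s+Tf")
# Hex strings painted with the Tj operator.
HEX_STRING = re.compile(r"<([0-9A-Fa-f]+)>\s*Tj")


def find_text_objects(content):
    """Return a list of (font label, [hex strings]) pairs, one per BT...ET block."""
    results = []
    for match in TEXT_OBJECT.finditer(content):
        body = match.group(1)
        font = FONT.search(body)
        label = font.group(1) if font else None
        results.append((label, HEX_STRING.findall(body)))
    return results


for label, strings in find_text_objects(SAMPLE):
    print(label, strings)
```

For a real file, one option is to read it with `open(path, encoding="latin-1")` so that any remaining binary data does not raise decoding errors.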
Text encodings in PDFs are linked to specific fonts. Information about those
encodings is embedded in the PDF in ways I don't understand, but there's an
existing command line tool that extracts it: pdffonts. Here's an example
of the output it generates:
$ pdffonts sample.pdf
name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------
CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0
Here, the relevant fields are "emb" (meaning the font is embedded in the PDF) and "uni" (meaning the PDF includes an explicit mapping from the font's character codes to Unicode code points). Assuming both are set to "yes," we're in luck.
In the text example above, you'll notice the /Font_0 descriptor. Not
all fonts in all PDFs will work this way, but in my case, those labels
lined up in a straightforward way with the listing of fonts above. (So
/Font_0 is referring to the font named CLDQZB+TrebuchetMS,Bold in the
above table.)
Once you have determined the full name of your text's font (like
CLDQZB+TrebuchetMS,Bold), you can search for it. In my case it appeared
several times, but in one particular case, it appeared in a short
block of commands including one that looked like this:
/ToUnicode 19 0 R
This appears to specify the object ID of the encoding table. If you then
search for 19 0 obj, you'll find the table. (Or at least that's how
it worked in my case!)
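The search described above can also be sketched in Python. This is an illustration under assumptions, not a general PDF parser: it assumes a decompressed file in which each object starts with a line like `19 0 obj` and ends with a line starting `endobj`. The sample fragment is made up (only the font name and the object number 19 come from the example above), and the function names are hypothetical.

```python
import re

# Made-up fragment in the style of a decompressed file. Only the font name and
# the reference "19 0 R" come from the example; the rest is illustrative.
SAMPLE = """\
9 0 obj
<<
  /BaseFont /CLDQZB+TrebuchetMS,Bold
  /Subtype /Type0
  /ToUnicode 19 0 R
  /Type /Font
>>
endobj

19 0 obj
<<
  /Length 20 0 R
>>
stream
(the encoding table would be here)
endstream
endobj
"""

OBJECT = re.compile(r"^(\d+) (\d+) obj\b(.*?)^endobj", re.DOTALL | re.MULTILINE)
TO_UNICODE = re.compile(r"/ToUnicode\s+(\d+)\s+(\d+)\s+R")


def split_objects(pdf_text):
    """Map (object number, generation) to the text between 'obj' and 'endobj'."""
    return {(int(num), int(gen)): body for num, gen, body in OBJECT.findall(pdf_text)}


def find_tounicode_ref(pdf_text, font_name):
    """Return the (number, generation) that the named font's /ToUnicode points to."""
    base_font = re.compile(r"/BaseFont\s*/" + re.escape(font_name) + r"(?=[\s/<>\[\]])")
    for body in split_objects(pdf_text).values():
        ref = TO_UNICODE.search(body)
        # The font name may appear in several objects; keep the one with /ToUnicode.
        if ref and base_font.search(body):
            return int(ref.group(1)), int(ref.group(2))
    return None


ref = find_tounicode_ref(SAMPLE, "CLDQZB+TrebuchetMS,Bold")
print(ref)
print(split_objects(SAMPLE)[ref])
```

The font name can appear in more than one object, so the sketch skips any object that mentions the name but has no /ToUnicode entry.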
The salient part of the encoding table looks like this:
38 beginbfrange
<0036><0036><0053>
<0057><0057><0074>
<0044><0044><0061>
<0048><0048><0065>
<0050><0050><006D>
...
If yours looks different, check out the ToUnicode mapping file tutorial,
which describes a bunch of possible variations. In this case, the table is
mapping ranges of custom encoding points to Unicode points -- except these
are ranges of just one character. So here, the custom point 0036 maps to the
Unicode point 0053 -- that is, the capital letter S.
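As a rough illustration of reading such a table, here is a minimal Python sketch. It assumes every row has the three-part form shown above and that each destination is a single four-digit code point; other forms described in the tutorial (bfchar sections, destinations given as arrays, multi-character destinations) are not handled. The name `parse_bfrange` is made up, and the sample contains only the five rows visible above.

```python
import re

# Only the rows visible in the table above; the real table has more entries.
SAMPLE = """\
<0036><0036><0053>
<0057><0057><0074>
<0044><0044><0061>
<0048><0048><0065>
<0050><0050><006D>
"""

# One row: <first code><last code><Unicode value for the first code>
BFRANGE_ROW = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfrange(table_text):
    """Map each custom code (as an int) to the character it stands for."""
    mapping = {}
    for first, last, target in BFRANGE_ROW.findall(table_text):
        start, end, base = int(first, 16), int(last, 16), int(target, 16)
        # A range covers consecutive codes, mapped to consecutive code points.
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


mapping = parse_bfrange(SAMPLE)
for code, char in mapping.items():
    print(f"{code:04X} -> {char}")
```

Each row here is a range of one, so every row yields one entry, but the inner loop would also expand a genuine range such as `<0041><0043><0061>` into three consecutive mappings.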
To perform this translation in an automated way, I used Python to convert the table into a dictionary, and wrote some simple encoding and decoding functions. This isn't a Python tutorial, sadly, but if you know Python or any other scripting language, you can probably work out a few different ways to solve this part of the problem.
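For readers who want a starting point, here is one way such encoding and decoding functions might look. It is a sketch and not the original script: it assumes every code is four hex digits long, and the dictionary holds only the five entries visible in the table above, so codes outside it are shown as "?". The names `decode`, `encode`, `CODE_TO_CHAR` and `CHAR_TO_CODE` are made up.

```python
# Only the five entries visible in the table above; a real table has more.
CODE_TO_CHAR = {0x0036: "S", 0x0057: "t", 0x0044: "a", 0x0048: "e", 0x0050: "m"}
CHAR_TO_CODE = {char: code for code, char in CODE_TO_CHAR.items()}


def decode(hex_string, code_to_char=CODE_TO_CHAR):
    """Turn a hex string of four-digit codes into text; unknown codes become '?'."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    return "".join(code_to_char.get(code, "?") for code in codes)


def encode(text, char_to_code=CHAR_TO_CODE):
    """Turn text into a hex string; raises KeyError for a character not in the table."""
    return "".join(f"{char_to_code[char]:04x}" for char in text)


# The hex string from the text object example; two of its codes are not among
# the five table rows shown, so they decode to '?'.
print(decode("002a004800570003003600480057"))
print(encode("Seat"))
```

Since pdffonts reports these fonts as subsets ("sub" is "yes"), replacement text is presumably limited to characters that already appear in the table, which is why `encode` fails loudly on anything else.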
Equipped with my encoder and decoder, I determined the custom-encoded version of the text I wanted to replace, wrote the replacement text and custom-encoded it, and used find-and-replace to swap them out. The end!
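The find-and-replace step can be scripted as well. The following is a minimal sketch under the same assumptions as before (four-digit codes, text stored as a whole hex string between angle brackets). The sample text object and the strings "Seat" and "Same" are made up for illustration, the mapping holds only the five characters from the table above, and `replace_text` is a hypothetical helper.

```python
import re

# Only the five characters visible in the encoding table above.
CHAR_TO_CODE = {"S": 0x0036, "t": 0x0057, "a": 0x0044, "e": 0x0048, "m": 0x0050}


def encode(text, char_to_code):
    """Turn text into a hex string of four-digit custom codes."""
    return "".join(f"{char_to_code[char]:04x}" for char in text)


def replace_text(pdf_bytes, old_text, new_text, char_to_code):
    """Swap one encoded string for another; returns (new bytes, replacement count)."""
    old_hex = encode(old_text, char_to_code).encode("ascii")
    new_hex = encode(new_text, char_to_code).encode("ascii")
    # Match the whole hex string, in either upper or lower case.
    pattern = re.compile(b"<" + old_hex + b">", re.IGNORECASE)
    return pattern.subn(b"<" + new_hex + b">", pdf_bytes)


# Made-up text object whose hex string spells "Seat" in the custom encoding.
sample = b"BT\n/Font_0 12 Tf\n288 720 Td\n<0036004800440057> Tj\nET\n"
edited, count = replace_text(sample, "Seat", "Same", CHAR_TO_CODE)
print(count)
print(edited.decode("ascii"))
```

Working on bytes (reading the file with `open(path, "rb")`) avoids damaging any binary data elsewhere in the file. The sketch only replaces a string that matches in full, so text split across several Tj operators would need more work.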