Caveat
You're not going to get a beautiful EPUB out the other end - if that's what you're looking for, expect to do some manual clean-up work yourself.
Basic order of operations:
- Convert your PDF to an OCR-friendly format
- OCR that shit into plaintext
- Convert that plaintext into your format of choice (in this case, an EPUB)
Tools of the trade:
These instructions assume you're running macOS
- you should already have Ghostscript (try `gs -v`)
- `brew install tesseract` - then go get a snack (add `brew install tesseract-lang` if you need languages beyond English)
- `brew install pandoc` - this one's a beauty
We're going to translate our PDF into a TIFF. Keep in mind this is really only useful for PDFs that consist solely of images. If your PDF contains text, you might want to avoid outputting to a raster image.
gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif myscan.pdf -c quit
Notes:
- the `-r` flag controls DPI
- `-sDEVICE` in the above form outputs black & white, which will suffice for our EPUB needs
You should really `man gs`, though.
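If you're not sure whether your PDF is image-only, here is a rough sketch of a check. It assumes your Ghostscript build includes the `txtwrite` device, and the helper names `count_text_chars` and `pdf_has_text` are made up for illustration. The idea is to ask Ghostscript to dump whatever text it can find and see whether anything comes back.

```shell
# Count the non-whitespace characters arriving on stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: succeeds if the PDF appears to contain extractable text.
# Assumes gs was built with the txtwrite device.
pdf_has_text() {
    chars=$(gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$1" | count_text_chars)
    [ "$chars" -gt 0 ]
}
```

Something like `pdf_has_text myscan.pdf && echo "already has text, OCR may be unnecessary"` would then tell you whether the TIFF detour is worth it. A scan with a poor embedded text layer can still pass this check, so treat it as a hint rather than a verdict.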
Tesseract is going to read our image and spit out text - it's glorious.
tesseract -l eng mybook.tif mybook
(`-l eng` tells Tesseract that `mybook.tif` contains English text; the output is written to `mybook.txt`)
This is the fun / awful part (depending on your personality). The output text will not contain any structure, and unstructured is exactly what an ebook is not.
Pandoc is quite awesome at converting Markdown to EPUB (it reads many other formats too, but I'd stick with Markdown as the source). Basically you'll want to skim your `mybook.txt` file and throw a `#` in front of chapter headers, remove extraneous text (e.g. page headers & footers), and add in any relevant images (by default pandoc resolves image paths relative to the directory you run it from, normally the one holding your txt file, and puts the images in the EPUB!).
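Some of that skimming can be given a head start. The following is a minimal sketch, not a replacement for reading the text yourself: it assumes page numbers sit alone on a line and that chapter headers look like "Chapter 1" or "CHAPTER IV". Your scan may well differ. The function name `clean_ocr_text` is hypothetical.

```shell
# Hypothetical first-pass cleanup: reads OCR text on stdin, writes Markdown-ish text to stdout.
clean_ocr_text() {
    awk '
        # Drop lines that are only a page number, e.g. "12".
        /^[ \t]*[0-9]+[ \t]*$/ { next }
        # Prefix lines such as "Chapter 3" or "CHAPTER IV" with a Markdown header marker.
        /^[ \t]*(Chapter|CHAPTER)[ \t]+[0-9IVXLC]+/ {
            sub(/^[ \t]+/, "")
            print "# " $0
            next
        }
        # Pass everything else through untouched.
        { print }
    '
}
```

Run it as `clean_ocr_text < mybook.txt > mybook.clean.txt`, compare the two files, and only replace `mybook.txt` once you're happy with the result. Running headers and footers vary too much from book to book to guess at here, so those are still yours to remove.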
Then:
pandoc mybook.txt -o mybook.epub
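If you end up doing this for more than one book, the commands above can be wrapped in a couple of shell functions. This is only a sketch: the names `run`, `pdf2txt` and `txt2epub` and the `DRY_RUN` variable are inventions for this example, and the conversion stays split in two because the manual clean-up has to happen in between.

```shell
# Print a command, then execute it unless DRY_RUN=1 is set.
run() {
    printf '%s\n' "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Hypothetical wrapper: scanned PDF -> TIFF -> plain text.
# $1 = input PDF, $2 = output basename, $3 = Tesseract language (default: eng)
pdf2txt() {
    pdf=$1
    base=$2
    ocr_lang=${3:-eng}
    run gs -q -r300x300 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH \
        "-sOutputFile=$base.tif" "$pdf" -c quit &&
    run tesseract -l "$ocr_lang" "$base.tif" "$base"
}

# Hypothetical wrapper: cleaned-up text -> EPUB.
# $1 = basename shared by the .txt input and the .epub output
txt2epub() {
    run pandoc "$1.txt" -o "$1.epub"
}
```

Usage would be `pdf2txt myscan.pdf mybook`, then the manual clean-up of `mybook.txt`, then `txt2epub mybook`. Setting `DRY_RUN=1` first prints the commands without running them, which is handy for checking the file names before a long OCR run.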
Note: brew no longer supports the `--all-languages` flag for tesseract. See https://tesseract-ocr.github.io/tessdoc/Installation.html#homebrew for the current install options, depending on your needs.