@Juul
Created July 13, 2019 22:07
Guide on book scanning using camera-based scanner like the one at noisebridge

Today I went to noisebridge and scanned Wonders Through the Microscope and it is now available for download on the freestore! I will likely also generate a color version of the PDF but the book is in black and white so it really doesn't add much.

This is my first attempt at book scanning and post-processing using only open source tools so I thought I'd share this book scanning guide based on my experiences :)

Scan

I used the noisebridge book scanner hardware and their python script which remote-controls two DSLR cameras and downloads the photos over USB. It worked very well.

Note that in addition to gphoto2 you will also need jpegtran. This command will get you both:

apt install gphoto2 libjpeg-turbo-progs
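
Before a scanning session it can be worth confirming that both tools are on your PATH and that both cameras are detected. A minimal sketch of such a check follows. The function names `missing_tools` and `camera_count` are made up for this example, and the camera count assumes that `gphoto2 --auto-detect` prints its usual two header lines followed by one line per camera.

```shell
# Print the name of every listed tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Count cameras seen by gphoto2 (assumes two header lines in the output).
camera_count() {
  gphoto2 --auto-detect | tail -n +3 | grep -c .
}
```

`missing_tools gphoto2 jpegtran` prints nothing when both are installed, and `camera_count` should print 2 when both DSLRs are plugged in and switched on.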

Cleanup

I used the excellent ScanTailor (latest stable version) which will automatically rotate, split pages, dewarp, pull out the actual content, add margins and optionally convert to black and white.

It isn't perfect but makes it really easy to manually fix any problems. The dewarp is still experimental and failed for me on about one in ten pages. If anyone knows of a better dewarper let me know!

OCR

For OCR I first tried using ocrfeeder which is a front-end for Tesseract. Unfortunately it didn't work for me at all. I even compiled the latest version from git since the last release was three years ago and there are lots of recent git commits.

I ended up using Tesseract directly. You should be able to install Tesseract from your distro's repository, but the 4.0 alpha isn't in any distro repos yet, so I compiled the latest version from source. This requires first compiling Leptonica from source and afterwards downloading the appropriate Tesseract training data file to /usr/local/share/tessdata. I recommend downloading the following:

You will also need to manually copy tessdata/pdf.ttf from the git repo to /usr/local/share/tessdata.

Now you can simply do:

ls *.tif | tesseract -c tessedit_create_pdf=1 - ./book

The above command may differ for Tesseract versions older than 4.x.
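
One thing to watch with `ls *.tif`: names are sorted as plain strings, so if your file names are not zero-padded, `page_10.tif` lands before `page_2.tif` and the pages end up out of order in the PDF. The sketch below lists the pages in natural numeric order instead. The function name `list_pages` and the `page_N.tif` naming are assumptions for illustration, and `sort -V` is a GNU coreutils option that may be missing elsewhere.

```shell
# list_pages DIR: print the *.tif files in DIR, one per line,
# in natural (version) order so page_2 sorts before page_10.
list_pages() {
  for f in "$1"/*.tif; do
    # Skip the unexpanded pattern when DIR has no .tif files.
    [ -e "$f" ] && printf '%s\n' "$f"
  done | sort -V
}
```

It slots into the same pipeline as before: `list_pages . | tesseract -c tessedit_create_pdf=1 - ./book`.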

You may get warnings saying "Image too small to scale" and "Line cannot be recognized". These can be ignored.

The result is a single book.pdf.
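
A quick sanity check at this point is to compare the number of pages in book.pdf with the number of input images. The sketch below assumes `pdfinfo` from poppler-utils is installed (it is not otherwise used in this guide) and that every .tif holds exactly one page. The function names `pdf_pages` and `check_pages` are made up for this example.

```shell
# pdf_pages FILE: print the page count reported by pdfinfo.
pdf_pages() {
  pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# check_pages FILE DIR: succeed only if FILE has one page per *.tif in DIR.
check_pages() {
  want=0
  for f in "$2"/*.tif; do
    [ -e "$f" ] && want=$((want + 1))
  done
  got=$(pdf_pages "$1")
  echo "images: $want, pdf pages: ${got:-unknown}"
  [ "$want" = "$got" ]
}
```

For example, `check_pages book.pdf . || echo "page count mismatch"`.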

Compress PDF

For color and grayscale

You can use shrinkpdf to compress the PDF, but this doesn't work with monochrome images unless you have a really bleeding-edge version of Ghostscript (see this bug). Unfortunately, if you followed this guide and used the black and white output from ScanTailor, then you have monochrome images.

If you still want to use shrinkpdf, first install ghostscript:

apt install ghostscript

Then run shrinkpdf:

./shrinkpdf.sh book.pdf book.small.pdf

If you get a segmentation fault then you have a really old version of ghostscript.
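
If you would rather call Ghostscript directly, the sketch below shows the general kind of pdfwrite invocation that scripts like this wrap. It is not copied from shrinkpdf.sh: the function name `gs_shrink`, the choice of flags and the 150 dpi default are assumptions made for this example, and the same monochrome caveat applies.

```shell
# gs_shrink IN.pdf OUT.pdf [DPI]
# Rewrite a PDF with color and grayscale images downsampled to DPI (default 150).
gs_shrink() {
  src=$1
  dst=$2
  dpi=${3:-150}
  gs -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite \
    -dDownsampleColorImages=true -dColorImageResolution="$dpi" \
    -dDownsampleGrayImages=true -dGrayImageResolution="$dpi" \
    -sOutputFile="$dst" "$src"
}
```

For example, `gs_shrink book.pdf book.small.pdf 150`. A lower DPI gives a smaller file at the cost of image quality.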

For monochrome

You can try ps2pdf but it only shrank my 13.7 MB file to 12.1 MB:

ps2pdf book.pdf book.small.pdf
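
Going from 13.7 MB to 12.1 MB is a reduction of only about 12%. To compare compression attempts on your own files, a small helper like the following can print the saving. The name `percent_saved` is made up for this example.

```shell
# percent_saved ORIGINAL COMPRESSED: print the size reduction as a percentage.
percent_saved() {
  before=$(( $(wc -c < "$1") ))
  after=$(( $(wc -c < "$2") ))
  # Avoid dividing by zero on an empty original.
  [ "$before" -gt 0 ] || return 1
  awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f\n", (b - a) * 100 / b }'
}
```

For example, `percent_saved book.pdf book.small.pdf`. A negative number means the "compressed" file came out larger than the original.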

Set PDF metadata

It's a good idea to set the book title and author in the PDF metadata. You can use exiftool for this:

sudo apt install exiftool
exiftool -Title="My Book's Title" -Author="My Book's Author" book.small.pdf
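
By default exiftool keeps a backup copy of the file next to it (book.small.pdf_original). The sketch below wraps the command above so that it skips the backup with `-overwrite_original` and then reads the two tags back to confirm they were written. The function name `set_pdf_meta` and its argument order are assumptions made for this example.

```shell
# set_pdf_meta FILE TITLE AUTHOR
# Write the title and author, then print them back from the file.
set_pdf_meta() {
  exiftool -overwrite_original -Title="$2" -Author="$3" "$1" &&
    exiftool -Title -Author "$1"
}
```

For example, `set_pdf_meta book.small.pdf "My Book's Title" "My Book's Author"`. Leave out `-overwrite_original` if you would rather keep the backup.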

Convert to epub

I used the ebook-convert command included in the latest release of Calibre:

ebook-convert book.small.pdf book.small.epub --enable-heuristics --smarten-punctuation

Be aware that this can take a long time. It took ~20 mins on my laptop and it was stuck at 1% completion for most of that time.

ebook-convert has a lot of options. They are documented here.

You may have issues with the ebook-convert tool if your book has more than one column of text per page.

Issues

The PDF created by Tesseract does not have a font specified, so when you drag to select while viewing the PDF the selection shows up as a bunch of black squares rather than the actual text. Copy-paste and search still work, so it's a minor issue, but it's still annoying.

It would be cool to have a table of contents with links to the appropriate pages but that would probably have to be done manually.
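
One possible way to make the manual part less painful is to add PDF bookmarks (an outline in the viewer's sidebar) with Ghostscript pdfmark entries. The sketch below turns a hand-written list of "PAGE TITLE" lines into pdfmark entries. The function name `make_pdfmarks` and the input format are assumptions made for this example, page numbers are positions in the PDF rather than the numbers printed in the book, and titles are assumed to be plain ASCII. This is untested on the book above.

```shell
# make_pdfmarks: read "PAGE TITLE" lines on stdin and print one
# pdfmark outline entry per line on stdout.
make_pdfmarks() {
  while read -r page title; do
    [ -n "$page" ] || continue
    # Parentheses and backslashes must be escaped inside a PostScript string.
    esc=$(printf '%s' "$title" | sed 's/[\\()]/\\&/g')
    printf '[/Title (%s) /Page %s /OUT pdfmark\n' "$esc" "$page"
  done
}
```

With the list saved as toc.txt, run `make_pdfmarks < toc.txt > toc.pdfmark` and then `gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=book.toc.pdf book.small.pdf toc.pdfmark`. Ghostscript rewrites the whole file, so check afterwards that the images and the text layer survived.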

Other tools

I came across unpaper but haven't yet tried it out.
