Today I went to Noisebridge and scanned Wonders Through the Microscope, and it is now available for download on the freestore! I will likely also generate a color version of the PDF, but the book is in black and white so a color version wouldn't add much.
This is my first attempt at book scanning and post-processing using only open source tools so I thought I'd share this book scanning guide based on my experiences :)
I used the Noisebridge book scanner hardware and their Python script, which remote-controls two DSLR cameras and downloads the photos over USB. It worked very well.
Note that in addition to gphoto2 you will also need jpegtran. This command will get you both:
apt install gphoto2 libjpeg-turbo-progs
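Before heading to the scanner it may be worth confirming that both tools actually ended up on your PATH. This is just a minimal sketch; the `missing_tools` helper is a name I made up for illustration and is not part of the Noisebridge script:

```shell
# Hypothetical helper: print every named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The two tools mentioned above.
missing=$(missing_tools gphoto2 jpegtran)
if [ -n "$missing" ]; then
    echo "Not installed: $missing"
else
    echo "gphoto2 and jpegtran are both available"
fi
```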
I used the excellent ScanTailor (latest stable version) which will automatically rotate, split pages, dewarp, pull out the actual content, add margins and optionally convert to black and white.
It isn't perfect but makes it really easy to manually fix any problems. The dewarp is still experimental and failed for me on about one in ten pages. If anyone knows of a better dewarper let me know!
For OCR I first tried using ocrfeeder which is a front-end for Tesseract. Unfortunately it didn't work for me at all. I even compiled the latest version from git since the last release was three years ago and there are lots of recent git commits.
I ended up using Tesseract directly. You should be able to install Tesseract from your distro's repository, but the 4.0 alpha isn't in any distro repos yet, so I compiled the latest version from source. This requires first compiling Leptonica from source, and afterwards downloading the appropriate Tesseract training data file to /usr/local/share/tessdata. I recommend downloading the following:
You will also need to manually copy tessdata/pdf.ttf from the git repo to /usr/local/share/tessdata.
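A quick sanity check of the tessdata directory can save a failed OCR run later. The following is only a sketch: `check_tessdata` is a hypothetical helper name, and it merely checks that pdf.ttf and at least one .traineddata file are present, not that they are the right ones for your book:

```shell
# Hypothetical helper: report whether a tessdata directory contains
# pdf.ttf and at least one language model. Returns non-zero if not.
check_tessdata() {
    dir=$1
    rc=0
    if [ ! -f "$dir/pdf.ttf" ]; then
        echo "missing: $dir/pdf.ttf"
        rc=1
    fi
    # Any .traineddata file passes this check; which languages you
    # need depends on your book.
    if ! ls "$dir"/*.traineddata >/dev/null 2>&1; then
        echo "missing: no .traineddata files in $dir"
        rc=1
    fi
    return $rc
}

# Path taken from the text above.
check_tessdata /usr/local/share/tessdata || echo "tessdata looks incomplete"
```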
Now you can simply do:
ls *.tif | tesseract -c tessedit_create_pdf=1 - ./book
The above command may be different for versions older than 4.x.
You may get warnings saying "Image too small to scale" and "Line cannot be recognized". These can be ignored.
The result is a single book.pdf.
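One thing to watch out for: the pages end up in the PDF in the order the filenames are listed, and plain `ls` puts `page_10.tif` before `page_2.tif`. If your filenames are not zero-padded, something like the following might help. It is a sketch that assumes GNU `sort` (for the `-V` option) and filenames without newlines; `list_pages` is a name I made up:

```shell
# Hypothetical helper: list the .tif files in a directory in natural
# (version) order, one per line. Requires GNU sort for -V.
list_pages() {
    ( cd "$1" && printf '%s\n' *.tif | sort -V )
}

# Usage sketch, mirroring the tesseract command above (commented out
# so that nothing runs by accident):
#   list_pages . | tesseract -c tessedit_create_pdf=1 - ./book
```

If the directory contains no .tif files the helper just prints the unexpanded pattern, so it is worth eyeballing the list before feeding it to Tesseract.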
You can use shrinkpdf to compress the PDF, but this doesn't work with monochrome images unless you have a really bleeding edge version of ghostscript (see this bug). Unfortunately, if you followed this guide and used the black and white output from ScanTailor, then you have monochrome images.
If you still want to use shrinkpdf, first install ghostscript:
apt install ghostscript
Then run shrinkpdf:
./shrinkpdf.sh book.pdf book.small.pdf
If you get a segmentation fault then you have a really old version of ghostscript.
You can try ps2pdf, but it only shrank my 13.7 MB file to 12.1 MB:
ps2pdf book.pdf book.small.pdf
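To see whether a compression step was worth it, you can compare the file sizes before and after. For my file the saving was a bit under 12%. A rough sketch, with `percent_saved` being a hypothetical helper name of mine:

```shell
# Hypothetical helper: print how many percent smaller file $2 is than
# file $1 (integer arithmetic, rounded down). Assumes $1 is not empty.
percent_saved() {
    before=$(wc -c < "$1")
    after=$(wc -c < "$2")
    echo $(( (before - after) * 100 / before ))
}

# Only report if both files from the step above exist.
if [ -f book.pdf ] && [ -f book.small.pdf ]; then
    echo "saved $(percent_saved book.pdf book.small.pdf)%"
fi
```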
It's a good idea to set the book title and author in the pdf metadata. You can use exiftool for this:
sudo apt install exiftool
exiftool -Title="My Book's Title" -Author="My Book's Author" book.small.pdf
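To double-check that the tags were written, you can ask exiftool to print them back. Also note that exiftool keeps a backup of the unmodified file next to it (book.small.pdf_original) unless you pass -overwrite_original. The `show_pdf_metadata` wrapper below is just an illustrative name of mine:

```shell
# Hypothetical helper: print the Title and Author tags of a PDF.
# Returns non-zero if exiftool is missing or the file cannot be read.
show_pdf_metadata() {
    if ! command -v exiftool >/dev/null 2>&1; then
        echo "exiftool is not installed" >&2
        return 127
    fi
    exiftool -Title -Author "$1"
}

# Only query the file from the step above if it is present.
if [ -f book.small.pdf ]; then
    show_pdf_metadata book.small.pdf || echo "could not read metadata"
fi
```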
To convert the PDF to EPUB I used the ebook-convert command included in the latest release of Calibre:
ebook-convert book.small.pdf book.small.epub --enable-heuristics --smarten-punctuation
Be aware that this can take a long time. It took ~20 mins on my laptop and it was stuck at 1% completion for most of that time.
ebook-convert has a lot of options. They are documented here.
You may have issues with the ebook-convert tool if your book has more than one column of text per page.
The PDF created by Tesseract does not have a font specified, so when you drag to select while viewing the PDF, the selection shows up as a bunch of black squares rather than the actual text. Copy-paste and search still work, so it's a minor issue but still annoying.
It would be cool to have a table of contents with links to the appropriate pages but that would probably have to be done manually.
I came across unpaper but haven't yet tried it out.