@matthiasg
Last active August 29, 2015 14:07
Tesseract and leptonica on SmartOS

Download

Leptonica from: http://www.leptonica.org/download.html

Tesseract from: https://code.google.com/p/tesseract-ocr/

Leptonica

Configure

Extract the archive with tar, then run configure from the root of the extracted folder:

tar xzf leptonica-1.71.tar.gz
cd leptonica-1.71
CFLAGS="-D__SOLARIS__" ./configure --prefix=/opt/local

Compile

make
make install

Compile tesseract

Install Tools

pkgin in autoconf automake libtool

Configure

./autogen.sh
LIBLEPT_HEADERSDIR="/opt/local/include/leptonica" CFLAGS="-D__SOLARIS__" ./configure --prefix="/opt/local"

Update the Makefiles manually (see https://code.google.com/p/tesseract-ocr/issues/detail?id=915 and https://code.google.com/p/tesseract-ocr/issues/detail?id=582)

Find LIBS = -llept in ./Makefile, ./api/Makefile and ./training/Makefile and change it to

LIBS = -llept -lsocket -lnsl -lrt -lxnet
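
The manual edit above could be scripted along these lines. This is a minimal sketch, not part of the original instructions: the helper name patch_libs is made up for illustration, and it assumes the LIBS line contains nothing but the Leptonica library (the pattern tolerates both the -lept and -llept spellings). It writes through a temporary file instead of relying on sed -i, which not every sed supports.

```shell
# Hypothetical helper: append the Solaris networking libraries to LIBS
# in each Makefile passed as an argument.
patch_libs() {
    for mk in "$@"; do
        # Match "LIBS = -llept" (or "-lept") with nothing else on the line.
        sed 's/^LIBS = -ll*ept *$/LIBS = -llept -lsocket -lnsl -lrt -lxnet/' \
            "$mk" > "$mk.tmp" && mv "$mk.tmp" "$mk"
    done
}

# Usage, from the tesseract source root after ./configure:
#   patch_libs ./Makefile ./api/Makefile ./training/Makefile
```

If a Makefile's LIBS line already lists other libraries, the pattern will not match and the file is left unchanged, so it is worth checking the result with grep '^LIBS' afterwards.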

Compile

make
make install

The tesseract binary should now be on your PATH.
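
A quick way to confirm this is sketched below. The helper name on_path is hypothetical; it only uses the standard command -v lookup and assumes nothing about where make install put the binary.

```shell
# Hypothetical helper: report whether a command is reachable via PATH.
on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# Usage:
#   on_path tesseract && tesseract -v
```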

Use Tesseract.

Set environment variable for tesseract to find training data

export TESSDATA_PREFIX=/opt/local/share/tessdata

Note: Tesseract 3.x appends tessdata/ to this value itself, so if the language files are not found, try pointing the variable at the parent directory (/opt/local/share/) instead.

Download Tesseract training data

For example:

wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.deu.tar.gz
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.ara.tar.gz
tar xzfv tesseract-ocr-3.02.eng.tar.gz
tar xzfv tesseract-ocr-3.02.deu.tar.gz
tar xzfv tesseract-ocr-3.02.ara.tar.gz
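
The repeated wget and tar calls above could be folded into a loop, roughly as follows. This is a sketch under assumptions: the function names tessdata_url and fetch_langs are invented here, and the URL pattern is simply copied from the commands above, so it only works for as long as those files stay available at that location.

```shell
# Hypothetical helpers for fetching the 3.02 language archives.
tessdata_url() {
    # $1 is a language code such as eng, deu or ara.
    echo "https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.$1.tar.gz"
}

fetch_langs() {
    for lang in "$@"; do
        archive="tesseract-ocr-3.02.$lang.tar.gz"
        # Stop at the first failed download or extraction.
        wget "$(tessdata_url "$lang")" && tar xzf "$archive" || return 1
    done
}

# Usage:
#   fetch_langs eng deu ara
```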

Install Tessdata

cp tesseract-ocr/tessdata/eng* /opt/local/share/tessdata
cp tesseract-ocr/tessdata/fra* /opt/local/share/tessdata
cp tesseract-ocr/tessdata/deu* /opt/local/share/tessdata
cp tesseract-ocr/tessdata/ara* /opt/local/share/tessdata

Note: fra and some other languages are included in the English training data archive, which is why fra is copied above without a separate download.
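
To check that the copy step put the expected files in place, something like the following may help. It is a minimal sketch: check_langs is a made-up name, and it assumes each language ships a file named LANG.traineddata, as the 3.02 archives do.

```shell
# Hypothetical helper: report which languages have a .traineddata file
# in the given tessdata directory. Returns non-zero if any is missing.
check_langs() {
    dir=$1
    shift
    missing=0
    for lang in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            echo "$lang: ok"
        else
            echo "$lang: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_langs /opt/local/share/tessdata eng fra deu ara
```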

Use Tesseract

Example with a JPEG file (PNG does not seem to work, possibly because of a missing feature upstream)

tesseract test6-000005.jpg test6-000005 -l deu

Optionally, add makebox or hocr to the end of the command to create a box file or hOCR file containing the coordinates of the recognized text.
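
For more than a handful of images, the command can be wrapped in a loop. The sketch below is an illustration only: ocr_dir is a hypothetical name, and it assumes the images are .jpg files in a single directory and that tesseract is on the PATH.

```shell
# Hypothetical helper: OCR every .jpg in a directory.
# $1 = directory, $2 = language code, $3 = optional config (makebox or hocr)
ocr_dir() {
    for img in "$1"/*.jpg; do
        [ -e "$img" ] || continue    # no matches: the glob stays literal
        base=${img%.jpg}             # output name without the extension
        # $3 is intentionally unquoted so an empty value adds no argument.
        tesseract "$img" "$base" -l "$2" $3
    done
}

# Usage:
#   ocr_dir ./scans deu hocr
```

Each output file is written next to its source image, using the same base name as in the single-file example.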
