@tpmccallum
Last active January 19, 2021 00:56
A guide to installing Tesseract OCR on CentOS7 and configuring new languages

Installing Tesseract 4.0.0 on CentOS7

The goal of this gist is to show how to use a CentOS7 system with root access to build a statically compiled tesseract binary, which can then be copied to and used on a CentOS7 system with no root access.

Question: Why would we want to do this?

Answer: In some cases you might want to use tesseract on a machine hosted by a cloud provider. For security reasons, some machines on a cloud provider's infrastructure do not allow root access to the remote guest (you).

Solution: Log into a CentOS7 machine where we do have root access (this can be any machine, e.g. a VM on your local machine). On that machine, install all of the dependencies and build the tesseract executable. Then copy just the tesseract executable over to the machine with no root access and make sure that it is on that machine's path. Voilà!

The CentOS7 machine with root access

For this task you can use any CentOS7 machine on which you have root access, such as your own machine or a local VM.

Install leptonica dependencies

sudo yum install zlib
sudo yum install zlib-devel
sudo yum install libjpeg
sudo yum install libjpeg-devel
sudo yum install libwebp
sudo yum install libwebp-devel
sudo yum install libtiff
sudo yum install libtiff-devel
sudo yum install libpng
sudo yum install libpng-devel
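
The ten install commands above follow one pattern: each library plus its -devel package. As a minimal sketch, they can be collapsed into a single yum call. The `libs` and `pkgs` variable names are illustrative, and the script only prints the command (a dry run) so that it can be reviewed first.

```shell
#!/usr/bin/env bash
# Library names taken from the individual install commands above.
libs="zlib libjpeg libwebp libtiff libpng"

# Pair each runtime package with its -devel counterpart.
pkgs=""
for lib in $libs; do
  pkgs="$pkgs $lib $lib-devel"
done
pkgs="${pkgs# }"   # drop the leading space

# Dry run: print the command rather than executing it.
echo "sudo yum install -y $pkgs"
```

Copy the printed command into the shell to perform the install. The build steps that follow also assume that git, a C/C++ compiler and the autotools (autoconf, automake, libtool) are already present; this gist does not list them, so treat that as an assumption to check on your machine.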

Install leptonica from source

cd /home/azureuser
git clone https://github.com/DanBloomberg/leptonica.git --depth 1
cd /home/azureuser/leptonica
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp  --with-libtiff --with-libpng
make
sudo make install
sudo ldconfig
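
Because the configure step asked for a static build, it can be worth confirming that a static archive actually landed under the install prefix before moving on. The sketch below is one way to check. The `has_static_lib` helper is hypothetical, and the archive name `lept` is an assumption, since the file name depends on the leptonica version that was cloned.

```shell
#!/usr/bin/env bash
# has_static_lib DIR NAME: succeed if DIR/libNAME.a exists.
has_static_lib() {
  [ -f "$1/lib$2.a" ]
}

# "lept" is an assumed archive name; adjust it to whatever your
# leptonica version installs under /usr/local/lib.
if has_static_lib /usr/local/lib lept; then
  echo "static leptonica archive found"
else
  echo "no liblept.a in /usr/local/lib; list the directory to see what was installed"
fi
```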

Install tesseract from source

wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0.tar.gz -O tesseract-4.0.0.tar.gz
tar xvvfz tesseract-4.0.0.tar.gz
cd tesseract-4.0.0
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/
make
sudo make install
sudo ldconfig
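
Before copying the binary elsewhere, it helps to see which shared libraries it still needs at run time. The following is a rough check, not part of the original steps. It filters `ldd` output for leptonica or tesseract shared objects; the `needs_shared_lib` helper is a hypothetical name introduced here.

```shell
#!/usr/bin/env bash
# needs_shared_lib PATTERN: read `ldd` output on stdin and succeed if any
# line names a shared object matching lib(PATTERN)*.so
needs_shared_lib() {
  grep -Eq "lib($1)[^ ]*\.so"
}

bin=/usr/local/bin/tesseract
# Only run the check when the binary and ldd are both available.
if [ -x "$bin" ] && command -v ldd >/dev/null 2>&1; then
  if ldd "$bin" | needs_shared_lib 'lept|tesseract'; then
    echo "still depends on shared leptonica or tesseract libraries"
  else
    echo "leptonica and tesseract are linked statically"
  fi
fi
```

Any other libraries that `ldd` lists for the binary must also exist on the machine without root access, so it is worth comparing that list against what is installed there.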

Copy executable to a non-root CentOS7 machine

First, copy the tesseract binary from the machine with root access to the non-root CentOS7 machine. The commands below are run on the non-root machine and pull the binary into the current directory.

cd /home/azureuser/tess
scp -i ~/.ssh/key.pem -rp [email protected]:/usr/local/bin/tesseract .

Then ensure that the tesseract binary is on the PATH of the non-root machine by adding the following export statement to the ~/.bash_profile file.

export PATH="$PATH:/home/azureuser/tess"
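
If this setup is scripted, appending the export line blindly would add a duplicate on every run. The sketch below appends it only when it is missing. `add_path_line` is a hypothetical helper, not something the gist defines, and the real call is left as a comment so that nothing is modified until you choose to run it.

```shell
#!/usr/bin/env bash
# add_path_line FILE DIR: append an export line for DIR to FILE,
# unless exactly that line is already present.
add_path_line() {
  line="export PATH=\"\$PATH:$2\""
  touch "$1"
  grep -qxF -- "$line" "$1" || printf '%s\n' "$line" >> "$1"
}

# Example call for the directory used in this gist:
# add_path_line "$HOME/.bash_profile" /home/azureuser/tess
```

The change takes effect at the next login, or immediately after running `source ~/.bash_profile`.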

Load tesseract languages

Create a new location to store the traineddata and then export that location as the TESSDATA_PREFIX.

mkdir -p /home/azureuser/tess/traineddata
export TESSDATA_PREFIX=/home/azureuser/tess/traineddata

Copy any of the .traineddata files that you need from the tesseract-ocr/tessdata repository on GitHub to the above TESSDATA_PREFIX location.

cd $TESSDATA_PREFIX
wget https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
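
A failed download is easy to miss here, so a quick check that each language file is present can save a confusing error later. This is a small sketch with a hypothetical `missing_langs` helper; the fallback directory is the one created above.

```shell
#!/usr/bin/env bash
# missing_langs DIR LANG...: print each LANG that has no
# DIR/LANG.traineddata file.
missing_langs() {
  dir=$1
  shift
  for lang in "$@"; do
    [ -f "$dir/$lang.traineddata" ] || echo "$lang"
  done
}

# Check the two languages downloaded above; prints nothing if both exist.
missing_langs "${TESSDATA_PREFIX:-/home/azureuser/tess/traineddata}" fra eng
```

Once the binary is on the PATH, `tesseract --list-langs` reports the languages that tesseract itself can see.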

Try tesseract

Grab an image such as this French lunch menu example: https://second-state.github.io/wasm-learning/faas/ocr/html/a_french_lunch_menu.png

cd /home/azureuser/tess
wget https://second-state.github.io/wasm-learning/faas/ocr/html/a_french_lunch_menu.png

Then run the tesseract command and pass in the image as the first parameter

cd /home/azureuser/tess
./tesseract a_french_lunch_menu.png stdout --dpi 70 -l fra
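
Passing `stdout` as the second argument prints the recognised text to the terminal. Passing an output base name instead makes tesseract write the text to a file with that name plus `.txt`. The sketch below does this for every PNG in the current directory. It assumes tesseract is on the PATH and reuses the `--dpi 70 -l fra` options from the command above; `output_base` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# output_base IMAGE: strip the directory and extension from IMAGE,
# giving a name to use as tesseract's output base.
output_base() {
  name=${1##*/}
  printf '%s\n' "${name%.*}"
}

# Skip quietly when tesseract is not installed on this machine.
if command -v tesseract >/dev/null 2>&1; then
  for img in *.png; do
    [ -f "$img" ] || continue   # no PNG files matched the glob
    # Writes <base>.txt next to the image, e.g. a_french_lunch_menu.txt
    tesseract "$img" "$(output_base "$img")" --dpi 70 -l fra
  done
fi
```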