Connect to an AWS EC2 instance running an Amazon Linux AMI, then follow these steps:
1.Update system libraries
sudo yum -y update
sudo yum -y upgrade
2.Compile Leptonica
cd ~
sudo yum install clang -y
sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
tar -xzvf leptonica-1.75.1.tar.gz
cd leptonica-1.75.1
./configure && make && sudo make install
3.Compile autoconf-archive
cd ~
wget http://mirror.squ.edu.om/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
tar -xvf autoconf-archive-2017.09.28.tar.xz
cd autoconf-archive-2017.09.28
./configure && make && sudo make install
sudo cp m4/* /usr/share/aclocal/
4.Compile tesseract
cd ~
sudo yum install git-core libtool pkgconfig -y
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
sudo make install
5.Install Python packages
cd ~
virtualenv ~/tfenv
source ~/tfenv/bin/activate
pip install pillow
pip install cython
pip install opencv-python==3.4.2.16
pip install tesserocr
pip install pytesseract
6.Copy Files
cd ~
mkdir tesseract-aws
cd tesseract-aws
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.5 lib/
cp /usr/local/lib/liblept.so.5 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
cp -r ~/tfenv/lib/python2.7/site-packages/* .
cp -r ~/tfenv/lib64/python2.7/site-packages/* .
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
mkdir configs
cp /usr/local/share/tessdata/configs/pdf configs/
cp /usr/local/share/tessdata/pdf.ttf .
cd ..
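Before zipping, it can help to confirm that everything step 6 copies actually landed in the package directory. The following is a minimal sketch, not part of the original procedure: the helper name `missing_files` and the `EXPECTED` list are illustrative, taken from the copy commands above, and the library version suffixes may differ on your build. Empty files are reported too, since a failed download can leave a zero-byte file behind.

```python
import os

# Relative paths that step 6 is expected to produce inside tesseract-aws/.
# Assumption: the version suffixes match the copy commands in this guide.
EXPECTED = [
    "tesseract",
    "lib/libtesseract.so.5",
    "lib/liblept.so.5",
    "lib/libjpeg.so.62",
    "lib/libwebp.so.4",
    "lib/libstdc++.so.6",
    "tessdata/osd.traineddata",
    "tessdata/eng.traineddata",
    "tessdata/configs/pdf",
    "tessdata/pdf.ttf",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that are absent or empty under root."""
    missing = []
    for rel in expected:
        path = os.path.join(root, rel)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(rel)
    return missing
```

Calling `missing_files(os.path.expanduser("~/tesseract-aws"))` should return an empty list when the package is complete.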
7.Create a Python file for testing
vim lambda_function.py
Paste in the code below. tesseract_cmd must point to the tesseract binary, which Lambda unpacks to /var/task:
import io
from base64 import b64decode

import PIL.Image
import pytesseract


def lambda_handler(event, context):
    # Lambda extracts the deployment package to /var/task
    pytesseract.pytesseract.tesseract_cmd = "/var/task/tesseract"
    # The event carries the image as a base64-encoded string
    binary = b64decode(event['image64'])
    image = PIL.Image.open(io.BytesIO(binary))
    text = str(pytesseract.image_to_string(image))
    return {'text': text}
8.Create a zip file
zip -r ~/tesseract-aws.zip *
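Lambda limits a .zip deployment package to 50 MB zipped for direct upload and 250 MB unzipped, which is a common reason to stage the archive in S3 rather than upload it directly. A rough sketch for checking the archive against those limits follows; the function names `package_sizes` and `needs_s3` are illustrative.

```python
import os
import zipfile

# AWS Lambda deployment package quotas for .zip archives.
MAX_DIRECT_UPLOAD = 50 * 1024 * 1024   # zipped, direct upload
MAX_UNZIPPED = 250 * 1024 * 1024       # unzipped contents


def package_sizes(zip_path):
    """Return (zipped_bytes, unzipped_bytes) for a deployment archive."""
    zipped = os.path.getsize(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        unzipped = sum(info.file_size for info in archive.infolist())
    return zipped, unzipped


def needs_s3(zip_path):
    """True when the archive is too large for a direct upload to Lambda."""
    zipped, unzipped = package_sizes(zip_path)
    if unzipped > MAX_UNZIPPED:
        raise ValueError("package exceeds the 250 MB unzipped limit")
    return zipped > MAX_DIRECT_UPLOAD
```

For example, `needs_s3(os.path.expanduser("~/tesseract-aws.zip"))` reports whether the archive built above has to go through S3.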
9.Download the zip file from the EC2 instance to your local machine
scp -i key.pem ec2-user@AWS_EC2_INSTANCE_IP:/home/ec2-user/tesseract-aws.zip .
10.Upload this file to an S3 bucket and then use its URL in Lambda
11.Add an environment variable in Lambda with
Key TESSDATA_PREFIX
Value /var/task/tessdata
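Tesseract reads `TESSDATA_PREFIX` to find its language data, so a missing variable or a wrong path typically shows up as a failure to load `eng.traineddata`. As a hedged sketch, the handler could check the directory up front. `check_tessdata` is a hypothetical helper, and the `/var/task/tessdata` fallback simply mirrors the value above.

```python
import os


def check_tessdata(env=None, languages=("eng",)):
    """Return the tessdata directory, or raise if language data is missing."""
    env = os.environ if env is None else env
    # Fallback mirrors the value configured for the Lambda function.
    prefix = env.get("TESSDATA_PREFIX", "/var/task/tessdata")
    for lang in languages:
        path = os.path.join(prefix, lang + ".traineddata")
        if not os.path.isfile(path):
            raise RuntimeError("language data not found: " + path)
    return prefix
```

Calling `check_tessdata()` at the top of `lambda_handler` would turn a misconfigured variable into an explicit error message instead of a failure inside Tesseract.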
12.For testing, set the Handler to: lambda_function.lambda_handler
13.Test it using a JSON event like this
{
"image64": ""
}
The response should contain the recognized text, for example:
ABCDE
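The `image64` field holds the image file's bytes encoded as base64, which the handler decodes with `b64decode`. A minimal sketch for producing such an event from a local file follows; `build_event`, `event_json_from_file`, and the file name `sample.png` are illustrative, not part of the original setup.

```python
import json
from base64 import b64encode


def build_event(image_bytes):
    """Wrap raw image bytes in the event shape the handler expects."""
    return {"image64": b64encode(image_bytes).decode("ascii")}


def event_json_from_file(image_path):
    """Read an image file and return the test event as a JSON string."""
    with open(image_path, "rb") as f:
        return json.dumps(build_event(f.read()))
```

Printing `event_json_from_file("sample.png")` gives a JSON document that can be pasted into the Lambda test event editor.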
References
https://stackoverflow.com/questions/33588262/tesseract-ocr-on-aws-lambda-via-virtualenv
https://gist.github.com/barbolo/e59aa45ec8e425a26ec4da1086acfbc7
This is not working:
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
fails .....