- Download lots of fonts (e.g. `.ttf` files)
- `git clone https://github.com/tesseract-ocr/tesstrain/`
- `git clone https://github.com/tesseract-ocr/langdata_lstm`
- Install Tesseract
- Generate training data (the output goes to `train`, the directory the later steps read from):

      cd src
      python -m tesstrain \
        --langdata_dir /path/to/langdata_lstm \
        --linedata_only \
        --fonts_dir /path/to/fonts \
        --lang <yourlang> \
        --maxpages <N> \
        --save_box_tiff \
        --distort_image \
        --fontlist '<your font 1> <your font 2> ...' \
        --output_dir train
- Split `train/*.training_files.txt` into two files, `*.training_files.txt` and `*.eval_files.txt` (e.g. an 80 % / 20 % split)
- Create the language `.lstm` file:

      combine_tessdata -e <yourlang>.traineddata /some/path/<yourlang>.lstm
- Train (e.g. `<N> = 1000`):

      lstmtraining \
        --continue_from /some/path/<yourlang>.lstm \
        --model_output <your_new_model_name> \
        --traineddata /path/to/<yourlang>.traineddata \
        --train_listfile train/*.training_files.txt \
        --randomly_rotate \
        --max_iterations <N>
- Eval (BCER = character error rate, BWER = word error rate):

      lstmeval \
        --eval_listfile train/*.eval_files.txt \
        --traineddata /path/to/<yourlang>.traineddata \
        --model <your_new_model_name>_checkpoint
- Repeat the Train and Eval steps with increasing `<N>` until you are happy with the eval error rates.
- Convert the final checkpoint into a `.traineddata` model:

      lstmtraining \
        --stop_training \
        --continue_from <your_new_model_name>_checkpoint \
        --traineddata /path/to/<yourlang>.traineddata \
        --model_output <your_new_model_name>.traineddata
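
The split step is described only in words, so here is a minimal sketch of one way to do it. The function name `split_listfile` is made up for illustration, and the script assumes GNU `shuf` is available (it is not part of stock macOS). It shuffles the list so that the eval set is not dominated by whichever fonts happen to be listed last, then overwrites the original list with the training portion.

```sh
#!/bin/sh
# Sketch only: split a tesstrain list file into training and eval lists.
# split_listfile is a hypothetical helper, not part of tesstrain.
#
# Usage: split_listfile LISTFILE [EVAL_PERCENT]
#   LISTFILE      path ending in .training_files.txt
#   EVAL_PERCENT  share of lines moved to the eval list (default: 20)
split_listfile() (
    listfile=$1
    eval_percent=${2:-20}
    # train/foo.training_files.txt -> train/foo
    base=${listfile%.training_files.txt}

    # Shuffle into a temp file first, because LISTFILE is overwritten below.
    shuffled=$(mktemp)
    shuf "$listfile" > "$shuffled"

    total=$(wc -l < "$shuffled")
    n_eval=$(( total * eval_percent / 100 ))

    # First n_eval lines become the eval list, the rest the training list.
    head -n "$n_eval" "$shuffled" > "$base.eval_files.txt"
    tail -n +"$(( n_eval + 1 ))" "$shuffled" > "$base.training_files.txt"

    rm -f "$shuffled"
)
```

After sourcing the script, something like `split_listfile train/LANG.training_files.txt 20` (with `LANG` replaced by your language code) leaves an 80 % training list and a 20 % eval list next to each other in `train/`.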
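
Repeating the Train and Eval commands by hand gets tedious, so the loop could be scripted roughly as follows. This is a sketch under assumptions: `train_eval_loop`, `run` and the `DRY_RUN` variable are names invented here, the list files are assumed to share one prefix (e.g. `train/LANG`), and the `lstmtraining` / `lstmeval` flags are exactly the ones used in the steps above.

```sh
#!/bin/sh
# Sketch only: rerun the Train and Eval commands with a growing
# --max_iterations value. All names here are illustrative.

# Print the command instead of executing it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# Usage: train_eval_loop LSTM_FILE TRAINEDDATA MODEL_NAME LIST_PREFIX N [N...]
train_eval_loop() (
    lstm_file=$1 traineddata=$2 model=$3 list_prefix=$4
    shift 4
    # Remaining arguments are the iteration limits, in increasing order.
    for n in "$@"; do
        run lstmtraining \
            --continue_from "$lstm_file" \
            --model_output "$model" \
            --traineddata "$traineddata" \
            --train_listfile "$list_prefix.training_files.txt" \
            --randomly_rotate \
            --max_iterations "$n" || exit 1
        # Check the error rates printed here before trusting a larger N.
        run lstmeval \
            --eval_listfile "$list_prefix.eval_files.txt" \
            --traineddata "$traineddata" \
            --model "${model}_checkpoint" || exit 1
    done
)
```

Setting `DRY_RUN=1` before calling `train_eval_loop` prints the commands without running them, which is a cheap way to check the paths. The loop does not decide when to stop; that judgment ("until you are happy with the eval error rates") stays with you.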
Last active
August 13, 2024 02:38
Tesseract LSTM fine-tuning how-to