Created October 22, 2018 17:42
This gist walks step by step through creating a translator once you have a parallel corpus

Run the following in your terminal

## To download the OpenNMT library
git clone
cd OpenNMT-py

## Installing the requirements
pip install -r requirements.txt

## Downloading the data (if wget is not available, download it in a browser by going to the link)

## Extracting the files
mkdir -p ./data/fr-en
tar -xf fr-en.tgz -C data/fr-en/

## Clone this repository to get the Python script that creates the train, val and test splits
git clone
cd NMT-with-OpenNMT-Py

## This will create 3 divisions (train, val, test) of the dataset
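The split script itself is not shown here. As a rough illustration of what such a split involves, the sketch below shuffles sentence pairs together so that source and target lines stay aligned, then cuts them into train, val and test. The function name `split_parallel`, the split sizes and the seed are hypothetical and are not taken from the repository's script.

```python
import random


def split_parallel(src_lines, tgt_lines, val_size, test_size, seed=0):
    """Shuffle aligned sentence pairs and cut them into train/val/test.

    Hypothetical helper for illustration only; not the repository's script.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if val_size + test_size >= len(src_lines):
        raise ValueError("val and test sizes leave no data for training")

    # Zip first so that each source sentence stays with its translation.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    return {
        "val": pairs[:val_size],
        "test": pairs[val_size:val_size + test_size],
        "train": pairs[val_size + test_size:],
    }


# Tiny made-up corpus: line i of `en` is the translation of line i of `fr`.
en = [f"sentence {i}" for i in range(10)]
fr = [f"phrase {i}" for i in range(10)]
splits = split_parallel(en, fr, val_size=2, test_size=2)
```

Each split can then be written out as two files (for example `train.en` and its target-side counterpart), one sentence per line, in the same order.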

## Moving back to main folder
cd ..

## Install `perl`
## Run the default tokenizer script to tokenize the dataset
## You can use your own tokenizer, but remember to use the same tokenizer in production
## NOTE: This will take 5-10 minutes because of the size of the dataset
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/train.en > ./data/fr-en/train.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/ > ./data/fr-en/
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/val.en > ./data/fr-en/val.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/ > ./data/fr-en/
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/test.en > ./data/fr-en/test.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/ > ./data/fr-en/

## Creating the vocab using the OpenNMT-py preprocessing script
## NOTE: This will take 5-10 minutes because of the size of the dataset
python3 -train_src data/fr-en/train.en.atok -train_tgt data/fr-en/ -valid_src data/fr-en/val.en.atok -valid_tgt data/fr-en/ -save_data data/fr-en/fr-en.atok.low -lower
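To make the effect of `-lower` concrete, here is a minimal sketch of building a lowercased vocabulary from tokenized lines. It is not OpenNMT-py's implementation: the function name `build_vocab`, the `min_freq` parameter and the single `<unk>` entry are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(lines, lower=True, min_freq=1):
    """Map each token to an integer id, most frequent tokens first.

    Hypothetical helper for illustration; not OpenNMT-py's code.
    """
    counts = Counter()
    for line in lines:
        if lower:
            line = line.lower()
        # The input is already tokenized, so splitting on whitespace is enough.
        counts.update(line.split())

    # Sort by descending frequency, then alphabetically so the order is stable.
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    vocab = {"<unk>": 0}  # assumed placeholder id for out-of-vocabulary tokens
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab


vocab = build_vocab(["The cat sat", "the dog sat"])
```

With lowercasing, `The` and `the` share one entry, which shrinks the vocabulary. The same lowercasing then has to be applied to any text sent to the model later.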

## Training the model

## ====   If using GPU    ==== ##
python3 -data data/fr-en/fr-en.atok.low -save_model data/fr-en/ckt/fr-en_model -gpu_ranks 0 -enc_layers 2 -dec_layers 2 -optim adam -learning_rate 0.001 -learning_rate_decay 1 -train_steps 10000

## ====   Only CPU    ==== ##
python3 -data data/fr-en/fr-en.atok.low -save_model data/fr-en/ckt/fr-en_model -enc_layers 2 -dec_layers 2 -optim adam -learning_rate 0.001 -learning_rate_decay 1 -train_steps 10000

## Translate with the trained model
python -gpu 0 -model data/fr-en/ckt/fr-en_model_* -src data/fr-en/test.en.atok -tgt data/fr-en/ -replace_unk -verbose -output multi30k.test.pred.atok
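The `-replace_unk` option swaps each generated unknown token for the source token that received the highest attention weight. The sketch below shows the idea on plain Python lists. It is an illustration under assumptions, not OpenNMT-py's code: the function name `replace_unk`, the `<unk>` string and the attention values are made up.

```python
def replace_unk(pred_tokens, src_tokens, attention, unk="<unk>"):
    """Replace unknown tokens with the most-attended source token.

    attention[i][j] is the weight that target position i puts on
    source position j. Hypothetical helper for illustration only.
    """
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok == unk:
            weights = attention[i]
            # Index of the source token with the largest attention weight.
            j = max(range(len(src_tokens)), key=lambda k: weights[k])
            out.append(src_tokens[j])
        else:
            out.append(tok)
    return out


src = ["i", "live", "in", "grenoble"]
pred = ["j'", "habite", "à", "<unk>"]
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
result = replace_unk(pred, src, attn)
```

Copying the source token works well for names and numbers, which usually stay the same across languages, and less well for ordinary words that need translating.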

That's it!

Hello Rajat,
I am interested in NMT and wanted to test it using your code, but I am stuck in 2 places:

  1. In OpenNMT-py/tools I cannot find any file tokenizer.perl to be called by perl ./tools/tokenizer.perl
  2. In OpenNMT-py I cannot find, is it rather ?

Thanks for your answer,
Philippe Mercier

