@R1j1t
Created October 22, 2018 17:42
This gist holds the step-by-step instructions for building a translator once you have a parallel corpus.

Run the following commands in your terminal:

## To download the OpenNMT library
git clone https://github.com/OpenNMT/OpenNMT-py
cd OpenNMT-py
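
## NOTE: This gist predates OpenNMT-py 2.x; recent checkouts no longer ship
## preprocess.py or tools/tokenizer.perl. If those files are missing, check out
## a revision from around the gist's date (the exact commit is an assumption;
## any 2018-era revision should work)
git checkout $(git rev-list -n 1 --before="2018-10-23" master)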

## Installing the requirements
pip install -r requirements.txt

## Downloading the data (if wget is not available, download the file in a browser via the link)
wget http://www.statmt.org/europarl/v7/fr-en.tgz

## Extracting the files
mkdir -p ./data/fr-en
tar -xf fr-en.tgz -C data/fr-en/
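
## (Optional sanity check) The tarball should extract to two parallel files;
## the names below follow the standard Europarl v7 naming (an assumption).
## Both sides of a parallel corpus must report the same line count
wc -l data/fr-en/europarl-v7.fr-en.en data/fr-en/europarl-v7.fr-en.fr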

## Clone this repository to get the Python script for creating the train, test, and val splits
git clone https://github.com/R1j1t/NMT-with-OpenNMT-Py.git
cd NMT-with-OpenNMT-Py

## This will split the dataset into 3 divisions: train, validation, and test
python3 dataset_split.py

## Moving back to the main folder
cd ..
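
## For reference, a 3-way split of a parallel corpus just slices the two sides
## identically. A rough shell sketch of such a split is shown below (commented
## out); the 2000-line val/test sizes are arbitrary examples, not necessarily
## what dataset_split.py uses, and the input names assume the Europarl v7 files
# head -n -4000 data/fr-en/europarl-v7.fr-en.en > data/fr-en/train.en   # all but the last 4000 lines (GNU head)
# tail -n 4000 data/fr-en/europarl-v7.fr-en.en | head -n 2000 > data/fr-en/val.en
# tail -n 2000 data/fr-en/europarl-v7.fr-en.en > data/fr-en/test.en
# (repeat the same three commands with .fr in place of .en)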

## Install `perl` if you do not already have it
## Run the default tokenizer script to tokenize the dataset
## You can use your own tokenizer, but remember to use the same tokenizer in production
## NOTE: This will take 5-10 minutes because of the size of the dataset
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/train.en > ./data/fr-en/train.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/train.fr > ./data/fr-en/train.fr.atok
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/val.en > ./data/fr-en/val.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/val.fr > ./data/fr-en/val.fr.atok
perl ./tools/tokenizer.perl -a -no-escape -l en < ./data/fr-en/test.en > ./data/fr-en/test.en.atok
perl ./tools/tokenizer.perl -a -no-escape -l fr < ./data/fr-en/test.fr > ./data/fr-en/test.fr.atok
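
## (Optional) To see what the tokenizer does, pipe a sample sentence through it;
## punctuation and clitics become separate tokens, so the line below should
## print roughly: Let 's resume the session .
echo "Let's resume the session." | perl ./tools/tokenizer.perl -a -no-escape -l en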

## Creating the vocab using the OpenNMT-py preprocessing
## NOTE: This will take 5-10 minutes because of the size of the dataset
python3 preprocess.py -train_src data/fr-en/train.en.atok -train_tgt data/fr-en/train.fr.atok -valid_src data/fr-en/val.en.atok -valid_tgt data/fr-en/val.fr.atok -save_data data/fr-en/fr-en.atok.low -lower
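
## (Optional) Preprocessing serializes the data and vocab next to the -save_data
## prefix; the exact file names depend on the OpenNMT-py version (typically
## something like fr-en.atok.low.train.pt, .valid.pt and .vocab.pt)
ls data/fr-en/fr-en.atok.low*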

## Training the model
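
## The checkpoint directory below is assumed not to exist yet; create it first
## so that saving model snapshots does not fail
mkdir -p data/fr-en/ckt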

## ====   If using GPU    ==== ##
python3 train.py -data data/fr-en/fr-en.atok.low -save_model data/fr-en/ckt/fr-en_model -gpu_ranks 0 -enc_layers 2 -dec_layers 2 -optim adam -learning_rate 0.001 -learning_rate_decay 1 -train_steps 10000

## ====   Only CPU    ==== ##
python3 train.py -data data/fr-en/fr-en.atok.low -save_model data/fr-en/ckt/fr-en_model -enc_layers 2 -dec_layers 2 -optim adam -learning_rate 0.001 -learning_rate_decay 1 -train_steps 10000
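
## (Optional) Training writes periodic checkpoints into data/fr-en/ckt/; list
## them to pick one for translation (the naming pattern, e.g. the *_e13.pt
## wildcard used below, depends on the OpenNMT-py version)
ls data/fr-en/ckt/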

## Translating with the trained model (drop `-gpu 0` to run on CPU)
python3 translate.py -gpu 0 -model data/fr-en/ckt/fr-en_model_*_e13.pt -src data/fr-en/test.en.atok -tgt data/fr-en/test.fr.atok -replace_unk -verbose -output multi30k.test.pred.atok
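
## (Optional) Score the predictions with BLEU, assuming your checkout still
## ships the Moses multi-bleu.perl script under tools/ (newer checkouts may not)
perl ./tools/multi-bleu.perl data/fr-en/test.fr.atok < multi30k.test.pred.atok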

That's it!

@PhilippeRossi

Hello Rajat,
I am interested in NMT and wanted to test it using your code, but I am stuck in 2 places:

  1. In OpenNMT-py/tools I cannot find any tokenizer.perl file to be called by perl ./tools/tokenizer.perl
  2. In OpenNMT-py I cannot find preprocess.py; is it now build_vocab.py?

Thanks for your answer
Regards
Philippe Mercier
