@putheakhem
Last active October 27, 2021 02:53
Getting started with Machine Translation

Khmer MT

Moses is a statistical machine translation system that lets you automatically train translation models.

Before installing Moses, install the following packages:

  sudo apt-get install g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4

Create a directory to hold all work related to machine translation.

  mkdir ~/MT

Installing Boost

cd ~/MT
wget https://dl.bintray.com/boostorg/release/1.64.0/source/boost_1_64_0.tar.gz
tar zxvf boost_1_64_0.tar.gz
cd boost_1_64_0/
./bootstrap.sh
./b2 -j5 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE

Installing Moses

Clone the Moses decoder from GitHub into ~/MT:

cd ~/MT
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder/

  1. Install Moses

  ./bjam -j5

To list the build options available with bjam, run it with --help. To build against the Boost copy compiled above instead of the system packages, pass its path with --with-boost:

  ./bjam --help
  ./bjam --with-boost=$HOME/MT/boost_1_64_0 -j5

  2. Install GIZA++

Clone giza-pp next to mosesdecoder in ~/MT and build it:

  cd ~/MT
  git clone https://github.com/moses-smt/giza-pp.git
  cd giza-pp
  make

Navigate into the mosesdecoder directory and create a tools folder in it:

  cd ~/MT/mosesdecoder
  mkdir tools

  3. Copy components to the tools folder

  cp ../giza-pp/GIZA++-v2/GIZA++ ../giza-pp/GIZA++-v2/snt2cooc.out ../giza-pp/mkcls-v2/mkcls tools/
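
Before moving on, it can help to confirm that the three binaries actually landed in tools/, since training fails later if any of them is missing. The following is a minimal sketch of such a check. The function name check_giza_tools is hypothetical and not part of Moses or GIZA++.

```shell
# Hypothetical helper (not part of Moses): report whether the three
# GIZA++ binaries copied above exist and are executable in a directory.
check_giza_tools() {
  local tools_dir="$1"
  local missing=0
  local name
  for name in GIZA++ snt2cooc.out mkcls; do
    if [ ! -x "$tools_dir/$name" ]; then
      echo "missing or not executable: $tools_dir/$name" >&2
      missing=1
    fi
  done
  # Exit status 0 means all three binaries were found.
  return "$missing"
}
```

For example: check_giza_tools ~/MT/mosesdecoder/tools && echo "tools OK"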

Installing SRILM

TODO

Training the Translation System

Make a new directory corpus in the main folder ~/MT/.

Make a new directory training inside the folder ~/MT/corpus/.

Add the parallel data to ~/MT/corpus/training/, for example data.en and data.kh.
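
The three steps above amount to one mkdir -p and two copies. Here is a minimal sketch, assuming the parallel files are stored as data.en and data.kh as in the example. The function name setup_corpus and its arguments are hypothetical.

```shell
# Hypothetical helper: create <root>/corpus/training and copy a parallel
# file pair into it as data.en and data.kh (the names used in this guide).
setup_corpus() {
  local root="$1"
  local src_en="$2"
  local src_kh="$3"
  # -p creates corpus/ and corpus/training/ in one call and is a no-op
  # if they already exist.
  mkdir -p "$root/corpus/training" || return 1
  cp "$src_en" "$root/corpus/training/data.en" || return 1
  cp "$src_kh" "$root/corpus/training/data.kh"
}
```

A call would look like setup_corpus ~/MT /path/to/your.en /path/to/your.kh, where both paths are placeholders for your own files.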

Pre-Process Corpora

Tokenization

Navigate into the corpus folder: cd ~/MT/corpus/

  1. English tokenization
  ../mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/data.en > data.tok.en
  2. Khmer tokenization (the Khmer file is copied unchanged)
  cp training/data.kh ~/MT/corpus/data.tok.kh

Create Truecase model

Navigate into the corpus folder: cd ~/MT/corpus/

  1. English truecase model
  ../mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus data.tok.en
  2. Khmer truecase model (skip: Khmer script has no letter case)

Truecasing

Navigate into the corpus folder: cd ~/MT/corpus/

  1. Truecasing English
  ../mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < data.tok.en > data.true.en
  2. Truecasing Khmer (copied unchanged)
  cp data.tok.kh data.true.kh
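
The later steps treat line N of the English file and line N of the Khmer file as one translation pair, so both files need the same number of lines after preprocessing. Here is a minimal sketch of that check. The name same_line_count is a hypothetical helper, not a Moses script.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
  local n_a n_b
  n_a=$(wc -l < "$1") || return 2
  n_b=$(wc -l < "$2") || return 2
  if [ "$n_a" -ne "$n_b" ]; then
    echo "line count mismatch: $1 has $n_a, $2 has $n_b" >&2
    return 1
  fi
}
```

For example: same_line_count data.true.en data.true.kh && echo "files are aligned"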

Cleaning of English and Khmer

Drop empty sentence pairs and pairs in which either side is longer than 80 tokens:

  ../mosesdecoder/scripts/training/clean-corpus-n.perl data.true en kh data.clean 1 80
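
To estimate how many pairs fall inside the 1 to 80 token limits before running the script, something like the following can be used. It is only an approximation: it counts whitespace-separated tokens, assumes the files contain no tab characters, and ignores any other checks clean-corpus-n.perl performs. The name count_pairs_in_range is hypothetical.

```shell
# Hypothetical helper: count sentence pairs where BOTH sides have between
# min and max whitespace-separated tokens. Approximates only the length
# limits passed to clean-corpus-n.perl, not the script's full behavior.
count_pairs_in_range() {
  local src="$1"
  local tgt="$2"
  local min="$3"
  local max="$4"
  # paste joins line N of each file with a tab, so each pair sits on one line.
  paste "$src" "$tgt" | awk -F'\t' -v min="$min" -v max="$max" '
    {
      n_src = split($1, a, " ")   # " " means: split on runs of whitespace
      n_tgt = split($2, b, " ")
      if (n_src >= min && n_src <= max && n_tgt >= min && n_tgt <= max) kept++
    }
    END { print kept + 0 }'
}
```

For example, count_pairs_in_range data.true.en data.true.kh 1 80 prints the number of pairs within the limits, which can be compared against wc -l data.true.en.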
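Training

Navigate into ~/MT and create a new folder model1. The -lm argument below (factor 0, order 3, type 8 for KenLM) expects a binarized Khmer language model at ~/MT/lm/data.blm.kh, so that file must exist before training starts.

  cd ~/MT/
  mkdir model1
  cd model1
  ../mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ../corpus/data.clean -f en -e kh -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/MT/lm/data.blm.kh:8 -external-bin-dir ../mosesdecoder/tools >& training.out &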
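
Because the command runs in the background and writes its log to training.out, progress can be followed with tail -f training.out. When training completes, Moses writes its configuration to train/model/moses.ini under the -root-dir given above, so the presence of that file is a reasonable signal that the run finished. Here is a minimal sketch of such a check. The name training_finished is a hypothetical helper.

```shell
# Hypothetical helper: succeed once <model_dir>/train/model/moses.ini exists
# and is non-empty.
training_finished() {
  local model_dir="$1"
  [ -s "$model_dir/train/model/moses.ini" ]
}
```

For example: training_finished ~/MT/model1 && echo "moses.ini is ready"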
  