Getting started with Machine Translation

Khmer MT

Moses is a statistical machine translation system that lets you train translation models automatically from parallel data.

Before installing Moses, install the following packages

  sudo apt-get install g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4

Make a directory to hold all machine translation work.

  mkdir ~/MT

Installing Boost

  cd ~/MT
  wget https://dl.bintray.com/boostorg/release/1.64.0/source/boost_1_64_0.tar.gz
  tar zxvf boost_1_64_0.tar.gz
  cd boost_1_64_0/
  ./bootstrap.sh
  ./b2 -j5 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE

Installing Moses

Clone the Moses decoder from GitHub into the directory ~/MT

  cd ~/MT
  git clone https://github.com/moses-smt/mosesdecoder.git
  cd mosesdecoder/

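Build Moses

  ./bjam -j5

To list the options available with bjam, or to build against the Boost compiled above instead of the system one, run

  ./bjam --help
  ./bjam --with-boost=$HOME/MT/boost_1_64_0 -j5

$HOME is used here instead of ~ because the shell does not expand ~ in the middle of an argument.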

Install GIZA++

Download GIZA++ into ~/MT and build it

  cd ~/MT
  git clone https://github.com/moses-smt/giza-pp.git
  cd giza-pp
  make

Navigate into the mosesdecoder directory and create a tools folder

  cd ~/MT/mosesdecoder
  mkdir tools

Copy the GIZA++ components to the tools folder

  cp ../giza-pp/GIZA++-v2/GIZA++ ../giza-pp/GIZA++-v2/snt2cooc.out ../giza-pp/mkcls-v2/mkcls tools/
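
As a quick sanity check after copying, something like the following can confirm that the three binaries are in place and executable. This is only a sketch: the function name check_giza_tools and the MOSES_TOOLS variable are made up for illustration, and the default path assumes the ~/MT layout used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Moses): report which GIZA++ binaries
# are missing from a tools folder. Returns the number of missing files.
check_giza_tools() {
  local dir="$1" missing=0 tool
  for tool in GIZA++ snt2cooc.out mkcls; do
    if [ -x "$dir/$tool" ]; then
      echo "ok:      $dir/$tool"
    else
      echo "missing: $dir/$tool"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# MOSES_TOOLS is an assumed variable; it defaults to the folder created above.
check_giza_tools "${MOSES_TOOLS:-$HOME/MT/mosesdecoder/tools}" \
  || echo "some tools are missing; re-run make in giza-pp and copy again"
```

If any file is reported missing, the training step later on will fail when it looks in the -external-bin-dir folder.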

Installing SRILM

TODO

Training the Translation System

Make a new directory corpus in the main folder ~/MT/

Make a new directory training inside the folder ~/MT/corpus/

Add the parallel data to ~/MT/corpus/training/. Example: data.en, data.kh
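
The steps above can be sketched as shell commands. The check_parallel function is a hypothetical helper, not part of Moses, and MT_ROOT is an assumed variable. The helper only verifies that both sides have the same number of lines, which a sentence-aligned corpus needs.

```shell
#!/usr/bin/env bash
# MT_ROOT is an assumed variable; it defaults to the ~/MT folder used in this guide.
MT_ROOT="${MT_ROOT:-$HOME/MT}"

# Create ~/MT/corpus and ~/MT/corpus/training in one step.
mkdir -p "$MT_ROOT/corpus/training"

# Hypothetical helper: succeed only if both files exist and have the same line count.
check_parallel() {
  local src="$1" tgt="$2" n_src n_tgt
  [ -f "$src" ] && [ -f "$tgt" ] || { echo "missing: $src or $tgt"; return 2; }
  n_src=$(( $(wc -l < "$src") ))
  n_tgt=$(( $(wc -l < "$tgt") ))
  if [ "$n_src" -eq "$n_tgt" ]; then
    echo "aligned: $n_src lines"
  else
    echo "mismatch: $src has $n_src lines, $tgt has $n_tgt"
    return 1
  fi
}

check_parallel "$MT_ROOT/corpus/training/data.en" "$MT_ROOT/corpus/training/data.kh" \
  || echo "copy data.en and data.kh into $MT_ROOT/corpus/training first"
```

A mismatch usually means a sentence was split or dropped on one side, so it is worth fixing before tokenization.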

Pre-Process Corpora

Tokenization

Navigate into corpus folder : cd ~/MT/corpus/

English Tokenization

  ../mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/data.en > data.tok.en

Khmer Tokenization (the file is copied unchanged)

  cp training/data.kh ~/MT/corpus/data.tok.kh

Create Truecase model

Navigate into corpus folder : cd ~/MT/corpus/

English Truecase model

 ../mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus data.tok.en

Khmer Truecase model (Skip)

Truecasing

Navigate into corpus folder : cd ~/MT/corpus/

Truecasing English

  ../mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < data.tok.en > data.true.en

Truecasing Khmer

  cp data.tok.kh data.true.kh

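Cleaning of English and Khmer

This drops empty lines and sentence pairs in which either side is shorter than 1 or longer than 80 tokens, and writes data.clean.en and data.clean.kh.

  ../mosesdecoder/scripts/training/clean-corpus-n.perl data.true en kh data.clean 1 80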
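
To see how many sentence pairs the cleaning step removed, the line counts before and after can be compared. This is a minimal sketch that assumes it is run from ~/MT/corpus; count_dropped is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many lines were removed between two files.
count_dropped() {
  local before="$1" after="$2"
  echo $(( $(wc -l < "$before") - $(wc -l < "$after") ))
}

# Run from ~/MT/corpus once clean-corpus-n.perl has finished.
if [ -f data.true.en ] && [ -f data.clean.en ]; then
  echo "sentence pairs dropped: $(count_dropped data.true.en data.clean.en)"
fi
```

A large number here may indicate misaligned input or very long lines in the source data.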

Training

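Navigate into ~/MT and create a new folder model1

  cd ~/MT/
  mkdir model1
  cd model1

Run the training script in the background; its output is written to training.out. The -lm option points at a language model in ~/MT/lm/data.blm.kh, which must already exist.

  ../mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ../corpus/data.clean -f en -e kh -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/MT/lm/data.blm.kh:8 -external-bin-dir ../mosesdecoder/tools >& training.out &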
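
Because training runs in the background, it helps to have a way to tell whether it has finished. The sketch below relies on train-model.perl writing model/moses.ini under the -root-dir folder as its last step; training_finished is a hypothetical helper name, and the script assumes it is run from ~/MT/model1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed once the training run has written its moses.ini.
training_finished() {
  local root="$1"
  [ -f "$root/model/moses.ini" ]
}

# "train" matches the -root-dir value used in the training command.
if training_finished train; then
  echo "training finished: train/model/moses.ini"
else
  echo "not finished yet; last lines of the log:"
  tail -n 5 training.out 2>/dev/null || true
fi
```

While the job is still running, tail -f training.out follows the log as it grows.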

putheakhem/en-kh.md
