Moses is a statistical machine translation system that allows you to automatically train
machine translation models. Start by installing the build dependencies:
sudo apt-get install g++ git subversion automake libtool zlib1g-dev libboost-all-dev libbz2-dev liblzma-dev python-dev libtcmalloc-minimal4
Create a directory to hold all work related to machine translation.
mkdir ~/MT
cd ~/MT
wget https://dl.bintray.com/boostorg/release/1.64.0/source/boost_1_64_0.tar.gz
tar zxvf boost_1_64_0.tar.gz
cd boost_1_64_0/
./bootstrap.sh
./b2 -j5 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE
Return to ~/MT and clone the Moses decoder from GitHub:
cd ~/MT
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder/
- Install Moses
./bjam -j5
If Moses built successfully, you can list the options available with bjam:
./bjam --help
To build against the Boost copy compiled above, point bjam at it explicitly:
./bjam --with-boost=$HOME/MT/boost_1_64_0 -j5
- Install GIZA++
Return to ~/MT, then download and build GIZA++:
cd ~/MT
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
Navigate into the mosesdecoder directory and create a tools folder in it:
cd ~/MT/mosesdecoder
mkdir tools
- Copy the GIZA++ components to the tools folder
cp ../giza-pp/GIZA++-v2/GIZA++ ../giza-pp/GIZA++-v2/snt2cooc.out ../giza-pp/mkcls-v2/mkcls tools/
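Training later fails if any of the three binaries is missing from the tools folder, so it can help to verify the copy right away. The following is only a minimal sketch; the function name check_giza_tools is hypothetical and not part of Moses or GIZA++.

```shell
# Hypothetical helper (not part of Moses): report any GIZA++ binary
# that is missing or not executable in the folder given as $1.
check_giza_tools() {
    missing=0
    for bin in GIZA++ snt2cooc.out mkcls; do
        if [ ! -x "$1/$bin" ]; then
            echo "missing or not executable: $1/$bin" >&2
            missing=1
        fi
    done
    # Exit status 0 when all three binaries are present, 1 otherwise.
    return "$missing"
}
```

With the layout used here it would be called as check_giza_tools ~/MT/mosesdecoder/tools.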
Installing SRILM
TODO
Make a new directory corpus in the main folder ~/MT/.
Make a new directory training inside the folder ~/MT/corpus/.
Add the parallel data to ~/MT/corpus/training/, for example data.en and data.kh.
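The two files of a parallel corpus are expected to be aligned line by line, so comparing their line counts before going further can catch a misaligned corpus early. The sketch below is one way to do that; the function name check_parallel is hypothetical and not part of Moses.

```shell
# Hypothetical helper (not part of Moses): succeed only if the two
# files given as $1 and $2 contain the same number of lines.
check_parallel() {
    # tr strips the padding some wc implementations add around the count.
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    if [ "$src_lines" -eq "$tgt_lines" ]; then
        echo "OK: $src_lines lines in each file"
    else
        echo "MISMATCH: $1 has $src_lines lines, $2 has $tgt_lines lines" >&2
        return 1
    fi
}
```

For the example files it would be called as check_parallel ~/MT/corpus/training/data.en ~/MT/corpus/training/data.kh. Equal counts do not prove the sentences are correctly paired, but unequal counts always indicate a problem.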
Navigate into the corpus folder:
cd ~/MT/corpus/
- English Tokenization
../mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/data.en > data.tok.en
- Khmer Tokenization
cp training/data.kh ~/MT/corpus/data.tok.kh
Navigate into the corpus folder:
cd ~/MT/corpus/
- English Truecase model
../mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus data.tok.en
- Khmer Truecase model (Skip)
Navigate into the corpus folder:
cd ~/MT/corpus/
- Truecasing English
../mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < data.tok.en > data.true.en
- Truecasing Khmer
cp data.tok.kh data.true.kh
- Cleaning
../mosesdecoder/scripts/training/clean-corpus-n.perl data.true en kh data.clean 1 80
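The last two arguments, 1 and 80, are the minimum and maximum sentence lengths in tokens. To illustrate the idea only, the sketch below keeps a sentence pair when both sides fall within those limits. It is not the Moses implementation, which applies further checks that are not reproduced here; the function name filter_by_length and the .src/.tgt output names are hypothetical.

```shell
# Hypothetical illustration of length-based cleaning; not the Moses script.
# Usage: filter_by_length SRC TGT OUT_PREFIX MIN MAX
# Writes the surviving pairs to OUT_PREFIX.src and OUT_PREFIX.tgt.
# Assumes sentences contain no tab characters, since paste joins on tabs.
filter_by_length() {
    paste "$1" "$2" | awk -F'\t' -v min="$4" -v max="$5" -v out="$3" '
        {
            ns = split($1, s, " ")   # token count on the source side
            nt = split($2, t, " ")   # token count on the target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) {
                print $1 > (out ".src")
                print $2 > (out ".tgt")
            }
        }'
}
```

Both output files stay aligned because a pair is either written to both or dropped from both. If no pair survives, no output files are created.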
Navigate into ~/MT and create a new folder model1:
cd ~/MT/
mkdir model1
cd model1
../mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ../corpus/data.clean -f en -e kh -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/MT/lm/data.blm.kh:8 -external-bin-dir ../mosesdecoder/tools >& training.out &
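Because the command runs in the background and sends its output to training.out, it may be convenient to check on it from the same directory. A minimal sketch follows; it assumes training writes its final configuration to model/moses.ini under the -root-dir given above, and the helper name training_status is hypothetical.

```shell
# Hypothetical helper (not part of Moses): report whether training under
# the root directory given as $1 has produced model/moses.ini yet.
training_status() {
    if [ -f "$1/model/moses.ini" ]; then
        echo "finished: $1/model/moses.ini exists"
    else
        echo "not finished: $1/model/moses.ini not found yet"
        return 1
    fi
}
```

From ~/MT/model1 it would be called as training_status train, and tail -n 20 training.out shows the most recent log lines while the job is still running.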