Last active
August 29, 2015 14:19
A Japanese-language summary of Kaldi processing (data preparation part) 2 ref: http://qiita.com/GushiSnow/items/a24cad7231de341738ee
# Output sil only
utils/make_lexicon_fst_silprob.pl $tmpdir/lexiconp_silprob_disambig.txt $srcdir/silprob.txt $silphone '#'$ndisambig |
# Substitution: replace each disambiguation symbol (#N) with <eps>
sed 's=\#[0-9][0-9]*=<eps>=g' |
# Build a weighted finite-state transducer with phones as input and words as output
fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \
  --keep_isymbols=false --keep_osymbols=false |
# 14: Sort the weighted finite-state transducer; an example is shown below
fstarcsort --sort_type=olabel > $dir/L.fst || exit 1
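To see what the sed step does in isolation, the following minimal sketch runs the same substitution on a few hand-written arcs in OpenFst text format. The arcs themselves (state numbers, the phone `k`, the word `ka`) are made up for illustration and do not come from a real lexicon.

```shell
# Hypothetical text-format FST arcs: source state, destination state, input label, output label
sample='0 1 k ka
1 2 #1 <eps>
2 0 #12 <eps>'

# Same substitution as in the pipeline: every disambiguation symbol (#N) becomes <eps>
result=$(printf '%s\n' "$sample" | sed 's=\#[0-9][0-9]*=<eps>=g')
printf '%s\n' "$result"
```

Ordinary phone labels pass through unchanged; only labels of the form `#` followed by digits are rewritten.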
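# Print L.fst in text form.
# --isymbols: phone symbol file; --osymbols: word symbol file
# First positional argument: the FST file; second: the output text file
fstprint --isymbols=./data/lang/phones.txt --osymbols=./data/lang/words.txt ../../../test_japanese/data/lang_test_tg/L.fst test.txt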
# Render the Graphviz file test.dot as a JPEG, then display it
dot -Tjpg test.dot > test.jpg
xli test.jpg
# Convert the ARPA language model to G.fst.
# The grep -v steps drop n-grams with <s> or </s> in invalid positions.
cat $lmdir/lm.arpa | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl $tmpdir/oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
    --osymbols=$test/words.txt --keep_isymbols=false --keep_osymbols=false | \
  fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
# Check whether G.fst is stochastic
fstisstochastic $test/G.fst
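The three `grep -v` filters at the top of this pipeline can be tried without Kaldi or OpenFst installed. The sketch below applies them to a handful of invented bigram lines; the probabilities and the word `hello` are assumptions for illustration, and a real ARPA file separates the probability from the n-gram with a tab.

```shell
# Hypothetical bigram entries: log10 probability, then the n-gram
arpa_lines='-0.5 <s> hello
-1.2 <s> <s>
-0.9 hello </s>
-2.0 </s> <s>
-2.1 </s> </s>'

# Same filters as in the pipeline: drop n-grams with misplaced sentence markers
kept=$(printf '%s\n' "$arpa_lines" | \
  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>')
printf '%s\n' "$kept"
```

Only the entries where `<s>` starts and `</s>` ends a word sequence survive, which is what `arpa2fst` then receives.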
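# Diagnostic: check that G.fst has no cycles made up only of empty words
# (lexicon entries with no pronunciation, plus #0).
# Build a one-state acceptor with a self-loop for every empty word.
awk '{if(NF==1){ printf("0 0 %s %s\n", $1,$1); }} END{print "0 0 #0 #0"; print "0";}' \
  < "$lexicon" >$tmpdir/g/select_empty.fst.txt
# Compose it with G.fst; only paths through empty words remain
fstcompile --isymbols=$test/words.txt --osymbols=$test/words.txt \
  $tmpdir/g/select_empty.fst.txt | \
  fstarcsort --sort_type=olabel | fstcompose - $test/G.fst > $tmpdir/g/empty_words.fst
# If the composed FST is cyclic, report the error and stop
fstinfo $tmpdir/g/empty_words.fst | grep cyclic | grep -w 'y' &&
  echo "Language model has cycles with empty words" && exit 1
rm -rf $tmpdir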
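The awk step is easier to follow on a small input. This sketch feeds it a three-line lexicon invented for illustration (the word `hello` and its phones are assumptions) and prints the text-format acceptor it generates.

```shell
# Hypothetical lexicon: a word followed by its phones; <s> and </s> have no pronunciation
lexicon_sample='hello h e l o
<s>
</s>'

# Same awk program as in the script: one self-loop per single-field line, plus #0
fst_text=$(printf '%s\n' "$lexicon_sample" | \
  awk '{if(NF==1){ printf("0 0 %s %s\n", $1,$1); }} END{print "0 0 #0 #0"; print "0";}')
printf '%s\n' "$fst_text"
```

Each line with a single field becomes a self-loop on state 0, `#0` is added as one more empty word, and the final line `0` marks state 0 as final. Words with a pronunciation, such as `hello` here, produce no arc.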