Analysis of the Merlin source code


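Acoustic feature extraction

This article describes how to extract the acoustic features used for Merlin training. In speech synthesis, this step belongs to the vocoder.

Merlin can use either of two vocoders, STRAIGHT or WORLD. With WORLD, the goal is to extract 60-dim MGC, variable-dim BAP (BAP dim: 1 for 16 kHz, 5 for 48 kHz), and 1-dim LF0. With STRAIGHT, the goal is to extract 60-dim MGC, 25-dim BAP, and 1-dim LF0.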
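The WORLD dimensions quoted above can be summarised in a small lookup. This is only a sketch, not code from Merlin: the function name world_feature_dims is hypothetical, and only the two sampling rates mentioned in the text are covered.

```python
def world_feature_dims(sample_rate_hz):
    """Return the WORLD feature dimensions described in the text.

    Hypothetical helper: only 16 kHz and 48 kHz are stated, so any
    other sampling rate raises an error instead of guessing a value.
    """
    bap_dims = {16000: 1, 48000: 5}
    if sample_rate_hz not in bap_dims:
        raise ValueError("BAP dimension not stated for %d Hz" % sample_rate_hz)
    return {"mgc": 60, "bap": bap_dims[sample_rate_hz], "lf0": 1}


print(world_feature_dims(16000))
print(world_feature_dims(48000))
```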

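The newer WORLD_v2 is still under development; its goal is to extract 60-dim MGC, 5-dim BAP, and 1-dim LF0 (the MGC and BAP dimensions can be adjusted).

Because the use of STRAIGHT is subject to strict license restrictions, this article mainly covers WORLD.

Code

The code is located at: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/vocoder/world/extract_features_for_merlin.py

This script mainly calls WORLD's analysis tool and SPTK's x2x tool.

Input

Four arguments must be given when calling the script, as shown below: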

python extract_features_for_merlin.py <path_to_merlin_dir> <path_to_wav_dir> <path_to_feat_dir> <sampling frequency>

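<path_to_merlin_dir>    Merlin installation path, from which the WORLD and SPTK paths are located
<path_to_wav_dir>       Directory containing the original audio files
<path_to_feat_dir>      Directory where the extracted features are saved
<sampling frequency>    Sampling rate

Output

Three directories are created under <path_to_feat_dir>:

|-- bap
|-- lf0
`-- mgc

Each one holds a different type of feature.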
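To make the calling convention concrete, the sketch below assembles the command line and lists the output directories to expect. It assumes that <path_to_merlin_dir> is the root of a Merlin checkout, so that the script sits under misc/scripts/vocoder/world as in the URL above. The helper names are hypothetical, and nothing is executed here.

```python
import os


def build_extract_command(merlin_dir, wav_dir, feat_dir, sample_rate_hz):
    """Build the argument list for extract_features_for_merlin.py.

    Assumption: merlin_dir is the repository root, so the script path
    mirrors the GitHub URL given in the text.
    """
    script = os.path.join(merlin_dir, "misc", "scripts", "vocoder", "world",
                          "extract_features_for_merlin.py")
    return ["python", script, merlin_dir, wav_dir, feat_dir, str(sample_rate_hz)]


def expected_feature_dirs(feat_dir):
    """Return the three feature directories created under feat_dir."""
    return [os.path.join(feat_dir, name) for name in ("bap", "lf0", "mgc")]


print(" ".join(build_extract_command("/opt/merlin", "wav", "feats", 16000)))
print(expected_feature_dirs("feats"))
```

The resulting list could be passed to subprocess.run once the paths point at a real installation.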


forced_alignment.md

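Forced alignment

An earlier article used festival to extract text labels: after festival -b <scm_file>, dumpfeats, normalization, and related steps, it produced normalized full context labels. This article describes how to use the HTK tools to align the full context labels with the wav files.

Note: Merlin provides alignment at two levels, state and phone. Because state-level alignment performs better, this article only covers state-level alignment.

A first look

The alignment scripts are located at: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align

The directory structure is as follows: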

├── binary_io.py
├── forced_alignment.py
├── htk_io.py
├── htkmfc.py
├── mean_variance_norm.py
├── prepare_labels_from_txt.sh
├── README.md
├── run_aligner.sh
└── setup.sh

It is run as follows:

python $aligner/forced_alignment.py

It takes no arguments. To change its settings, edit the script with sed, for example:

sed -i s#'HTKDIR =.*'#'HTKDIR = "'$HTKDIR'"'# $aligner/forced_alignment.py                       # HTK directory
sed -i s#'work_dir =.*'#'work_dir = "'$WorkDir/$lab_dir'"'# $aligner/forced_alignment.py         # working directory; it contains a subdirectory "label_no_align" holding the unaligned labels
sed -i s#'wav_dir =.*'#'wav_dir = "'$wav_dir'"'# $aligner/forced_alignment.py                    # directory containing the audio files
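The same in-place substitution can be sketched in Python, which may be easier to follow than the sed quoting. The function name set_assignment is hypothetical, and it operates on a string rather than on the file, so nothing is modified on disk.

```python
import re


def set_assignment(source, name, value):
    """Replace every `name =...` (to end of line) with `name = "value"`.

    Mirrors the pattern used by the sed commands: the match starts at
    `name =` and runs to the end of that line.
    """
    pattern = re.compile(r"%s =.*" % re.escape(name))
    replacement = '%s = "%s"' % (name, value)
    # A function replacement avoids backslash processing in the value.
    return pattern.sub(lambda _match: replacement, source)


script_text = 'HTKDIR = path/to/tools/htk\nwav_dir = "old"\n'
script_text = set_assignment(script_text, "HTKDIR", "/opt/htk/bin")
script_text = set_assignment(script_text, "wav_dir", "/data/wav")
print(script_text)
```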

The unaligned label format is shown below; it contains no time information (time steps):

$ cat label_no_align/arctic_b0001.lab

x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1
x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
sil^g-ae+d=d@2_2/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
g^ae-d+d=uw@3_1/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
.................. (remainder omitted) ..................

Output:

The aligned labels are written to the <work_dir>/<lab_align_dir> directory.

$ cat label_state_align/arctic_b0001.lab

0 50000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[2]
50000 100000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[3]
100000 300000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[4]
300000 1450000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[5]
1450000 1750000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[6]
1750000 1800000 x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1[2]
1800000 1850000 x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1[3]
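.................. (remainder omitted) ..................

The complete contents of these two files can be viewed here: unaligned, aligned. The corresponding English text is: Gad, do I remember it.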
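Each state-aligned line consists of a start time, an end time, the full context label, and a state index in square brackets. HTK label times are expressed in units of 100 ns, so 50000 corresponds to 5 ms. A minimal parsing sketch follows; the names are hypothetical and this is not Merlin's own label reader.

```python
import re

HTK_UNITS_PER_MS = 10000  # HTK label times are in units of 100 ns

# start, end, label without spaces, then the state index in brackets
STATE_LINE = re.compile(r"^(\d+)\s+(\d+)\s+(\S+)\[(\d+)\]$")


def parse_state_line(line):
    """Split one state-aligned line into (start, end, label, state)."""
    match = STATE_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a state-aligned label line: %r" % line)
    start, end, label, state = match.groups()
    return int(start), int(end), label, int(state)


def duration_ms(start, end):
    """Convert an HTK time span to milliseconds."""
    return (end - start) / HTK_UNITS_PER_MS


line = ("1750000 1800000 x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3"
        "!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1"
        "/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1[2]")
start, end, label, state = parse_state_line(line)
print(state, duration_ms(start, end))
```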

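How it works

The alignment above relies on tools provided by HTK: HCompV, HCopy, HERest, HHEd, and HVite.

They are used in the order HCopy -> HCompV -> HERest -> HHEd -> HVite. Each tool is briefly introduced below.

Tool	Description
HCopy	Parameterizes the data, i.e. extracts features: converts wav speech files into feature files containing sequences of feature vectors
HCompV	Initializes the model parameters
HERest	Model training (parameter estimation)
HHEd	Model definition editor
HVite	Decoding with the Viterbi algorithm

The aligned labels are later used to train the duration model and the acoustic model, as described in detail later.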


merlin_src_analysis.md

Merlin Source Code Analysis

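This article briefly analyzes parts of the Merlin source code, as a basis for studying Merlin in more depth.

genScmFile.py

Code path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/utils/genScmFile.py

It converts text files into a standard text format. The standard format consists of three kinds of files: utt files, a scheme file, and an scp file. utt is an empty directory reserved for later steps; the scheme file records the correspondence between each text and the utt file generated later; the scp file is a list of file names (without extensions).

Input

 <in_txt_dir/in_txt_file>    Directory containing the original texts (each file ends in .txt), or a single original text file
 <out_utt_dir>               Path where the utt files will later be generated
 <out_scm_file>              The generated scm file
 <out_file_id_list>          The generated scp file

Output

Creates the empty directory <out_utt_dir>.
Generates the scm file, with contents like the following: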

(utt.save (utt.synth (Utterance Text "Hello world." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_001.utt")
(utt.save (utt.synth (Utterance Text "Hi, this is a demo voice from Merlin." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_002.utt")
(utt.save (utt.synth (Utterance Text "Hope you guys enjoy free open-source voices from Merlin." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_003.utt")
(utt.save (utt.synth (Utterance Text "I love you China." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_004.utt")
(utt.save (utt.synth (Utterance Text "Are you OK?" )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_005.utt")
(utt.save (utt.synth (Utterance Text "I am comming from China." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_006.utt")

Generates the scp file list, as shown below:

test_001
test_002
test_003
test_004
test_005
test_006

<in_txt_dir/in_txt_file> can be either a directory of text files or a single file, whose format is as follows:

( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
( arctic_a0002 "Not at this particular case, Tom, apologized Whittemore." )
( arctic_a0003 "For the twentieth time that evening the two men shook hands." )
( arctic_a0004 "Lord, but I'm glad to see you again, Phil." )
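The conversion from the single-file format to scm lines can be sketched as follows. This is an illustration of the two formats shown above rather than the code of genScmFile.py; the function names are hypothetical, and double quotes inside the text are not escaped.

```python
import re

# Matches lines such as: ( arctic_a0001 "Some text." )
ENTRY = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)\s*$')


def parse_entry(line):
    """Split '( file_id "text" )' into (file_id, text)."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("unrecognised line: %r" % line)
    return match.group(1), match.group(2)


def make_scm_line(text, utt_path):
    """Format one Festival command in the layout of the scm listing."""
    return '(utt.save (utt.synth (Utterance Text "%s" )) "%s")' % (text, utt_path)


file_id, text = parse_entry('( arctic_a0003 "For the twentieth time that evening the two men shook hands." )')
print(file_id)
print(make_scm_line(text, "utt/%s.utt" % file_id))
```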

festival

festival -b <scheme_file>

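Purpose: calls festival to process the texts in batch mode (no interaction). <scheme_file> is the file produced in the previous step.

Result: generates the utt files, saved to the paths specified in <scheme_file>.

The festival front end analyzes the text. For example, processing the text Hello world. yields: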

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 44 ; type Text ; iform "\"Hello world.\"" ; 
Stream_Items
1 id _1 ; name Hello ; whitespace "" ; prepunctuation "" ; 
2 id _2 ; name world ; punc . ; whitespace " " ; prepunctuation "" ; 
............ (n lines omitted) ............
End_of_Relation
Relation US_map ; ()
1 43 0 0 0 0
End_of_Relation
Relation Wave ; ()
1 44 0 0 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

make_labels

Code path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/festival_utt_to_lab/make_labels

Purpose

Extracts monophone labels and full context labels from the utt files.

Usage

make_labels <labels_dir> <utts_dir> <dumpfeats> <scripts>

<labels_dir>      ## Directory for the newly generated labels
<utts_dir>        ## Directory containing the utt files
<dumpfeats>       ## Path to Festival's dumpfeats script, available once festival is installed; typically {FESTDIR}/examples/dumpfeats
<scripts>         ## Directory containing the following scripts: extra_feats.scm label.feats label-full.awk  label-mono.awk

Execution flow

Create two subdirectories, mono and full, in the <labels_dir> directory.

For each utt file in the <utts_dir> directory:

Get the base name via basename $utt .utt

Call dumpfeats to extract features:

dumpfeats	-eval		$scripts/extra_feats.scm \
		-relation 	Segment \
		-feats    	$scripts/label.feats \
		-output   	$labels/tmp \
		$utt

Write the results into the mono and full directories:

gawk -f $scripts/label-full.awk $labels/tmp > $labels/full/$base.lab; \
gawk -f $scripts/label-mono.awk $labels/tmp > $labels/mono/$base.lab; \

Remove the temporary file: rm -f tmp
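The per-utterance steps above can be written down as data in Python. The sketch only builds the command lines and output paths; it does not run dumpfeats or gawk, and make_labels_commands is a hypothetical name rather than part of Merlin.

```python
import os


def make_labels_commands(labels_dir, utt_path, dumpfeats, scripts_dir):
    """Return (base, dumpfeats command, {output .lab path: gawk command})."""
    # Same effect as: basename $utt .utt
    base = os.path.basename(utt_path)
    if base.endswith(".utt"):
        base = base[:-len(".utt")]

    tmp = os.path.join(labels_dir, "tmp")
    dump_cmd = [dumpfeats,
                "-eval", os.path.join(scripts_dir, "extra_feats.scm"),
                "-relation", "Segment",
                "-feats", os.path.join(scripts_dir, "label.feats"),
                "-output", tmp,
                utt_path]

    # One gawk call per label type; stdout would be redirected to the key.
    awk_cmds = {}
    for kind in ("full", "mono"):
        out_path = os.path.join(labels_dir, kind, base + ".lab")
        awk_cmds[out_path] = ["gawk", "-f",
                              os.path.join(scripts_dir, "label-%s.awk" % kind),
                              tmp]
    return base, dump_cmd, awk_cmds


base, dump_cmd, awk_cmds = make_labels_commands(
    "labels", "utts/test_001.utt", "dumpfeats", "scripts")
print(base)
print(dump_cmd)
print(sorted(awk_cmds))
```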

Explanation

dumpfeats is a tool provided by festival for extracting features from utt files. Its usage is:

Usage: dumpfeats [options] <utt_file_0> <utt_file_1> ...
  
  Dump features from a set of utterances
  
  Options
  -relation  <string>
             Relation from which the features have to be dumped from
  -output    <string>
             If output parameter contains a %s its treated as a skeleton
             e.g feats/%s.feats and multiple files will be created one
             each utterance.  If output doesn't contain %s the output
             is treated as a single file and all features and dumped in it.
  -feats     <string>
             If argument starts with a "(" it is treated as a list of
             features to dump, otherwise it is treated as a filename whose
             contents contain a set of features (without parenetheses).
  -eval      <ifile>
             A scheme file to be loaded before dumping.  This may contain
             dump specific features etc.  If filename starts with a left
             parenthis it it evaluated as lisp.
  -from_file <ifile>
             A file with a list of utterance files names (used when there
             are a very large number of files.

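gawk is a text-processing command more powerful than sed. The -f option specifies the program file:

-f file		 Specifies a filename to read the program from

The programs themselves are in $scripts/label-full.awk and $scripts/label-mono.awk.

Example

Taking the utt generated earlier from the text Hello world. as an example, we show what make_labels produces.

Current path: /root/workspace/merlin_projects/step_by_step. The files in this path are: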

root@de-3879-ng-2-123705-3173223045-0f7q9:~/workspace/merlin_projects/step_by_step# tree ./
|-- scm.scm
|-- test_001.utt
|-- test_002.utt
|-- test_003.utt
|-- test_004.utt
|-- test_005.utt
`-- test_006.utt

dumpfeats

dumpfeats=/root/workspace/Python_Programs/merlin/tools/festival/examples/dumpfeats
scripts=/root/workspace/Python_Programs/merlin/misc/scripts/frontend/festival_utt_to_lab
labels=.
utt=test_001.utt
$dumpfeats -eval $scripts/extra_feats.scm \
-relation Segment \
-feats $scripts/label.feats \
-output $labels/tmp \
$utt

After it runs, a new file named tmp is created with the following contents:

0 pau hh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 1 0 0 0 0 0 3 0 content 0 2 0 3 0 2 0 ax 0 0.22 
pau hh ax 0 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 0 0 0 0 3 0 content 0 2 0 3 0 2 0 l 0.22 0.27795401 
hh ax l 1 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 0 0 0 3 3 content content 2 2 3 3 2 2 pau ow 0.27795401 0.32017601 
ax l ow 2 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 1 0 1 3 1 content content 2 2 3 3 2 2 hh w 0.32017601 0.39965901 
l ow w 0 0 1 1 0 1 1 3 1 4 1 1 1 0 1 0 1 0 1 0 1 ow 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 1 0 1 3 4 content content 2 1 3 3 2 2 ax er 0.39965901 0.55004603 
ow w er 0 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 1 4 content content 2 1 3 3 2 2 l l 0.55004603 0.62555099 
w er l 1 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 4 4 content content 1 1 3 3 2 2 ow d 0.62555099 0.72588098 
er l d 2 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 4 4 content content 1 1 3 3 2 2 w pau 0.72588098 0.81338102 
l d pau 3 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 0 1 0 4 0 content 0 1 0 3 0 2 0 er 0 0.81338102 0.88916397 
d pau 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 1 1 0 1 0 4 0 content 0 1 0 3 0 2 0 l 0 0.88916397 1.33796

gawk

mkdir full
mkdir mono
base=test_001
gawk -f $scripts/label-full.awk $labels/tmp > $labels/full/$base.lab; \
gawk -f $scripts/label-mono.awk $labels/tmp > $labels/mono/$base.lab;

Directory structure after execution

|-- full
|   `-- test_001.lab
|-- mono
|   `-- test_001.lab
|-- scm.scm
|-- test_001.utt
|-- test_002.utt
|-- test_003.utt
|-- test_004.utt
|-- test_005.utt
|-- test_006.utt
`-- tmp

Contents of full/test_001.lab:

         0    2200000 x^x-pau+hh=ax@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_2/G:0_0/H:x=x@1=1|0/I:3=2/J:3+2-1
   2200000    2779540 x^pau-hh+ax=l@1_3/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   2779540    3201760 pau^hh-ax+l=ow@2_2/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   3201760    3996590 hh^ax-l+ow=w@3_1/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   3996590    5500460 ax^l-ow+w=er@1_1/A:0_0_3/B:1-1-1@2-1&2-2#1-2$1-2!0-1;0-1|ow/C:1+1+4/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   5500460    6255510 l^ow-w+er=l@1_4/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   6255510    7258810 ow^w-er+l=d@2_3/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   7258810    8133810 w^er-l+d=pau@3_2/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   8133810    8891640 er^l-d+pau=x@4_1/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   8891640   13379600 l^d-pau+x=x@x_x/A:1_1_4/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:content_1/E:x+x@x+x&x+x#x+x/F:0_0/G:3_2/H:x=x@1=1|0/I:0=0/J:3+2-1

Contents of mono/test_001.lab:

         0    2200000 pau
   2200000    2779540 hh
   2779540    3201760 ax
   3201760    3996590 l
   3996590    5500460 ow
   5500460    6255510 w
   6255510    7258810 er
   7258810    8133810 l
   8133810    8891640 d
   8891640   13379600 pau
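The times in the .lab files are consistent with the last two columns of tmp (segment start and end in seconds) converted to HTK units of 100 ns. The sketch below reproduces that conversion. It is an observation about the listings above, not a transcription of the awk scripts, and seconds_to_htk is a hypothetical name.

```python
def seconds_to_htk(seconds):
    """Convert seconds to HTK time units (100 ns each), rounded to an integer."""
    return int(round(seconds * 1e7))


# (start, end, phone) for the first segments, with times taken from tmp
segments = [(0.0, 0.22, "pau"),
            (0.22, 0.27795401, "hh"),
            (0.27795401, 0.32017601, "ax")]

for start, end, phone in segments:
    print(seconds_to_htk(start), seconds_to_htk(end), phone)
```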

normalize_lab_for_merlin.py

Path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/utils/normalize_lab_for_merlin.py

Purpose

Normalizes the mono and full labels produced in the steps above so that Merlin can use them.

According to CSTR-Edinburgh/merlin#156, this script mainly does three things:

Normalize duration to nearest divisible number by 5. Say 1.413 -> 1.415
Merge consecutive silence phones or pause phones to one.
Get rid of timestamps if required -- input format for HTK alignment

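That is:

Round each duration to the nearest value divisible by 5
Merge consecutive silences and pauses
Remove the time stamps if required

Parameters

Usage: python normalize_lab_for_merlin.py <input_lab_dir> <output_lab_dir> <label_style> <file_id_list_scp> <optional: write_time_stamps (1/0)>

<input_lab_dir>                          Directory containing the full labels
<output_lab_dir>                         Directory where the normalized labels are saved
<label_style>                            Alignment style to use; phone_align and state_align are supported
<file_id_list_scp>                       Path to the label file list
<optional: write_time_stamps (1/0)>      Whether to write time stamps (optional; defaults to 1)

Note: the process above does not currently use the mono label information.
Note: for training data, specify <label_style>=phone_align and set <write_time_stamps>=0. For test data there is no such requirement (recommended: <label_style>=state_align, <write_time_stamps>=1).

Result

The normalized results are saved in <output_lab_dir>, with the same file names as the originals. The contents are as follows: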

0 2200000 x^x-sil+hh=ax@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_2/G:0_0/H:x=x@1=1|0/I:3=2/J:3+2-1
2200000 2800000 x^sil-hh+ax=l@1_3/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
2800000 3200000 sil^hh-ax+l=ow@2_2/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
3200000 4000000 hh^ax-l+ow=w@3_1/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
4000000 5500000 ax^l-ow+w=er@1_1/A:0_0_3/B:1-1-1@2-1&2-2#1-2$1-2!0-1;0-1|ow/C:1+1+4/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
5500000 6250000 l^ow-w+er=l@1_4/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
6250000 7250000 ow^w-er+l=d@2_3/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
7250000 8150000 w^er-l+d=sil@3_2/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
8150000 8900000 er^l-d+sil=x@4_1/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
8900000 13400000 l^d-sil+x=x@x_x/A:1_1_4/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:content_1/E:x+x@x+x&x+x#x+x/F:0_0/G:3_2/H:x=x@1=1|0/I:0=0/J:3+2-1
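Comparing these times with the unnormalized full labels, each boundary has moved to the nearest multiple of 50000 HTK units (5 ms). The sketch below reproduces that rounding and shows a simplified version of the silence merge on mono-style (start, end, phone) tuples. Both functions are hypothetical illustrations, not the code of normalize_lab_for_merlin.py, which works on full context labels.

```python
FRAME_SHIFT_HTK = 50000  # 5 ms in HTK units of 100 ns


def round_to_frame(time_htk, step=FRAME_SHIFT_HTK):
    """Round to the nearest multiple of `step` (exact halves round up)."""
    return (time_htk + step // 2) // step * step


def merge_silences(segments, silence=("sil", "pau")):
    """Merge runs of consecutive silence or pause segments into one."""
    merged = []
    for start, end, phone in segments:
        if merged and phone in silence and merged[-1][2] in silence:
            prev_start, _, prev_phone = merged[-1]
            merged[-1] = (prev_start, end, prev_phone)
        else:
            merged.append((start, end, phone))
    return merged


for boundary in (2779540, 3201760, 6255510, 13379600):
    print(boundary, "->", round_to_frame(boundary))

print(merge_silences([(0, 100, "pau"), (100, 300, "pau"), (300, 500, "hh")]))
```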

The normalized labels are then fed to forced_alignment.py to perform the alignment. How the alignment works is described in a later article.


model.md

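Model training

This article describes how to train the duration model and the acoustic model.

Configuration file

Training the duration model requires a configuration file (as does the acoustic model later). In general, it is enough to modify a sample configuration file. For example, the sample configuration file for training a DNN model is duration_demo.conf.

Merlin calls these sample configuration files recipes; all recipes can be found at: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/recipes .

A configuration file mainly contains path information, the alignment style, the question set name, the model architecture, the data split, the processes to execute, and so on.

run_merlin.py

The program entry point, located at: https://github.com/CSTR-Edinburgh/merlin/blob/master/src/run_merlin.py

Execution flow

Execution varies according to the sub-processes enabled in the configuration file.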

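No.	Code	Config key	Default	Description
1	GenTestList	GenTestList	False	Generate the test list
2	AcousticModel	AcousticModel	False	Acoustic model
3	NORMLAB	NORMLAB	False	Normalize the labels
4	MAKEDUR	MAKEDUR	False	Generate the output duration data
5	MAKECMP	MAKECMP	False	Generate the output acoustic data
6	NORMCMP	NORMCMP	False	Normalize the output acoustic data
7	TRAINDNN	TRAINDNN	False	Whether to train the model
8	GENBNFEA	GENBNFEA	False	Generate bottleneck features
9	DNNGEN	DNNGEN	False	Prediction
10	GENWAV	GENWAV	False	Generate wav audio
11	DurationModel	DurationModel	False	Duration model
12	CALMCD	CALMCD	False	Model evaluation

All of these parameters default to False, so the configuration file only needs to set those that should be True.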
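The "default False, override with True" behaviour can be sketched with Python's configparser. This is not how run_merlin.py itself reads its configuration. The section name Processes is an assumption, and load_process_flags is a hypothetical helper.

```python
import configparser

PROCESS_FLAGS = ["GenTestList", "AcousticModel", "NORMLAB", "MAKEDUR",
                 "MAKECMP", "NORMCMP", "TRAINDNN", "GENBNFEA", "DNNGEN",
                 "GENWAV", "DurationModel", "CALMCD"]


def load_process_flags(conf_text, section="Processes"):
    """Return {flag: bool}; flags missing from the config stay False.

    Assumption: the flags live in a section called `section`.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of keys such as NORMLAB
    parser.read_string(conf_text)

    flags = {name: False for name in PROCESS_FLAGS}
    if parser.has_section(section):
        for name in PROCESS_FLAGS:
            if parser.has_option(section, name):
                flags[name] = parser.getboolean(section, name)
    return flags


sample = """
[Processes]
NORMLAB  : True
DNNGEN   : True
"""
enabled = [name for name, on in load_process_flags(sample).items() if on]
print(enabled)
```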

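The configuration files for training the duration model, training the acoustic model, testing the duration model, and testing the acoustic model specify the following processes, respectively:

Training the duration model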

NORMLAB  : True
MAKEDUR  : True
MAKECMP  : True
NORMCMP  : True

TRAINDNN : True
DNNGEN   : True

CALMCD   : True

Training the acoustic model

NORMLAB  : True
MAKECMP  : True
NORMCMP  : True

TRAINDNN : True
DNNGEN   : True

GENWAV   : True
CALMCD   : True

Testing the duration model

NORMLAB: True

DNNGEN: True

Testing the acoustic model

NORMLAB  : True
DNNGEN   : True

GENWAV   : True

shartoo commented Sep 7, 2018

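Would you be interested in rewriting Merlin? I feel Merlin is written in an overly complex and redundant way, and TensorFlow is not fully supported either. Theano is no longer being updated.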

candlewill/extract_features_for_merlin.md
