This is a brief explanation of how to build a new model for ilive tts.
The input to the model generation process is a set of directories containing the files required for model generation. These directories are the following:
- `text` dir contains a set of diacritized Arabic text sentences, each in a separate file.
- `wav` dir contains a set of wav files, each holding the Arabic pronunciation of the corresponding sentence in the `text` dir.
- `lab` dir contains the pronunciation timing of each speech segment for each corresponding text and wav file.
- `language` dir, which we create, contains the pre-processing outputs.
- `phonemizer` dir contains the set of pronunciation rules for Arabic language phonemes.
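Since every sentence needs a text file, a recording and a label file, it can help to check the three directories against each other before starting. The following is only a sketch: it assumes the directories are named `text`, `wav` and `lab` under one base folder, that the files share a base name, and that they use the extensions `.txt`, `.wav` and `.lab` (the `.lab` extension and the shared base name are assumptions, not something this document states). `check_corpus` is a hypothetical helper name.

```shell
# Report sentences that lack a matching recording or label file.
# Assumption: text/NAME.txt, wav/NAME.wav and lab/NAME.lab share the same NAME.
check_corpus() {
    base_dir=$1
    missing=0
    for txt in "$base_dir"/text/*.txt; do
        [ -e "$txt" ] || continue            # the glob matched nothing
        name=$(basename "$txt" .txt)
        # In this sketch the directory name doubles as the file extension.
        for dir in wav lab; do
            if [ ! -f "$base_dir/$dir/$name.$dir" ]; then
                echo "missing: $dir/$name.$dir"
                missing=$((missing + 1))
            fi
        done
    done
    echo "$missing missing file(s)"
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

It could be called as `check_corpus /path/to/base`, where the path is whichever folder holds the input directories.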
The input acquired from the linguistics team needs some pre-processing to generate the intermediate files needed while generating the model.
- gather all unique words from all sentences into one file and put it in the `language` folder.
# enter text directory
cd text
# put all sentences in one file in language directory
awk 'FNR==1{print ""}1' *.txt > ../language/text_ar.txt
# enter language directory
cd ../language
# get unique words from all sentences and put them in one file
tr -s '[:space:]' '\n' < text_ar.txt | sort | uniq > unique_words.txt
# then manually remove space ` ` and `,` if they exist in the unique words file
- in the `lab` directory, in all files, replace all `SIL` and `SSIL` with `_`.
# enter lab directory
cd lab
# replace all SSIL with _ in all files
sed -i -- 's/SSIL/_/g' *
# replace all SIL with _ in all files
sed -i -- 's/SIL/_/g' *
- in project `phonemizer-continuous` resources [`src/main/resources/com/univox`], replace [`allophones.ar_SA.xml`, `ArabicPhonemesMap`, `ArabicScript`] with the files from the `phonemizer` directory, and also in both the jar file and the rar file inside it.
- in project `marytts`: in `marytts-runtime/src/main/resources` and `marytts-runtime/src/main/java/marytts/com/univox`, replace [`ArabicPhonemesMap`, `ArabicScript`, `allophones.ar_SA.xml`] with the files from the previous step.
- put the unique words file in the base folder of project `phonemizer-continuous`.
- in project `phonemizer-continuous`: in `src/main/java/com/univox/PhonemizerMain.java`, make sure that the file name used is the same as the name of your unique words file in the base folder.
String filename="unique_words.txt";
- run project `phonemizer-continuous` as a java application. [this will take a few minutes]
- move the output of the `phonemizer-continuous` project, named `unique_words.ph`, to the `language` directory and name it `ar.txt`.
- in the `language` folder, in the `ar.txt` file, replace all `__` with ` functional` (yes: with a leading space), then remove all the remaining `_` from the file.
# enter language folder
cd language
# replace "__" with " functional"
sed -i -- 's/__/ functional/g' ar.txt
# remove remaining '_' in the file
sed -i -- 's/_//g' ar.txt
- in the `marytts` project, delete the `target` folder, then rebuild `marytts`
# enter marytts directory
cd marytts
mvn -Dmaven.test.skip=true install
- in the `language` folder, run `transcription.sh`, which results from the `marytts` build in the previous step.
cd language
transcription.sh
this will open a GUI tool that requires a few steps:
- it asks for the `allophones.ar_SA.xml` file; select it from its location.
- then from the `file` menu select `open`, and then select the `ar.txt` file from the `language` directory.
- check all words; none should be in red. A red word indicates an error.
- click the `train and predict` button.
- from the `file` menu, select `save`. This saves a few files in the `language` directory.
- close the GUI tool.
- in the `marytts` project, delete the `target` folder.
- in the `marytts` project, in directory `marytts-languages/marytts-lang-ar`, delete the `target` folder.
- in the `marytts` project, in directory `marytts-languages/marytts-lang-ar/src/main/resources/marytts/language/ar/lexicon`, replace the files in it with the output files from the `transcription.sh` tool.
- in the directory from the previous step, rename `allophones.ar_SA.xml` to `allophones.ar.xml` and also remove the `_SA` from the `lang` tag inside the file.
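The rename and the tag edit in this step could also be done from the shell. The sketch below assumes the locale is written in the file as `lang="ar_SA"` (for example as part of `xml:lang="ar_SA"`); that spelling is an assumption, so check the actual attribute in your file first. `rename_allophones` is a hypothetical helper name.

```shell
# Rename the allophones file and drop the region suffix from its lang attribute.
# Assumption: the locale is written as lang="ar_SA" (e.g. xml:lang="ar_SA").
rename_allophones() {
    lexicon_dir=$1
    src="$lexicon_dir/allophones.ar_SA.xml"
    dst="$lexicon_dir/allophones.ar.xml"
    # Write the edited copy first and remove the original only if sed
    # succeeded, so a failed run does not lose the file.
    sed 's/lang="ar_SA"/lang="ar"/g' "$src" > "$dst" && rm "$src"
}
```

It takes the lexicon directory from the previous step as its only argument.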
- in project `marytts`: in dir `marytts-languages/marytts-lang-ar/lib/modules/ar/lexicon`, replace the two files [`allophones.ar.xml`, `ar`] with the modified `allophones.ar.xml` file from the previous step and the `ar.txt` file from the `language` directory after renaming it to `ar` only.
- in project `marytts`: in dir `marytts-languages/marytts-lang-ar`, test that everything is okay.
mvn test
- in project `marytts`: in dir `user-dictionaries`, replace `userdict-ar.txt` with the output of the `phonemizer-continuous` project, which is `unique_words.ph`, after renaming it to `userdict-ar.txt`.
- edit the file moved to `user-dictionaries`: remove `__` and replace `_` with `| `.
sed -i -- 's/__//g' userdict-ar.txt
sed -i -- 's/_/| /g' userdict-ar.txt
- in the `marytts` project, delete the `target` folder, then build `marytts`
# enter marytts directory
cd marytts
mvn -Dmaven.test.skip=true install
- open the `marytts` project (univox workspace) in eclipse and add build configurations for the server and the client as follows.
- run server: run configuration with `~/git/marytts` as mary base
- run DatabaseImportMain: run configuration with the `db` directory as the database base folder.
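The three text clean-up steps earlier in this procedure (the `lab` files, `ar.txt` and `userdict-ar.txt`) all apply ordered `sed` substitutions, and the order matters in each case. As a sketch only, they could be collected into small shell functions like the ones below. The function names are hypothetical, and the helper writes to a temporary file instead of using `sed -i`, which is an assumption made here for portability to systems where `sed -i` behaves differently.

```shell
# Helper: apply sed expressions to a file and replace it with the result,
# without relying on "sed -i".
edit_in_place() {
    file=$1
    shift
    sed "$@" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# lab step: replace SSIL before SIL, otherwise "SSIL" would become "S_".
clean_lab_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        edit_in_place "$f" -e 's/SSIL/_/g' -e 's/SIL/_/g'
    done
}

# language step: "__" becomes " functional", then remaining "_" are removed.
clean_ar_txt() {
    edit_in_place "$1" -e 's/__/ functional/g' -e 's/_//g'
}

# user dictionary step: drop "__" first, then turn each remaining "_" into "| ".
clean_userdict() {
    edit_in_place "$1" -e 's/__//g' -e 's/_/| /g'
}
```

Each function takes the path it works on, for example `clean_lab_dir lab`, `clean_ar_txt language/ar.txt` and `clean_userdict user-dictionaries/userdict-ar.txt`, relative to wherever those folders live on your machine.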