Building an English-Vietnamese parallel corpus by hand from bilingual stories was painful and wasted a lot of time, so instead I collected as many existing corpora as I could find across the internet. My final dataset consists of about 2.5M sentence pairs. You can find all the corpora here: link
I use OpenNMT to train my NMT model. Thanks to SYSTRAN and HarvardNLP for open-sourcing this project; it helps me and many others understand how an industrial translation system might work. The parameters of my model are as follows:
- Preprocessing: using the `aggressive` tokenizer provided by OpenNMT
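To give a feel for what `aggressive` mode does, here is a rough stdlib-only approximation (the real tokenizer ships with OpenNMT, e.g. `tools/tokenize.lua -mode aggressive` or the `pyonmttok` package; this regex sketch just illustrates the behavior of splitting on every punctuation mark and separating letters from digits):

```python
import re

def aggressive_tokenize(text):
    # Approximation of OpenNMT "aggressive" tokenization:
    # - runs of letters stay together
    # - runs of digits stay together (and are split off from letters)
    # - every punctuation character becomes its own token
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)

print(aggressive_tokenize("it's 10km to Hà Nội."))
# → ['it', "'", 's', '10', 'km', 'to', 'Hà', 'Nội', '.']
```

Splitting this aggressively shrinks the vocabulary (e.g. "10km" no longer needs its own entry), which matters when the corpus is only a few million sentence pairs.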