- Split source-language documents into sentences before passing them through the model.
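A minimal sketch of the sentence-splitting step, assuming a naive punctuation-based rule is good enough for the input. The helper name split_sentences is hypothetical, and a dedicated sentence segmenter would handle abbreviations and languages without these punctuation marks far better.

```python
import re
from typing import List


def split_sentences(document: str) -> List[str]:
    """Naively split a document after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    # Drop empty strings, which appear when the document itself is empty.
    return [part for part in parts if part]


sentences = split_sentences("Hello world. How are you? Fine!")
print(sentences)
```

Each resulting sentence can then be passed to the model as one entry of src_text.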
Model names use the following format: Helsinki-NLP/opus-mt-{src}-{tgt}
Fully capitalized values for src and tgt refer to group names in the GROUP_MEMBERS lookup below.
If src is in all caps, the model supports multiple input languages;
the model card lists which ones.
If tgt is in all caps, the model can output multiple languages. Select the output language by prepending its code, wrapped in >> and <<, to every entry of src_text, for example src_text = ['>>zh<< sentence 1', '>>zh<< sentence 2'] (TODO: EXAMPLE)
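The prefixing convention can be sketched as follows. The helper names add_target_token and translate are hypothetical. The translate function assumes the transformers library with its MarianTokenizer and MarianMTModel classes and a PyTorch backend; it is defined here but not called, since calling it would download a model.

```python
from typing import List


def add_target_token(sentences: List[str], lang_code: str) -> List[str]:
    """Prepend the >>code<< token that selects the output language."""
    return [f">>{lang_code}<< {sentence}" for sentence in sentences]


def translate(model_name: str, src_text: List[str]) -> List[str]:
    """Sketch of a translation call (assumes transformers and torch are installed)."""
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(src_text, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


src_text = add_target_token(["sentence 1", "sentence 2"], "zh")
print(src_text)
```

The prefix is only needed when tgt is a group name; for a single-language tgt the sentences are passed unchanged.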
Sometimes, models are trained on a collection of languages that does not resolve to a group. In this case, _ is used as a separator for src or tgt, as in
'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'
The language codes used to name models are inconsistent. Two-letter codes can usually be found here; three-letter codes require googling f"language code {code}".
Codes formatted like es_AR are usually code_{region}. That one is Spanish documents from Argentina.
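As a small illustration of the code_{region} pattern, the sketch below splits a code at its first underscore. The helper name split_region_code is hypothetical, and because the pattern only holds "usually", the second part is not guaranteed to be a region (zh_yue in the ZH group, for instance, names a language variety).

```python
from typing import Optional, Tuple


def split_region_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'es_AR' into ('es', 'AR'); codes without '_' have no region part."""
    language, separator, region = code.partition("_")
    return language, (region if separator else None)


print(split_region_code("es_AR"))
print(split_region_code("fr"))
```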
You can see possible language codes in tokenizer.supported_language_codes
Examples:
- 'Helsinki-NLP/opus-mt-INSULAR_CELTIC-en': from all Insular Celtic languages to English.
- 'Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU': from all NORTH_EU languages to all NORTH_EU languages. Use a special language code like >>de<< to specify the output language.
- 'Helsinki-NLP/opus-mt-ROMANCE-en': from many Romance languages to English.
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'INSULAR_CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
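The naming rules above can be combined into a small resolver that turns a model name into its source and target language lists. This is a sketch under stated assumptions: the names GROUP_MEMBERS_EXCERPT, expand_side and languages_for are hypothetical, the dict is a short excerpt of the lookup above so the block runs on its own, and every model name is assumed to follow the Helsinki-NLP/opus-mt-{src}-{tgt} format exactly.

```python
from typing import Dict, List, Tuple

# Excerpt of the GROUP_MEMBERS lookup, copied here to keep the sketch self-contained.
GROUP_MEMBERS_EXCERPT: Dict[str, List[str]] = {
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "SAMI": ["se", "sma", "smj", "smn", "sms"],
    "INSULAR_CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}

PREFIX = "Helsinki-NLP/opus-mt-"


def expand_side(side: str, groups: Dict[str, List[str]]) -> List[str]:
    """Expand one side of a model name (src or tgt) into language codes."""
    if side in groups:
        return list(groups[side])
    if side.isupper():
        # All caps but not in the lookup: refuse to guess the members.
        raise KeyError(f"unknown language group: {side}")
    # Lowercase sides hold one code, or several codes joined by '_'.
    return side.split("_")


def languages_for(
    model_name: str, groups: Dict[str, List[str]] = GROUP_MEMBERS_EXCERPT
) -> Tuple[List[str], List[str]]:
    """Return (source languages, target languages) for a model name."""
    if not model_name.startswith(PREFIX):
        raise ValueError(f"unexpected model name: {model_name}")
    src, separator, tgt = model_name[len(PREFIX):].partition("-")
    if not separator:
        raise ValueError(f"model name has no target part: {model_name}")
    return expand_side(src, groups), expand_side(tgt, groups)


src_langs, tgt_langs = languages_for("Helsinki-NLP/opus-mt-INSULAR_CELTIC-en")
print(src_langs, tgt_langs)
```

Passing the full GROUP_MEMBERS dict as the groups argument would cover the remaining groups. A target list with more than one entry signals that the >>code<< prefix is needed on each input sentence.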