@sshleifer
Created May 12, 2020 21:45
Show Gist options
  • Save sshleifer/e13e60e148a8780abf0d3669792f7196 to your computer and use it in GitHub Desktop.
Save sshleifer/e13e60e148a8780abf0d3669792f7196 to your computer and use it in GitHub Desktop.

MarianMTModel Best Practices:

  • Split source-language documents into sentences before passing them through the model.
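
A minimal sketch of the splitting step, assuming a naive punctuation-based rule is good enough for the input. The helper name split_into_sentences is hypothetical and not part of any library; abbreviations such as "e.g." will be split incorrectly, so a dedicated sentence segmenter is likely a better choice for real documents.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(document):
    """Return the non-empty sentences of `document`, in their original order."""
    return [s.strip() for s in _SENTENCE_END.split(document) if s.strip()]


document = "The weather is nice. Shall we go out? Yes!"
src_text = split_into_sentences(document)
print(src_text)
```

The resulting list can be used directly as the src_text batch for the tokenizer.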

Model Naming

Model names use the format Helsinki-NLP/opus-mt-{src}-{tgt}. Fully capitalized values of src and tgt are group names, listed in the GROUP_MEMBERS lookup below. If src is in all caps, the model supports multiple input languages; the model card lists which ones.

If tgt is in all caps, the model can output multiple languages. Select the output language by prepending its code, wrapped as >>code<<, to every entry of src_text, for example src_text = ['>>zh<< sentence 1', '>>zh<< sentence 2'].
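
The prefixing step could look roughly like the sketch below. add_target_language and translate are hypothetical helper names. translate assumes a recent transformers release in which the tokenizer is callable and returns PyTorch tensors; releases from around the time of this gist exposed a different batching method, so treat that part as an assumption.

```python
def add_target_language(sentences, lang_code):
    """Prefix each sentence with the >>code<< token a multi-target model expects."""
    prefix = f">>{lang_code}<< "
    return [prefix + sentence for sentence in sentences]


def translate(src_text, model_name):
    """Translate a batch of (already prefixed) sentences with a Marian model."""
    # Imported lazily so add_target_language works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(src_text, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


src_text = add_target_language(["sentence 1", "sentence 2"], "zh")
print(src_text)  # ['>>zh<< sentence 1', '>>zh<< sentence 2']
```

Calling translate(src_text, model_name) with a multi-target model name then downloads the weights and returns the decoded translations; the language code has to be one that the chosen model supports.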

Sometimes a model is trained on an arbitrary collection of languages that does not resolve to a group. In this case, _ is used as a separator within src or tgt, as in

'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'

The language codes used to name models are inconsistent. Two-letter codes can usually be found in a standard language code list; three-letter codes require googling f"language code {code}". Codes formatted like es_AR are usually {code}_{region}; that one is Spanish documents from Argentina. You can see the possible language codes in tokenizer.supported_language_codes.

Examples:

  • 'Helsinki-NLP/opus-mt-INSULAR_CELTIC-en': from all Insular Celtic languages to English.
  • 'Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU': from all NORTH_EU languages to all NORTH_EU languages. Use a special language code like >>de<< to specify the output language.
  • 'Helsinki-NLP/opus-mt-ROMANCE-en': from many Romance languages to English.
GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'], 
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'], 
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'], 
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'], 
 'INSULAR_CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
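
One way to apply the naming convention and the lookup together is sketched below. parse_model_name and languages are hypothetical helpers, and the dictionary is only an excerpt of the lookup above, repeated so the block runs on its own. The sketch assumes src itself contains no hyphen.

```python
PREFIX = "Helsinki-NLP/opus-mt-"

# Excerpt of the GROUP_MEMBERS lookup, copied so this sketch is self-contained.
GROUP_MEMBERS = {
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "INSULAR_CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}


def parse_model_name(model_name):
    """Split 'Helsinki-NLP/opus-mt-{src}-{tgt}' into the pair (src, tgt)."""
    if not model_name.startswith(PREFIX):
        raise ValueError(f"unexpected model name: {model_name}")
    # Assumption: src has no hyphen, so the first hyphen separates src from tgt.
    src, sep, tgt = model_name[len(PREFIX):].partition("-")
    if not (sep and src and tgt):
        raise ValueError(f"cannot find src and tgt in: {model_name}")
    return src, tgt


def languages(side):
    """Expand one side of a model name into a list of language codes."""
    if side.isupper():
        # All caps means a group name; unknown groups yield an empty list.
        return GROUP_MEMBERS.get(side, [])
    return [side]


src, tgt = parse_model_name("Helsinki-NLP/opus-mt-INSULAR_CELTIC-en")
print(src, languages(src))
print(tgt, languages(tgt))
```

For the example name, src expands to the six Insular Celtic codes and tgt stays the single code en.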

Conversion Utilities
