# This example uses M2M-100 models converted to the CTranslate2 format.
# Download CTranslate2 models:
# • M2M-100 418M-parameter model: https://bit.ly/33fM1AO
# • M2M-100 1.2B-parameter model: https://bit.ly/3GYiaed

import ctranslate2
import sentencepiece as spm

# [Modify] Set file paths of the source and target
source_file_path = "source_test.en"
target_file_path = "target_test.ja.mt"

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "m2m100_ct2/"
sp_model_path = "m2m100_ct2/sentencepiece.model"

# [Modify] Set language prefixes of the source and target
src_prefix = "__en__"
tgt_prefix = "__ja__"

# [Modify] Set the device and beam size
device = "cpu"  # or "cuda" for GPU
beam_size = 5

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Open the source file
with open(source_file_path, "r") as source:
    lines = source.readlines()
    source_sents = [line.strip() for line in lines]

target_prefix = [[tgt_prefix]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])

# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device=device)
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, beam_size=beam_size, target_prefix=target_prefix)
translations = [translation[0]['tokens'] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])

# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")

print("Done! Target file saved at:", target_file_path)
Hi, I'm interested in playing around with this model, but the two links to the converted CTranslate2 models fail with "access denied". If you still have these models ready, do you think you could fix the links? I tried converting them myself, but I'm running into all sorts of errors with the latest versions of these tools and can't figure out which versions I should roll back to.
@gertjanvanzwieten Hello! I have just fixed the links. Thanks!
Btw, I do not know which language pairs you work with, but I think NLLB-200 models are better for most of the supported language pairs.
Thanks! And thanks for pointing me to NLLB-200 as well; I've only just started to explore the space and hadn't come across that one yet. I found your forum post now and will certainly include it in my tests.
M2M-100 Multilingual Neural Machine Translation Model
M2M-100 in CTranslate2 format
CTranslate2 is a fast inference engine for Transformer models. It supports models originally trained with OpenNMT-py, OpenNMT-tf, and Fairseq. CTranslate2 is preferred for its high efficiency: it is cross-platform and can run on either CPU or GPU.
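As a minimal illustration of that CPU/GPU flexibility (a sketch only; the model directory name is the one used later in this gist, and the ctranslate2 package is assumed to be installed from PyPI):

import ctranslate2

# Load a converted M2M-100 model; device can be "cpu", "cuda", or "auto".
translator = ctranslate2.Translator("m2m100_ct2/", device="auto")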
You can download one of the M2M-100 models converted to the CTranslate2 format:
• M2M-100 418M-parameter model: https://bit.ly/33fM1AO
• M2M-100 1.2B-parameter model: https://bit.ly/3GYiaed
How to convert an M2M-100 model to CTranslate2
Alternatively, you can convert an M2M-100 model to the CTranslate2 format yourself as follows:
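A minimal sketch of the conversion using CTranslate2's Fairseq converter is shown below. The checkpoint and dictionary file names (418M_last_checkpoint.pt, model_dict.128k.txt) are those shipped with the original fairseq M2M-100 release and are assumptions about your local paths; fairseq and torch must be installed for the converter to run.

# Sketch: convert an original fairseq M2M-100 checkpoint to CTranslate2.
# Adjust the file names to match your local download of the fairseq release.
import ctranslate2

converter = ctranslate2.converters.FairseqConverter(
    model_path="418M_last_checkpoint.pt",    # fairseq checkpoint (418M or 1.2B)
    data_dir=".",                            # directory containing the dictionary
    fixed_dictionary="model_dict.128k.txt",  # shared dictionary of the M2M-100 release
)
converter.convert(output_dir="m2m100_ct2", quantization="int8", force=True)

The same conversion is also exposed on the command line as ct2-fairseq-converter, with matching --model_path, --data_dir, --fixed_dictionary, and --output_dir options.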
Translation with M2M-100 models
You can use the script in this gist to translate a source file using M2M-100, as follows:
Make sure you change the paths to the source file source_file_path, the CTranslate2 model ct_model_path, and the SentencePiece model sp_model_path.
M2M-100 uses a source language token and a target language token. The latter is used for prefix-constrained decoding, to generate the translation in the specified language. In the script, make sure you adjust src_prefix and tgt_prefix (see the sketch after these steps for how the prefixes are applied). The list of supported languages and their language codes can be found here.
Now, run the Python script as usual, which should translate the source file and generate the target file at the path specified by target_file_path.
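To make the prefix handling concrete, here is a minimal single-sentence sketch assuming the same model paths and language prefixes as the script above; it accesses results through the hypotheses attribute of recent CTranslate2 versions rather than the older dict-style results used in the script.

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("m2m100_ct2/sentencepiece.model")
translator = ctranslate2.Translator("m2m100_ct2/", device="cpu")

# Prepend the source language token to the subworded source sentence,
# and pass the target language token as a decoding prefix.
tokens = ["__en__"] + sp.encode("How are you?", out_type=str)
result = translator.translate_batch([tokens], target_prefix=[["__ja__"]])

# The target prefix is included in the output tokens; drop it before detokenizing.
print(sp.decode(result[0].hypotheses[0][1:]))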
Testing M2M-100 with English-to-Japanese
Test Dataset
M2M-100 418M-parameter model
M2M-100 1.2B-parameter model
Using M2M-100 models with a GUI
You can also use M2M-100 models in DesktopTranslator, a local cross-platform machine translation GUI. It also has stand-alone executables for Mac and Windows.