Example of translating a file with M2M-100 using CTranslate2
# This example uses M2M-100 models converted to the CTranslate2 format.
# Download CTranslate2 models:
# • M2M-100 418M-parameter model: https://bit.ly/33fM1AO
# • M2M-100 1.2B-parameter model: https://bit.ly/3GYiaed

import ctranslate2
import sentencepiece as spm

# [Modify] Set file paths of the source and target
source_file_path = "source_test.en"
target_file_path = "target_test.ja.mt"

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "m2m100_ct2/"
sp_model_path = "m2m100_ct2/sentencepiece.model"

# [Modify] Set language prefixes of the source and target
src_prefix = "__en__"
tgt_prefix = "__ja__"

# [Modify] Set the device and beam size
device = "cpu"  # or "cuda" for GPU
beam_size = 5

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Open the source file
with open(source_file_path, "r") as source:
    lines = source.readlines()
    source_sents = [line.strip() for line in lines]

target_prefix = [[tgt_prefix]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])

# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device=device)
translations = translator.translate_batch(source_sents_subworded,
                                          batch_type="tokens",
                                          max_batch_size=2024,
                                          beam_size=beam_size,
                                          target_prefix=target_prefix)
# Keep the best hypothesis (token list) for each sentence
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])

# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")

print("Done! Target file saved at:", target_file_path)
M2M-100 Multilingual Neural Machine Translation Model
M2M-100 in CTranslate2 format
CTranslate2 is a fast inference engine for Transformer models. It supports models originally trained with OpenNMT-py, OpenNMT-tf, and Fairseq, and is preferred for its high efficiency. It is cross-platform and can run on either CPU or GPU.
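As a small illustration of this flexibility, the same converted model directory can be loaded on either device, optionally with a quantized compute type. The model path below is a placeholder, and compute_type is an optional CTranslate2 setting not used in the gist script:

import ctranslate2

# Load a converted model on the CPU with 8-bit quantized weights
# (placeholder path; adjust to your converted model directory).
cpu_translator = ctranslate2.Translator("m2m100_ct2/", device="cpu", compute_type="int8")

# Or load the same model on the first GPU with 16-bit weights.
gpu_translator = ctranslate2.Translator("m2m100_ct2/", device="cuda", device_index=0, compute_type="float16")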
You can download one of the M2M-100 models, already converted to the CTranslate2 format:
- M2M-100 418M-parameter model: https://bit.ly/33fM1AO
- M2M-100 1.2B-parameter model: https://bit.ly/3GYiaed
How to convert an M2M-100 model to CTranslate2
Alternatively, you can convert an M2M-100 model to the CTranslate2 format yourself as follows:
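The snippet below is a minimal sketch using CTranslate2's Python converter API (the ct2-fairseq-converter command-line tool is the equivalent). The file names are assumptions based on the original fairseq M2M-100 418M release and may differ for the checkpoint you download; adjust the paths accordingly.

from ctranslate2.converters import FairseqConverter
import shutil

# Assumed file names from the original fairseq M2M-100 418M release, placed in
# the current directory: 418M_last_checkpoint.pt, model_dict.128k.txt, spm.128k.model.
converter = FairseqConverter(
    model_path="418M_last_checkpoint.pt",    # original fairseq checkpoint
    data_dir=".",                            # directory containing the dictionary
    fixed_dictionary="model_dict.128k.txt",  # shared dictionary used by M2M-100
)
converter.convert("m2m100_ct2")              # writes the CTranslate2 model here

# Keep the SentencePiece model next to the converted model so that the
# translation script can load it as m2m100_ct2/sentencepiece.model.
shutil.copy("spm.128k.model", "m2m100_ct2/sentencepiece.model")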
Translation with M2M-100 models
You can use the script in this gist to translate a source file using M2M-100, as follows:
1. Make sure you change the paths to the source file (source_file_path), the CTranslate2 model (ct_model_path), and the SentencePiece model (sp_model_path).
2. M2M-100 uses a source language token and a target language token. The latter is used for prefix-constrained decoding, to generate the translation in the specified language. In the script, make sure you adjust src_prefix and tgt_prefix accordingly (see the minimal sketch after this list). The list of supported languages and their language codes can be found here.
3. Now, run the Python script as usual, which should translate the source file and generate the target file at the path specified with target_file_path.
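To make the role of the language prefixes concrete, here is a minimal single-sentence sketch of the same steps as the gist script (the sentence, paths, and language pair are placeholders):

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("m2m100_ct2/sentencepiece.model")
translator = ctranslate2.Translator("m2m100_ct2/", device="cpu")

# Tokenize one English sentence and prepend the source language token.
tokens = ["__en__"] + sp.encode("How are you?", out_type=str)

# The target language token is passed as a decoding prefix, which forces
# the first generated token and therefore the output language.
result = translator.translate_batch([tokens], target_prefix=[["__ja__"]])
target_tokens = result[0].hypotheses[0]

# Drop the "__ja__" prefix token before detokenizing.
print(sp.decode(target_tokens[1:]))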
Testing M2M-100 with English-to-Japanese
- Test dataset
- M2M-100 418M-parameter model
- M2M-100 1.2B-parameter model
Using M2M-100 models with a GUI
You can also use M2M-100 models in DesktopTranslator, a local cross-platform machine translation GUI. It also has stand-alone executables for Mac and Windows.