This guide provides instructions for setting up IndicTrans2 specifically for English to Indic language translation using the distilled model for faster inference.
IndicTrans2 is a state-of-the-art open-source multilingual Neural Machine Translation (NMT) model supporting all 22 scheduled Indian languages.
- Python 3 (this script uses 3.10; IndicTrans2 supports any version >= 3.7)
- Linux environment (Ubuntu/Debian recommended)
- The setup uses the distilled version of the En-Indic model
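The version and platform requirements above can be checked before running anything. The following is a minimal sketch, not part of IndicTrans2: `check_prerequisites` is a hypothetical helper name, and the version bounds are the ones stated in the list above.

```python
import sys

MIN_VERSION = (3, 7)    # minimum supported by IndicTrans2, per the list above
USED_VERSION = (3, 10)  # version the setup script targets


def check_prerequisites(version=None, platform=None):
    """Return a list of warnings; an empty list means nothing looks wrong."""
    version = tuple(version or sys.version_info[:2])
    platform = platform or sys.platform
    warnings = []
    if version < MIN_VERSION:
        warnings.append(f"Python {version[0]}.{version[1]} is older than 3.7")
    elif version != USED_VERSION:
        warnings.append("the setup script targets Python 3.10; other versions are untested here")
    if not platform.startswith("linux"):
        warnings.append(f"platform '{platform}' is not Linux")
    return warnings


if __name__ == "__main__":
    for warning in check_prerequisites():
        print("warning:", warning)
```

Called with no arguments, it inspects the running interpreter; passing explicit values is only useful for testing.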
- Download and run the setup script:
git clone https://gist.github.com/Kishlay-notabot/a04e62a611b25bda413d284abbaaa254
cd a04e62a611b25bda413d284abbaaa254
chmod +x setup.sh
./setup.sh
The script will:
- Install system dependencies (build-essential, python3.10-dev, parallel)
- Set up a Python virtual environment
- Clone IndicTrans2
- Install Python dependencies including sentencepiece
- Download and set up the distilled En-Indic model
from inference.engine import Model
import os
import nltk
nltk.download('punkt_tab')
# Get the full path to the model directory
current_dir = os.getcwd()
ckpt_dir = os.path.join(current_dir, "fairseq_model")
# Initialize the model
model = Model(ckpt_dir, model_type="fairseq")
# Sample sentences
sample_sentences = [
    "Welcome to my house.",
    "How are you doing today?",
    "The weather is beautiful.",
]
# Translate batch of sentences
translations = model.batch_translate(
    sample_sentences,
    src_lang="eng_Latn",  # English in Latin script
    tgt_lang="hin_Deva",  # Hindi in Devanagari script
)
# Print the results
print("\nBatch Translation Results:")
for src, tgt in zip(sample_sentences, translations):
    print(f"English: {src}")
    print(f"Hindi : {tgt}")
    print()
# Sample paragraph
sample_paragraph = """
Welcome to India. This is a beautiful country with rich culture and heritage.
The people here are very friendly and helpful. You will enjoy your stay here.
"""
# Translate paragraph
paragraph_translation = model.translate_paragraph(
    sample_paragraph,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)
print("\nParagraph Translation:")
print("English:", sample_paragraph)
print("Hindi :", paragraph_translation)
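CTranslate2 (CT2) inference is much faster than fairseq inference. To use it, point the model at the ct2_fp16_model subdirectory and set model_type to "ctranslate2":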
from inference.engine import Model
import os
import nltk
nltk.download('punkt_tab')
# Get the full path to the model directory
current_dir = os.getcwd()
ckpt_dir = os.path.join(current_dir, "fairseq_model/ct2_fp16_model")
# Initialize the model with ctranslate2 type and specify device as CPU
model = Model(ckpt_dir, model_type="ctranslate2", device="cpu")
# Sample sentences
sample_sentences = [
    "Welcome to my house.",
    "How are you doing today?",
    "The weather is beautiful.",
]
# Translate batch of sentences
translations = model.batch_translate(
    sample_sentences,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)
# Print the results
print("\nBatch Translation Results:")
for src, tgt in zip(sample_sentences, translations):
    print(f"English: {src}")
    print(f"Hindi : {tgt}")
    print()
# Sample paragraph
sample_paragraph = """
Welcome to India. This is a beautiful country with rich culture and heritage.
The people here are very friendly and helpful. You will enjoy your stay here.
"""
# Translate paragraph
paragraph_translation = model.translate_paragraph(
    sample_paragraph,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
)
print("\nParagraph Translation:")
print("English:", sample_paragraph)
print("Hindi :", paragraph_translation)
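To see the speed difference between the two backends on your own machine, you can time the same call against each model. The sketch below is an illustration only: `time_call` and `fake_translate` are hypothetical names, and `fake_translate` is a stand-in so the snippet runs without a model loaded.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in for model.batch_translate so the sketch runs on its own.
def fake_translate(sentences, src_lang, tgt_lang):
    return [s.upper() for s in sentences]


out, seconds = time_call(
    fake_translate, ["hello"], src_lang="eng_Latn", tgt_lang="hin_Deva"
)
print(f"{len(out)} sentence(s), best of 3 runs: {seconds:.6f}s")
```

In practice, replace `fake_translate` with `model.batch_translate` from the fairseq model and then from the CT2 model, using the same sentences for both.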
Create a sample input file (nano.txt):
hi this is sentence one
hi this is sentence two
hi this is a happy sentence!
Run translation:
bash -x joint_translate.sh nano.txt out.txt eng_Latn hin_Deva "$(pwd)/fairseq_model"
The translated output will be saved in out.txt:
नमस्ते यह एक वाक्य है
हाय यह वाक्य दो है
नमस्ते, यह एक सुखद वाक्य है!
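If you generate the input file from Python, it helps to write exactly one sentence per line and to pair the lines back up afterwards. This is a minimal sketch with hypothetical helper names (`write_input_file`, `read_pairs`). It assumes the output file has one line per input line, as in the example above; the demo writes the sample output by hand instead of running joint_translate.sh.

```python
import tempfile
from pathlib import Path


def write_input_file(path, sentences):
    """Write one stripped sentence per line, skipping blanks; return the line count."""
    lines = [s.strip() for s in sentences if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def read_pairs(src_path, out_path):
    """Pair each source line with the output line at the same position."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines()
    out = Path(out_path).read_text(encoding="utf-8").splitlines()
    if len(src) != len(out):
        raise ValueError(f"line count mismatch: {len(src)} input vs {len(out)} output")
    return list(zip(src, out))


with tempfile.TemporaryDirectory() as tmp:
    src_file = Path(tmp) / "nano.txt"
    out_file = Path(tmp) / "out.txt"
    n = write_input_file(src_file, [
        "hi this is sentence one",
        "  hi this is sentence two ",
        "",
        "hi this is a happy sentence!",
    ])
    # Stand-in for the file that joint_translate.sh would produce.
    out_file.write_text(
        "नमस्ते यह एक वाक्य है\nहाय यह वाक्य दो है\nनमस्ते, यह एक सुखद वाक्य है!\n",
        encoding="utf-8",
    )
    pairs = read_pairs(src_file, out_file)

for english, hindi in pairs:
    print(f"{english} -> {hindi}")
```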
IndicTrans2 uses FLORES-200 language codes for identifying languages and scripts. Each code consists of a language code and a script code (e.g., hin_Deva for Hindi in Devanagari script).
Common examples:
- English: eng_Latn (Latin script)
- Hindi: hin_Deva (Devanagari script)
- Bengali: ben_Beng (Bengali script)
- Tamil: tam_Taml (Tamil script)
- Urdu: urd_Arab (Perso-Arabic script)
For a complete list of supported languages and their codes, please refer to the official IndicTrans2 repository.
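A typo in a language code is easy to make, so a small shape check before calling the model can save a failed run. The sketch below is not part of IndicTrans2: `split_code` is a hypothetical helper, and the shape rule (three lowercase letters, an underscore, a four-letter capitalized script tag) is an assumption based on the codes listed above. It does not verify that a code is actually supported.

```python
# Only the codes listed above; the full list is in the official repository.
KNOWN_CODES = {
    "eng_Latn": "English (Latin script)",
    "hin_Deva": "Hindi (Devanagari script)",
    "ben_Beng": "Bengali (Bengali script)",
    "tam_Taml": "Tamil (Tamil script)",
    "urd_Arab": "Urdu (Perso-Arabic script)",
}


def split_code(code):
    """Split 'hin_Deva' into ('hin', 'Deva'); raise ValueError if the shape is wrong."""
    lang, sep, script = code.partition("_")
    well_formed = (
        sep == "_"
        and len(lang) == 3 and lang.isalpha() and lang.islower()
        and len(script) == 4 and script.isalpha() and script.istitle()
    )
    if not well_formed:
        raise ValueError(f"not a language_Script code: {code!r}")
    return lang, script


for code, name in KNOWN_CODES.items():
    lang, script = split_code(code)
    print(f"{code}: language={lang}, script={script} ({name})")
```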
- This setup uses the distilled version of the En-Indic model for faster inference
- The script specifically uses Python 3.10
- Make sure you have sufficient disk space for the model (~1GB)
- The models are downloaded into the fairseq_model directory within the IndicTrans2 folder
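The disk space requirement can be checked up front. This is a minimal sketch using the standard library; `has_enough_space` is a hypothetical helper, and the 1 GiB threshold is taken from the approximate figure in the notes above.

```python
import shutil

REQUIRED_BYTES = 1 * 1024**3  # roughly 1GB, per the note above


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return (ok, free_bytes) for the filesystem that contains `path`."""
    free = shutil.disk_usage(path).free
    return free >= required, free


ok, free = has_enough_space(".")
print(f"free: {free / 1024**3:.1f} GiB, enough for the model: {ok}")
```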
If you encounter any issues:
- Ensure all dependencies are properly installed
- Check if the model files are properly downloaded in the fairseq_model directory
- Verify you're using Python 3.10
- Make sure you're in the correct directory when running the commands
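The directory checks above can be automated. The sketch below uses a hypothetical helper, `missing_paths`, and assumes the layout implied by the commands in this guide: `inference`, `joint_translate.sh`, and `fairseq_model` all sit in the directory you run from. The demo runs against a temporary directory, not a real checkout.

```python
import tempfile
from pathlib import Path

# Paths this guide's commands rely on, relative to the working directory.
EXPECTED = [
    "inference",
    "joint_translate.sh",
    "fairseq_model",
    "fairseq_model/ct2_fp16_model",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


# Demo on a throwaway directory that only contains fairseq_model.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fairseq_model").mkdir()
    demo_missing = missing_paths(tmp)

print("missing in demo directory:", demo_missing)
```

Running `missing_paths()` from the IndicTrans2 folder should return an empty list if the setup completed and you are in the right directory.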
This setup uses the distilled version of the En-Indic model from IndicTrans2, which provides:
- Faster inference compared to the full model
- Optimization for English to Indic language translation
- Smaller model size while maintaining good translation quality
For more information about IndicTrans2, including the full model capabilities, training details, and research papers, visit the official repository.
Note: set -e exits my WSL instance; I don't know why.