@Kishlay-notabot
Last active November 5, 2024 14:54
Set up IndicTrans2 for local inference and translation across 22 Indian languages

IndicTrans2 Setup Guide

This guide provides instructions for setting up IndicTrans2 specifically for English to Indic language translation using the distilled model for faster inference.

About IndicTrans2

IndicTrans2 is a state-of-the-art open-source multilingual Neural Machine Translation (NMT) model supporting all 22 scheduled Indian languages.

Requirements

  • Python 3 (this script uses 3.10; IndicTrans2 supports any version >= 3.7)
  • Linux environment (Ubuntu/Debian recommended)
  • The setup uses the distilled version of the En-Indic model

Installation

  1. Download and run the setup script:
git clone https://gist.github.com/Kishlay-notabot/a04e62a611b25bda413d284abbaaa254
chmod +x setup.sh
./setup.sh

The script will:

  • Install system dependencies (build-essential, python3.10-dev, parallel)
  • Set up a Python virtual environment
  • Clone IndicTrans2
  • Install Python dependencies including sentencepiece
  • Download and set up the distilled En-Indic model
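After the script finishes, a quick sanity check can confirm the expected layout. This is a minimal sketch, not part of IndicTrans2 itself; the directory names are inferred from the setup script below and may differ on your machine.

```python
from pathlib import Path

def check_setup(base="IndicTrans2"):
    """Report whether the directories the setup script creates are present.

    Paths are assumptions based on the setup script in this gist.
    """
    expected = [Path(base), Path(base) / "fairseq_model"]
    return {str(p): p.exists() for p in expected}

for path, exists in check_setup().items():
    print(f"{path}: {'found' if exists else 'missing'}")
```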

Usage Examples

Using Python Interface

from inference.engine import Model
import os
import nltk
nltk.download('punkt_tab')

# Get the full path to the model directory
current_dir = os.getcwd()
ckpt_dir = os.path.join(current_dir, "fairseq_model")

# Initialize the model
model = Model(ckpt_dir, model_type="fairseq")

# Sample sentences
sample_sentences = [
    "Welcome to my house.",
    "How are you doing today?",
    "The weather is beautiful."
]

# Translate batch of sentences
translations = model.batch_translate(
    sample_sentences,
    src_lang="eng_Latn",  # English in Latin script
    tgt_lang="hin_Deva"   # Hindi in Devanagari script
)

# Print the results
print("\nBatch Translation Results:")
for src, tgt in zip(sample_sentences, translations):
    print(f"English: {src}")
    print(f"Hindi  : {tgt}")
    print()

# Sample paragraph
sample_paragraph = """
Welcome to India. This is a beautiful country with rich culture and heritage.
The people here are very friendly and helpful. You will enjoy your stay here.
"""

# Translate paragraph
paragraph_translation = model.translate_paragraph(
    sample_paragraph,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva"
)

print("\nParagraph Translation:")
print("English:", sample_paragraph)
print("Hindi  :", paragraph_translation)

CTranslate2 (CT2) inference is considerably faster than fairseq inference; the same interface works, pointed at the converted CT2 model:

from inference.engine import Model
import os
import nltk
nltk.download('punkt_tab')

# Get the full path to the model directory
current_dir = os.getcwd()
ckpt_dir = os.path.join(current_dir, "fairseq_model/ct2_fp16_model")

# Initialize the model with ctranslate2 type and specify device as CPU
model = Model(ckpt_dir, model_type="ctranslate2", device="cpu")

# Sample sentences
sample_sentences = [
    "Welcome to my house.",
    "How are you doing today?",
    "The weather is beautiful."
]

# Translate batch of sentences
translations = model.batch_translate(
    sample_sentences,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva"
)

# Print the results
print("\nBatch Translation Results:")
for src, tgt in zip(sample_sentences, translations):
    print(f"English: {src}")
    print(f"Hindi  : {tgt}")
    print()

# Sample paragraph
sample_paragraph = """
Welcome to India. This is a beautiful country with rich culture and heritage.
The people here are very friendly and helpful. You will enjoy your stay here.
"""

# Translate paragraph
paragraph_translation = model.translate_paragraph(
    sample_paragraph,
    src_lang="eng_Latn",
    tgt_lang="hin_Deva"
)

print("\nParagraph Translation:")
print("English:", sample_paragraph)
print("Hindi  :", paragraph_translation)
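To see the speed difference for yourself, a small timing helper is enough. This is an illustrative sketch: `timed` is not part of IndicTrans2, and the commented lines assume both a fairseq and a CT2 `Model` have been loaded as in the snippets above (the variable names `fairseq_model` and `ct2_model` are hypothetical).

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# With both models loaded as in the snippets above (names are illustrative):
# _, fairseq_s = timed(fairseq_model.batch_translate, sample_sentences,
#                      src_lang="eng_Latn", tgt_lang="hin_Deva")
# _, ct2_s = timed(ct2_model.batch_translate, sample_sentences,
#                  src_lang="eng_Latn", tgt_lang="hin_Deva")
# print(f"fairseq: {fairseq_s:.2f}s, ct2: {ct2_s:.2f}s")
```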

Using Bash Interface

Create a sample input file (nano.txt):

hi this is sentence one
hi this is sentence two
hi this is a happy sentence!

Run translation:

bash -x joint_translate.sh nano.txt out.txt eng_Latn hin_Deva "$(pwd)/fairseq_model"

The translated output will be saved in out.txt:

नमस्ते यह एक वाक्य है
हाय यह वाक्य दो है
नमस्ते, यह एक सुखद वाक्य है!
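The bash interface can also be driven from Python via `subprocess`. The helper below is a sketch, not part of IndicTrans2: it only assembles the same command shown above, and the actual run (commented out) must happen from inside the cloned IndicTrans2 directory where `joint_translate.sh` lives.

```python
import subprocess
from pathlib import Path

def build_translate_cmd(infile, outfile, src_lang, tgt_lang, model_dir):
    # Mirrors the invocation shown above; joint_translate.sh ships with
    # IndicTrans2 and must be run from inside the cloned repository.
    return ["bash", "joint_translate.sh", infile, outfile,
            src_lang, tgt_lang, model_dir]

cmd = build_translate_cmd("nano.txt", "out.txt", "eng_Latn", "hin_Deva",
                          str(Path.cwd() / "fairseq_model"))
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment inside the IndicTrans2 directory
```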

Language Codes

IndicTrans2 uses FLORES-200 language codes for identifying languages and scripts. Each code consists of a language code and script code (e.g., hin_Deva for Hindi in Devanagari script).

Common examples:

  • English: eng_Latn (Latin script)
  • Hindi: hin_Deva (Devanagari script)
  • Bengali: ben_Beng (Bengali script)
  • Tamil: tam_Taml (Tamil script)
  • Urdu: urd_Arab (Perso-Arabic script)

For a complete list of supported languages and their codes, please refer to the official IndicTrans2 repository.
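Since every FLORES-200 tag follows the same `language_Script` shape, splitting one apart is straightforward. The helper below is hypothetical (not part of IndicTrans2), and the dictionary only restates the examples listed above.

```python
# Examples restated from the list above.
FLORES_EXAMPLES = {
    "eng_Latn": "English (Latin script)",
    "hin_Deva": "Hindi (Devanagari script)",
    "ben_Beng": "Bengali (Bengali script)",
    "tam_Taml": "Tamil (Tamil script)",
    "urd_Arab": "Urdu (Perso-Arabic script)",
}

def split_flores_tag(tag):
    """Split e.g. 'hin_Deva' into ('hin', 'Deva')."""
    lang, script = tag.split("_", 1)
    return lang, script

for tag, name in FLORES_EXAMPLES.items():
    lang, script = split_flores_tag(tag)
    print(f"{tag}: language={lang}, script={script}  # {name}")
```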

Important Notes

  1. This setup uses the distilled version of the En-Indic model for faster inference
  2. The script specifically uses Python 3.10
  3. Make sure you have sufficient disk space for the model (~1GB)
  4. The models are downloaded in the fairseq_model directory within the IndicTrans2 folder

Troubleshooting

If you encounter any issues:

  1. Ensure all dependencies are properly installed
  2. Check if the model files are properly downloaded in the fairseq_model directory
  3. Verify you're using Python 3.10
  4. Make sure you're in the correct directory when running the commands
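For point 3, a one-liner can verify the interpreter version before debugging anything else. A minimal sketch; the guide recommends 3.10, while IndicTrans2 itself supports 3.7 and up per the Requirements section.

```python
import sys

def python_ok(min_version=(3, 7)):
    """True if the running interpreter meets the minimum supported version."""
    return sys.version_info >= min_version

print(sys.version.split()[0], "OK" if python_ok() else "too old")
```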

Model Information

This setup uses the distilled version of the En-Indic model from IndicTrans2, which provides:

  • Faster inference compared to the full model
  • Optimized for English to Indic language translations
  • Smaller model size while maintaining good translation quality

For more information about IndicTrans2, including the full model capabilities, training details, and research papers, visit the official repository.

setup.sh

#!/bin/bash
# Setup script for IndicTrans2 (English-Indic Translation)
# This script uses Python 3.10
# Exit on error
set -e
echo "Setting up IndicTrans2..."
# Install system dependencies
echo "Installing system dependencies..."
sudo apt-get update
sudo apt-get install -y build-essential python3.10-dev python3.10-venv parallel wget unzip
# Create and activate virtual environment
echo "Creating Python virtual environment..."
python3.10 -m venv indictrans2_env
source indictrans2_env/bin/activate
# Clone repository
echo "Cloning IndicTrans2 repository..."
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2
# Install Python dependencies
echo "Installing Python requirements..."
pip install --upgrade pip
pip install sentencepiece
source install.sh
# Download and extract distilled model (En-indic)
echo "Downloading distilled model..."
mkdir -p fairseq_model
wget https://indictrans2-public.objectstore.e2enetworks.net/it2_distilled_ckpts/en-indic.zip
unzip en-indic.zip -d fairseq_model/
rm en-indic.zip
echo "
Setup completed successfully!
Usage examples:
1. For bash interface:
bash joint_translate.sh <input_file> <output_file> eng_Latn <target_lang> \"$(pwd)/fairseq_model\"
Example for English to Hindi:
bash joint_translate.sh input.txt output.txt eng_Latn hin_Deva \"$(pwd)/fairseq_model\"
2. The model is downloaded at: $(pwd)/fairseq_model
Note: This setup uses the distilled version of the En-indic model for faster inference.
"
Comment from the author (@Kishlay-notabot):
set -e exits my WSL instance, I don't know why.
