# Multilingual NMT with Knowledge Distillation using FLORES-101

## Project Overview
This project explores Multilingual Neural Machine Translation (NMT) through Knowledge Distillation using the FLORES-101 dataset for training and evaluation. The goal is to enable high-quality, bidirectional translation among:
- 5 Indian Languages: Hindi, Tamil, Telugu, Kannada, Malayalam
- 5 Global Languages: English, French, German, Spanish, Japanese
Each language is translated to and from every other language, giving 10 × 9 = 90 translation directions.
## Methodology
Teacher Model:
- NLLB (facebook/nllb-200-distilled-600M)
A strong multilingual model capable of translating between 200+ languages.
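
As an illustration, the snippet below sketches how the NLLB teacher can produce translations for a batch of source sentences with the standard Hugging Face `transformers` API. The helper function, batch contents, language codes, and generation settings are illustrative assumptions, not the project's exact script.

```python
# A minimal sketch of teacher-side translation with NLLB; the helper name,
# languages, and lengths are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def teacher_translate(sentences, tgt_lang="hin_Deva", max_length=128):
    """Translate a batch of source sentences into `tgt_lang` with the teacher."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = teacher.generate(
        **inputs,
        # NLLB expects the target-language token to be forced as the first decoded token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(teacher_translate(["The committee will meet again next week."]))
```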
Student Models:
- mBART (facebook/mbart-large-50-many-to-many-mmt)
- IndicBART (ai4bharat/indicbart)
Distillation Strategy:
The teacher model generates translations for the source sentences in every direction, and the student models are fine-tuned to reproduce these teacher outputs (sequence-level knowledge distillation). This reduces model size while largely preserving translation quality.
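
The sketch below shows this sequence-level setup for the mBART student: it is fine-tuned on (source, teacher-translation) pairs. The tiny in-memory dataset, the Hindi placeholder string, the mBART language codes, and the hyperparameters are assumptions for illustration; the actual pipeline would train on the 90 generated parallel datasets.

```python
# A minimal sequence-level distillation sketch, assuming the Hugging Face
# Seq2SeqTrainer API; data and hyperparameters below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

student_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = AutoTokenizer.from_pretrained(student_name, src_lang="en_XX", tgt_lang="hi_IN")
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)

# Toy example: one English source with a Hindi translation produced by the teacher.
pairs = Dataset.from_dict({
    "src": ["The committee will meet again next week."],
    "tgt": ["समिति अगले सप्ताह फिर बैठक करेगी।"],  # placeholder for a teacher output
})

def preprocess(batch):
    model_inputs = tokenizer(batch["src"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = pairs.map(preprocess, batched=True, remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=student,
    args=Seq2SeqTrainingArguments(output_dir="student-ckpt", num_train_epochs=1,
                                  per_device_train_batch_size=8, learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=student),
)
trainer.train()
```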
## Dataset: FLORES-101
FLORES-101 provides sentence-aligned text across 101 languages for translation evaluation. We use the devtest split to generate high-quality, consistent training pairs.
Languages & FLORES Codes:
| Language  | Code     |
|-----------|----------|
| English   | eng_Latn |
| Hindi     | hin_Deva |
| Tamil     | tam_Taml |
| Telugu    | tel_Telu |
| Kannada   | kan_Knda |
| Malayalam | mal_Mlym |
| French    | fra_Latn |
| German    | deu_Latn |
| Spanish   | spa_Latn |
| Japanese  | jpn_Jpan |
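
For reference, the sketch below loads aligned devtest sentences for one source and one target language from the Hugging Face hub. It assumes the public `facebook/flores` dataset (the FLORES-200 release, which uses the codes listed above); the project's own loading code may differ.

```python
# A hedged sketch of fetching aligned FLORES devtest sentences; the dataset name,
# configs, and "sentence" column are assumptions based on the public facebook/flores
# release rather than this project's exact loader.
from datasets import load_dataset

src = load_dataset("facebook/flores", "eng_Latn", split="devtest", trust_remote_code=True)
tgt = load_dataset("facebook/flores", "hin_Deva", split="devtest", trust_remote_code=True)

# FLORES rows are index-aligned across languages, so zipping yields parallel pairs.
for src_sent, tgt_sent in list(zip(src["sentence"], tgt["sentence"]))[:3]:
    print(src_sent, "->", tgt_sent)
```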
Data Generation
All possible bidirectional pairs (e.g., en→ta, ta→en, ta→hi, hi→ta) were created, resulting in 90 parallel datasets.
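
The directed pairs can be enumerated directly from the language list, as in this small sketch (variable names are illustrative):

```python
# Enumerate all directed (source, target) combinations of the 10 languages;
# permutations of length 2 give 10 * 9 = 90 translation directions.
from itertools import permutations

langs = ["eng_Latn", "hin_Deva", "tam_Taml", "tel_Telu", "kan_Knda",
         "mal_Mlym", "fra_Latn", "deu_Latn", "spa_Latn", "jpn_Jpan"]

pairs = list(permutations(langs, 2))
print(len(pairs))  # 90
```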