# Multilingual NMT with Knowledge Distillation using FLORES-101

## Project Overview
This project explores Multilingual Neural Machine Translation (NMT) through Knowledge Distillation using the FLORES-101 dataset for training and evaluation. The goal is to enable high-quality, bidirectional translation among:
- 5 Indian Languages: Hindi, Tamil, Telugu, Kannada, Malayalam
- 5 Global Languages: English, French, German, Spanish, Japanese
Each language is translated to and from every other language, giving 10 × 9 = 90 translation directions.
## Methodology
Teacher Model:
- NLLB (facebook/nllb-200-distilled-600M)
A strong multilingual model capable of translating between 200+ languages.
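
As an illustration, the snippet below sketches how the NLLB teacher can produce translations for a batch of source sentences with the standard Hugging Face `transformers` API. The helper function, batch contents, language codes, and generation settings are illustrative assumptions, not the project's exact script.

```python
# A minimal sketch of teacher-side translation with NLLB; the helper name,
# languages, and lengths are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def teacher_translate(sentences, tgt_lang="hin_Deva", max_length=128):
    """Translate a batch of source sentences into `tgt_lang` with the teacher."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = teacher.generate(
        **inputs,
        # NLLB expects the target-language token to be forced as the first decoded token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(teacher_translate(["The committee will meet again next week."]))
```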
Student Models:
- mBART (facebook/mbart-large-50-many-to-many-mmt)
- IndicBART (ai4bharat/indicbart)
Distillation Strategy:
The teacher model generates translations for the source sentences in every direction, and the student models are fine-tuned to reproduce these teacher outputs (sequence-level knowledge distillation). This reduces model size while largely preserving translation quality.
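
The sketch below shows this sequence-level setup for the mBART student: it is fine-tuned on (source, teacher-translation) pairs. The tiny in-memory dataset, the Hindi placeholder string, the mBART language codes, and the hyperparameters are assumptions for illustration; the actual pipeline would train on the 90 generated parallel datasets.

```python
# A minimal sequence-level distillation sketch, assuming the Hugging Face
# Seq2SeqTrainer API; data and hyperparameters below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

student_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = AutoTokenizer.from_pretrained(student_name, src_lang="en_XX", tgt_lang="hi_IN")
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)

# Toy example: one English source with a Hindi translation produced by the teacher.
pairs = Dataset.from_dict({
    "src": ["The committee will meet again next week."],
    "tgt": ["समिति अगले सप्ताह फिर बैठक करेगी।"],  # placeholder for a teacher output
})

def preprocess(batch):
    model_inputs = tokenizer(batch["src"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = pairs.map(preprocess, batched=True, remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=student,
    args=Seq2SeqTrainingArguments(output_dir="student-ckpt", num_train_epochs=1,
                                  per_device_train_batch_size=8, learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=student),
)
trainer.train()
```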
## Dataset: FLORES-101
FLORES-101 provides sentence-aligned text across 101 languages for translation evaluation. We use the devtest split to generate high-quality, consistent training pairs.
Languages & FLORES Codes:
| Language  | Code     |
|-----------|----------|
| English   | eng_Latn |
| Hindi     | hin_Deva |
| Tamil     | tam_Taml |
| Telugu    | tel_Telu |
| Kannada   | kan_Knda |
| Malayalam | mal_Mlym |
| French    | fra_Latn |
| German    | deu_Latn |
| Spanish   | spa_Latn |
| Japanese  | jpn_Jpan |
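
For reference, the sketch below loads aligned devtest sentences for one source and one target language from the Hugging Face hub. It assumes the public `facebook/flores` dataset (the FLORES-200 release, which uses the codes listed above); the project's own loading code may differ.

```python
# A hedged sketch of fetching aligned FLORES devtest sentences; the dataset name,
# configs, and "sentence" column are assumptions based on the public facebook/flores
# release rather than this project's exact loader.
from datasets import load_dataset

src = load_dataset("facebook/flores", "eng_Latn", split="devtest", trust_remote_code=True)
tgt = load_dataset("facebook/flores", "hin_Deva", split="devtest", trust_remote_code=True)

# FLORES rows are index-aligned across languages, so zipping yields parallel pairs.
for src_sent, tgt_sent in list(zip(src["sentence"], tgt["sentence"]))[:3]:
    print(src_sent, "->", tgt_sent)
```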
Data Generation
All possible bidirectional pairs (e.g., en→ta, ta→en, ta→hi, hi→ta) were created, resulting in 90 parallel datasets.
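
The directed pairs can be enumerated directly from the language list, as in this small sketch (variable names are illustrative):

```python
# Enumerate all directed (source, target) combinations of the 10 languages;
# permutations of length 2 give 10 * 9 = 90 translation directions.
from itertools import permutations

langs = ["eng_Latn", "hin_Deva", "tam_Taml", "tel_Telu", "kan_Knda",
         "mal_Mlym", "fra_Latn", "deu_Latn", "spa_Latn", "jpn_Jpan"]

pairs = list(permutations(langs, 2))
print(len(pairs))  # 90
```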