🌍 Multilingual NMT with Knowledge Distillation using FLORES-101

🚀 Project Overview

This project explores Multilingual Neural Machine Translation (NMT) through Knowledge Distillation using the FLORES-101 dataset for training and evaluation. The goal is to enable high-quality, bidirectional translation among:

  • 5 Indian Languages: Hindi, Tamil, Telugu, Kannada, Malayalam
  • 5 Global Languages: English, French, German, Spanish, Japanese

Each language is translated to and from every other, yielding 90 translation directions (10 languages × 9 counterparts, in both directions).


🧠 Methodology

Teacher Model:

  • NLLB (facebook/nllb-200-distilled-600M)
    A strong multilingual model capable of translating between 200+ languages.

Student Models:

  • mBART (facebook/mbart-large-50-many-to-many-mmt)
  • IndicBART (ai4bharat/indicbart)

Distillation Strategy:

The teacher model generates translations for every source sentence in each translation direction, and the student models are trained to mimic these outputs. This reduces model size while maintaining translation quality.
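As an illustration of the teacher-generation step, the sketch below produces one NLLB translation via the Hugging Face transformers API; the function name, generation settings, and example sentence are assumptions for illustration, not the project's actual pipeline.

```python
# Minimal sketch (illustrative, not the project's actual code): the NLLB
# teacher translates a source sentence, and the resulting (source, teacher
# output) pair becomes training data for the student models.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(teacher_name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

def teacher_translate(sentence: str, tgt_lang: str = "tam_Taml") -> str:
    # Force the decoder to start with the target-language code token.
    inputs = tokenizer(sentence, return_tensors="pt")
    generated = teacher.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(teacher_translate("The committee will meet again next week."))
```

The students (mBART, IndicBART) are then fine-tuned on these source/teacher-output pairs, which is commonly referred to as sequence-level knowledge distillation.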


📘 Dataset: FLORES-101

FLORES-101 provides aligned sentences across 101 languages for translation evaluation. We use the devtest split to generate high-quality, consistent training pairs.

Languages & FLORES Codes:

| Language  | FLORES Code |
|-----------|-------------|
| English   | eng_Latn    |
| Hindi     | hin_Deva    |
| Tamil     | tam_Taml    |
| Telugu    | tel_Telu    |
| Kannada   | kan_Knda    |
| Malayalam | mal_Mlym    |
| French    | fra_Latn    |
| German    | deu_Latn    |
| Spanish   | spa_Latn    |
| Japanese  | jpn_Jpan    |
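As a hedged sketch of how the aligned devtest sentences can be pulled: this assumes the facebook/flores dataset repository on the Hugging Face Hub (whose configs and column names follow the codes above) and a datasets version that can still load it; it is not necessarily the data pipeline used for this model.

```python
# Sketch (assumption): load aligned devtest sentences for one direction from
# the facebook/flores dataset repo; column names follow the codes above.
from datasets import load_dataset

pair = load_dataset("facebook/flores", "eng_Latn-hin_Deva", split="devtest")
for row in pair.select(range(3)):
    print(row["sentence_eng_Latn"], "->", row["sentence_hin_Deva"])
```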

Data Generation

All possible bidirectional pairs (e.g., en→ta, ta→en, ta→hi, hi→ta) were created, resulting in 90 parallel datasets.
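Enumerating the 90 directions is simple combinatorics; a small sketch using the FLORES codes listed above (variable names are illustrative):

```python
# Sketch: enumerate all 90 ordered (source, target) directions over the
# 10 languages, using the FLORES codes from the table above.
from itertools import permutations

flores_codes = [
    "eng_Latn", "hin_Deva", "tam_Taml", "tel_Telu", "kan_Knda",
    "mal_Mlym", "fra_Latn", "deu_Latn", "spa_Latn", "jpn_Jpan",
]

directed_pairs = list(permutations(flores_codes, 2))  # 10 * 9 = 90
print(len(directed_pairs))   # 90
print(directed_pairs[:2])    # [('eng_Latn', 'hin_Deva'), ('eng_Latn', 'tam_Taml')]
```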

