---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
- translation
- mbart50
- english
- telugu
- hackhedron
- neural-machine-translation
- huggingface
license: apache-2.0
datasets:
- hackhedron
metrics:
- sacrebleu
language:
- en
- te
model-index:
- name: mbart50-en-te-hackhedron
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: HackHedron English-Telugu Parallel Corpus
      type: hackhedron
      args: en-te
    metrics:
    - name: SacreBLEU
      type: sacrebleu
      value: 66.9240
---
|
# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets). It supports bidirectional translation between **English ↔ Telugu**.

## 🧠 Model Architecture

- **Base model**: mBART50 (multilingual BART covering 50 languages)
- **Type**: Seq2Seq Transformer
- **Tokenizer**: `MBart50TokenizerFast`
- **Languages used**:
  - `en_XX` for English
  - `te_IN` for Telugu

---
|
|
|
## 📚 Dataset

**HackHedron English-Telugu Parallel Corpus**

- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format (one JSON record per sentence pair):

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
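If the corpus is published on the 🤗 Hub, a minimal preprocessing sketch for this record format might look as follows. The dataset id `hackhedron` is a placeholder; substitute the actual repository path. The 128-token truncation matches the Training Details section below.

```python
from datasets import load_dataset
from transformers import MBart50TokenizerFast

# Placeholder dataset id; substitute the actual HackHedron repo path.
dataset = load_dataset("hackhedron", split="train")

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="te_IN",
)

def preprocess(batch):
    # Tokenize English sources and Telugu targets in one call;
    # `text_target` is tokenized using the tgt_lang set above.
    return tokenizer(
        batch["english"],
        text_target=batch["telugu"],
        max_length=128,  # truncation length used during training
        truncation=True,
    )

tokenized_train = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```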
|
|
|
---

## 📈 Evaluation

| Metric    | Score  | Loss   |
| --------- | ------ | ------ |
| SacreBLEU | 66.924 | 0.0511 |

> 🧪 Evaluation was performed with the Hugging Face `evaluate` library on the validation set.
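For reference, a minimal sketch of computing SacreBLEU with `evaluate`; the prediction and reference lists below are illustrative placeholders, not the actual validation data:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholder examples; in practice, predictions come from
# model.generate() over the validation split.
predictions = ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]
references = [["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(f"SacreBLEU: {result['score']:.4f}")
```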
|
---

## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target languages (English -> Telugu)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
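Because the model is bidirectional, translating Telugu → English only requires swapping the language codes. Continuing from the snippet above (the Telugu input is an illustrative example meaning "How are you?"):

```python
# Telugu -> English: swap source and target language codes
tokenizer.src_lang = "te_IN"
tokenizer.tgt_lang = "en_XX"

text = "మీరు ఎలా ఉన్నారు?"  # "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the English language token
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```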
|
|
|
---

## 📦 How to Fine-Tune Further

Use `Seq2SeqTrainer` from Hugging Face Transformers:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation so the decoder produces Telugu; a minimal training setup is sketched below.
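A minimal training sketch, assuming `model` and `tokenizer` are loaded as in the usage example and that `tokenized_train` / `tokenized_eval` are placeholder names for your preprocessed splits; the hyperparameters mirror the Training Details section below:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    seed=42,
    fp16=True,
    lr_scheduler_type="linear",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # placeholder: preprocessed train split
    eval_dataset=tokenized_eval,    # placeholder: preprocessed validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```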
|
|
|
---

## 🛠️ Training Details

* Optimizer: AdamW
* Learning Rate: 2e-05
* Epochs: 1
* train_batch_size: 8
* eval_batch_size: 8
* seed: 42
* Truncation Length: 128 tokens
* Framework: 🤗 Transformers + Datasets
* Scheduler: Linear
* Mixed Precision: Enabled (fp16)
|
|
|
---

### Training results

| Training Loss | Epoch | Step  | Validation Loss | SacreBLEU |
|:-------------:|:-----:|:-----:|:---------------:|:---------:|
| 0.0455        | 1.0   | 48808 | 0.0511          | 66.9240   |
|
|
|
---

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1
|
|
|
---

## 🏷️ License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
---

## 🤝 Acknowledgements

* 🤗 Hugging Face Transformers
* Facebook AI for mBART50
* HackHedron Parallel Corpus contributors

---

> Created by **Koushik Reddy** – [Hugging Face Profile](https://huggingface.co/Koushim)