---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
- translation
- mbart50
- english
- telugu
- hackhedron
- neural-machine-translation
- huggingface
license: apache-2.0
datasets:
- hackhedron
metrics:
- sacrebleu
language:
- en
- te
model-index:
- name: mbart50-en-te-hackhedron
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: HackHedron English-Telugu Parallel Corpus
      type: hackhedron
      args: en-te
    metrics:
    - name: SacreBLEU
      type: sacrebleu
      value: 66.9240
---
# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset
This model is fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets). It supports bidirectional translation between **English ↔ Telugu**.
## 🧠 Model Architecture
- **Base model**: mBART50 (Multilingual BART with 50 languages)
- **Type**: Seq2Seq Transformer
- **Tokenizer**: MBart50TokenizerFast
- **Languages Used**:
- `en_XX` for English
- `te_IN` for Telugu
---
## 📚 Dataset
**HackHedron English-Telugu Parallel Corpus**
- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format:
```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
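The snippet below is a minimal sketch of how a record in this format might be tokenized for mBART50. The field names come from the example above and `max_length=128` matches the truncation length listed under Training Details; it is illustrative, not the exact preprocessing pipeline used for this checkpoint.

```python
from transformers import MBart50TokenizerFast

# Load the base tokenizer with the English/Telugu language codes.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="te_IN",
)

example = {
    "english": "Tom started his car and drove away.",
    "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు.",
}

# `text_target` tokenizes the Telugu side with the target language code,
# producing `labels` alongside `input_ids` and `attention_mask`.
batch = tokenizer(
    example["english"],
    text_target=example["telugu"],
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["labels"].shape)
```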
---
## 📈 Evaluation
| Metric    | Score  | Validation Loss |
| --------- | ------ | --------------- |
| SacreBLEU | 66.924 | 0.0511          |
> 🧪 Evaluation was done with the Hugging Face `evaluate` library on the validation set.
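As an illustration, a SacreBLEU score can be computed with `evaluate` along these lines. The prediction and reference strings below are placeholders; in practice they come from decoding `model.generate()` outputs over the validation set.

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholder decoded outputs and gold references (illustrative only).
predictions = ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]
references = [["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 4))
```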
---
## 💻 How to Use
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
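Because the base checkpoint is many-to-many and the model supports both directions, Telugu → English only needs the language codes swapped. A minimal sketch continuing from the snippet above (the sample sentence is illustrative):

```python
# Telugu -> English: swap the source language and force an English BOS token.
tokenizer.src_lang = "te_IN"

text_te = "మీరు ఎలా ఉన్నారు?"
inputs = tokenizer(text_te, return_tensors="pt")

generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```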
---
## 📦 How to Fine-Tune Further
Use the `Seq2SeqTrainer` from Hugging Face:
```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```
Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation so that decoding starts with the Telugu language token, as shown in the sketch below.
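Below is a minimal, hedged sketch of how the trainer could be wired up. The toy one-pair dataset and `output_dir` are illustrative only; the hyperparameters mirror the Training Details section rather than a prescribed recipe.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "koushik-reddy/mbart50-en-te-hackhedron", src_lang="en_XX", tgt_lang="te_IN"
)

# Generation during evaluation should start with the Telugu language token.
model.generation_config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

# Toy one-pair dataset in the corpus format; replace with the full parallel corpus.
raw = Dataset.from_dict({
    "english": ["Tom started his car and drove away."],
    "telugu": ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."],
})

def preprocess(batch):
    return tokenizer(batch["english"], text_target=batch["telugu"],
                     max_length=128, truncation=True)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",   # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    eval_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```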
---
## 🛠️ Training Details
* Optimizer: AdamW
* Learning rate: 2e-05
* Epochs: 1
* Train batch size: 8
* Eval batch size: 8
* Seed: 42
* Truncation length: 128 tokens
* Framework: 🤗 Transformers + Datasets
* Scheduler: Linear
* Mixed precision: fp16 (enabled)
---
### Training results
| Training Loss | Epoch | Step | Validation Loss | Bleu |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|
| 0.0455 | 1.0 | 48808 | 0.0511 | 66.9240 |
---
### Framework versions
- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1
---
## 🏷️ License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
---
## 🤝 Acknowledgements
* 🤗 Hugging Face Transformers
* Facebook AI for mBART50
* HackHedron Parallel Corpus Contributors
---
> Created by **Koushik Reddy** – [Hugging Face Profile](https://huggingface.co/Koushim)