---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
  - translation
  - mbart50
  - english
  - telugu
  - hackhedron
  - neural-machine-translation
  - huggingface
license: apache-2.0
datasets:
  - hackhedron
metrics:
  - sacrebleu
model-index:
  - name: mbart50-en-te-hackhedron
    language:
      - en
      - te
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: HackHedron English-Telugu Parallel Corpus
          type: hackhedron
          args: en-te
        metrics:
          - name: SacreBLEU
            type: sacrebleu
            value: 66.9240  
---
# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets) and supports bidirectional **English ↔ Telugu** translation.

## 🧠 Model Architecture

- **Base model**: mBART50 (Multilingual BART with 50 languages)
- **Type**: Seq2Seq Transformer
- **Tokenizer**: MBart50TokenizerFast
- **Languages Used**:
  - `en_XX` for English
  - `te_IN` for Telugu

---

## 📚 Dataset

**HackHedron English-Telugu Parallel Corpus**  
- ~390,000 training sentence pairs  
- ~43,000 validation pairs  
- Format:
```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
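
If you have the corpus as a local JSON Lines file in this format, it can be loaded with 🤗 Datasets. This is a sketch; the file names below are placeholders for your local copies:

```python
from datasets import load_dataset

# Placeholder file names: point these at your local copies of the corpus
dataset = load_dataset(
    "json",
    data_files={
        "train": "hackhedron_train.jsonl",
        "validation": "hackhedron_validation.jsonl",
    },
)
print(dataset["train"][0])  # {'english': '...', 'telugu': '...'}
```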

---

## 📈 Evaluation

| Metric    | Score  | Validation Loss |
| --------- | ------ | --------------- |
| SacreBLEU | 66.924 | 0.0511          |

> 🧪 Evaluation was done with the Hugging Face `evaluate` library on the validation set.
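
For reference, a minimal sketch of computing SacreBLEU with `evaluate` (the sentence pair below is illustrative only; in practice `predictions` would come from `model.generate` over the validation set):

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# One list of reference translations per prediction
predictions = ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]
references = [["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])
```
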
---

## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target language codes
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
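
Because the model is bidirectional, Telugu → English works the same way with the language codes swapped. A sketch reusing the `model` and `tokenizer` loaded above:

```python
# Telugu → English: swap the source language and the forced BOS token
tokenizer.src_lang = "te_IN"

text_te = "మీరు ఎలా ఉన్నారు?"  # "How are you?"
inputs = tokenizer(text_te, return_tensors="pt")
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```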

---

## 📦 How to Fine-Tune Further

Use the `Seq2SeqTrainer` from Hugging Face:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation so the decoder starts with the Telugu language token. A minimal sketch of the wiring follows.
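
This sketch assumes `dataset` holds train/validation splits in the JSON format shown under Dataset; the field names `english`/`telugu` come from that format, and `output_dir` is a placeholder:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

# Ensure generation starts with the Telugu language token during evaluation
model.config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

def preprocess(batch):
    # Tokenize source and target with the language codes set above
    return tokenizer(
        batch["english"],
        text_target=batch["telugu"],
        max_length=128,
        truncation=True,
    )

train_ds = dataset["train"].map(preprocess, batched=True, remove_columns=dataset["train"].column_names)
eval_ds = dataset["validation"].map(preprocess, batched=True, remove_columns=dataset["validation"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mbart50-en-te-finetuned",  # placeholder
        predict_with_generate=True,
        fp16=True,
    ),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```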

---

## 🛠️ Training Details

* Optimizer: AdamW
* Learning rate: 2e-05
* LR scheduler: linear
* Epochs: 1
* Train batch size: 8
* Eval batch size: 8
* Seed: 42
* Max length (truncation): 128 tokens
* Mixed precision: fp16
* Framework: 🤗 Transformers + Datasets
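
The hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows (a sketch; `output_dir` is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-hackhedron",  # placeholder
    learning_rate=2e-5,       # AdamW is the Trainer's default optimizer
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    fp16=True,
)
```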

---

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Bleu    |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|
| 0.0455        | 1.0   | 48808 | 0.0511          | 66.9240 |

---

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1

---

## 🏷️ License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🤝 Acknowledgements

* 🤗 Hugging Face Transformers
* Facebook AI for mBART50
* HackHedron Parallel Corpus Contributors

---

> Created by **Koushik Reddy** – [Hugging Face Profile](https://huggingface.co/Koushim)