---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
- translation
- mbart50
- english
- telugu
- hackhedron
- neural-machine-translation
- huggingface
license: apache-2.0
datasets:
- hackhedron
metrics:
- sacrebleu
language:
- en
- te
model-index:
- name: mbart50-en-te-hackhedron
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: HackHedron English-Telugu Parallel Corpus
      type: hackhedron
      args: en-te
    metrics:
    - name: SacreBLEU
      type: sacrebleu
      value: 66.9240
---
|
# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets). It supports bidirectional translation between **English ↔ Telugu**.

## 🧠 Model Architecture

- **Base model**: mBART50 (multilingual BART covering 50 languages)
- **Type**: Seq2Seq Transformer
- **Tokenizer**: `MBart50TokenizerFast`
- **Languages used**:
  - `en_XX` for English
  - `te_IN` for Telugu

---
|
|
|
## 📚 Dataset

**HackHedron English-Telugu Parallel Corpus**

- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format (one JSON record per sentence pair):

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```
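If the corpus is published on the 🤗 Hub, a minimal preprocessing sketch for this record format might look as follows. The dataset id `hackhedron` is a placeholder; substitute the actual repository path. The 128-token truncation matches the Training Details section below.

```python
from datasets import load_dataset
from transformers import MBart50TokenizerFast

# Placeholder dataset id; substitute the actual HackHedron repo path.
dataset = load_dataset("hackhedron", split="train")

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",
    tgt_lang="te_IN",
)

def preprocess(batch):
    # Tokenize English sources and Telugu targets in one call;
    # `text_target` is tokenized using the tgt_lang set above.
    return tokenizer(
        batch["english"],
        text_target=batch["telugu"],
        max_length=128,  # truncation length used during training
        truncation=True,
    )

tokenized_train = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```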
|
|
|
---

## 📈 Evaluation

| Metric    | Score  | Loss   |
| --------- | ------ | ------ |
| SacreBLEU | 66.924 | 0.0511 |

> 🧪 Evaluation was performed with the Hugging Face `evaluate` library on the validation set.
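For reference, a minimal sketch of computing SacreBLEU with `evaluate`; the prediction and reference lists below are illustrative placeholders, not the actual validation data:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholder examples; in practice, predictions come from
# model.generate() over the validation split.
predictions = ["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]
references = [["టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(f"SacreBLEU: {result['score']:.4f}")
```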
|
---

## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target languages (English -> Telugu)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```
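Because the model is bidirectional, translating Telugu → English only requires swapping the language codes. Continuing from the snippet above (the Telugu input is an illustrative example meaning "How are you?"):

```python
# Telugu -> English: swap source and target language codes
tokenizer.src_lang = "te_IN"
tokenizer.tgt_lang = "en_XX"

text = "మీరు ఎలా ఉన్నారు?"  # "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the English language token
generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```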
|
|
|
---

## 📦 How to Fine-Tune Further

Use `Seq2SeqTrainer` from Hugging Face Transformers:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation so the decoder produces Telugu; a minimal training setup is sketched below.
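A minimal training sketch, assuming `model` and `tokenizer` are loaded as in the usage example and that `tokenized_train` / `tokenized_eval` are placeholder names for your preprocessed splits; the hyperparameters mirror the Training Details section below:

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    seed=42,
    fp16=True,
    lr_scheduler_type="linear",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # placeholder: preprocessed train split
    eval_dataset=tokenized_eval,    # placeholder: preprocessed validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```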
|
|
|
---

## 🛠️ Training Details

* Optimizer: AdamW
* Learning Rate: 2e-05
* Epochs: 1
* train_batch_size: 8
* eval_batch_size: 8
* seed: 42
* Truncation Length: 128 tokens
* Framework: 🤗 Transformers + Datasets
* Scheduler: Linear
* Mixed Precision: Enabled (fp16)
|
|
|
---

### Training results

| Training Loss | Epoch | Step  | Validation Loss | SacreBLEU |
|:-------------:|:-----:|:-----:|:---------------:|:---------:|
| 0.0455        | 1.0   | 48808 | 0.0511          | 66.9240   |
|
|
|
---

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1
|
|
|
---

## 🏷️ License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
---

## 🤝 Acknowledgements

* 🤗 Hugging Face Transformers
* Facebook AI for mBART50
* HackHedron Parallel Corpus contributors

---

> Created by **Koushik Reddy** – [Hugging Face Profile](https://huggingface.co/Koushim)