---
base_model: facebook/mbart-large-50-many-to-many-mmt
tags:
- translation
- mbart50
- english
- telugu
- hackhedron
- neural-machine-translation
- huggingface
license: apache-2.0
language:
- en
- te
datasets:
- hackhedron
metrics:
- sacrebleu
model-index:
- name: mbart50-en-te-hackhedron
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: HackHedron English-Telugu Parallel Corpus
      type: hackhedron
      args: en-te
    metrics:
    - name: SacreBLEU
      type: sacrebleu
      value: 66.9240
---

# 🌐 mBART50 English ↔ Telugu | HackHedron Dataset

This model is fine-tuned from [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) on the [HackHedron English-Telugu Parallel Corpus](https://huggingface.co/datasets). It supports bidirectional translation between **English ↔ Telugu**.

## 🧠 Model Architecture

- **Base model**: mBART50 (multilingual BART covering 50 languages)
- **Type**: Seq2Seq Transformer
- **Tokenizer**: `MBart50TokenizerFast`
- **Language codes**:
  - `en_XX` for English
  - `te_IN` for Telugu

---

## 📚 Dataset

**HackHedron English-Telugu Parallel Corpus**

- ~390,000 training sentence pairs
- ~43,000 validation pairs
- Format:

```json
{
  "english": "Tom started his car and drove away.",
  "telugu": "టామ్ తన కారును స్టార్ట్ చేసి దూరంగా నడిపాడు."
}
```

---

## 📈 Evaluation

| Metric    | Score  | Validation Loss |
| --------- | ------ | --------------- |
| SacreBLEU | 66.924 | 0.0511          |

> 🧪 Evaluation was performed on the validation set using the Hugging Face `evaluate` library.

---

## 💻 How to Use

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")
tokenizer = MBart50TokenizerFast.from_pretrained("koushik-reddy/mbart50-en-te-hackhedron")

# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Telugu language token
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"],
)

translated = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translated[0])
```

---

## 📦 How to Fine-Tune Further

Use `Seq2SeqTrainer` from Hugging Face:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
```

Make sure to set `forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"]` during generation. A hedged end-to-end sketch is provided in the appendix at the bottom of this card.

---

## 🛠️ Training Details

* Optimizer: AdamW
* Learning rate: 2e-05
* Epochs: 1
* Train batch size: 8
* Eval batch size: 8
* Seed: 42
* Truncation length: 128 tokens
* Scheduler: linear
* Mixed precision: enabled (fp16)
* Framework: 🤗 Transformers + Datasets

---

### Training results

| Training Loss | Epoch | Step  | Validation Loss | BLEU    |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|
| 0.0455        | 1.0   | 48808 | 0.0511          | 66.9240 |

---

### Framework versions

- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1

---

## 🏷️ License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---

## 🤝 Acknowledgements

* 🤗 Hugging Face Transformers
* Facebook AI for mBART50
* HackHedron Parallel Corpus contributors

---

> Created by **Koushik Reddy** – [Hugging Face Profile](https://huggingface.co/Koushim)
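
---

## 📎 Appendix: Fine-Tuning Sketch

The snippet below expands the `Seq2SeqTrainer` outline from the fine-tuning section into a runnable sketch. It is **not** the exact script used to train this model: the data file paths (`train.json`, `valid.json`) and output directory are placeholders, the column names (`english`, `telugu`) follow the dataset format shown above, and the hyperparameters mirror the values listed under Training Details.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "koushik-reddy/mbart50-en-te-hackhedron"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX", tgt_lang="te_IN")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Decode into Telugu when evaluating with predict_with_generate=True
model.generation_config.forced_bos_token_id = tokenizer.lang_code_to_id["te_IN"]

# Placeholder paths: point these at your own copy of the parallel corpus,
# where each record has "english" and "telugu" fields as shown above.
raw = load_dataset("json", data_files={"train": "train.json", "validation": "valid.json"})

max_len = 128  # matches the truncation length used for the original fine-tuning

def preprocess(batch):
    # text_target tokenizes the Telugu side with the target-language settings
    return tokenizer(
        batch["english"],
        text_target=batch["telugu"],
        max_length=max_len,
        truncation=True,
    )

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-en-te-finetuned",  # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    fp16=True,
    seed=42,
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    processing_class=tokenizer,
)

trainer.train()
```

Setting `forced_bos_token_id` on the model's `generation_config` ensures that generation during evaluation starts with the Telugu language token, in line with the guidance in the fine-tuning section above.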