Model Card for English-to-Darija Translation (mBART Fine-tuned Model)
Model Details
Model Description
This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt, tailored for translating English text into Moroccan Darija written in Arabic script. It was trained on a custom dataset of English-Darija sentence pairs, with the aim of capturing the nuances of the Moroccan dialect.
- Developed by: Aicha Lahnouki
- Finetuned from model: facebook/mbart-large-50-many-to-many-mmt
- Model type: Sequence-to-Sequence Translation (mBART architecture)
- Language(s) (NLP): English (en_XX), Darija (ar_AR)
Uses
Direct Use
This model is intended for translating English sentences into Moroccan Darija in Arabic script. It can be used in applications such as translation services, language learning tools, or chatbots.
Bias, Risks, and Limitations
This model was trained on 50% of the dataset provided by DODa, consisting of 45,000 rows, and tested on a sample of 100 sentences. Because of the reduced training data, the model may not capture the full linguistic diversity of English-to-Darija translation. The small test set may also fail to represent the model's performance across all possible inputs, so the reported results may not hold on unseen or more diverse data, where biased or inaccurate translations are possible.
How to Get Started with the Model
You can start using the model for English-to-Darija translation with the following code:
from transformers import pipeline
# Initialize the translation pipeline
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija")
# Translate English to Darija
input_text = "Hello, how are you?"
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR")
print("Translation:", translation[0]['translation_text'])
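To translate several sentences at once, the same pipeline call can be wrapped in a small helper. The sketch below is an illustration rather than part of the original card: `translate_batch` is a hypothetical name, and it assumes the pipeline returns one dictionary with a `translation_text` key per input sentence, as it does for the single-sentence call above.

```python
def translate_batch(pipe, sentences, src_lang="en_XX", tgt_lang="ar_AR"):
    """Translate a list of sentences with a translation-pipeline-like callable.

    `pipe` is expected to behave like the object returned by
    transformers.pipeline("translation", ...): called with a list of strings,
    it returns a list of dicts that each hold a 'translation_text' entry.
    """
    sentences = list(sentences)
    if not sentences:
        # Nothing to translate; avoid calling the model with an empty batch.
        return []
    outputs = pipe(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
    return [out["translation_text"] for out in outputs]


# Usage with the real model (downloads the weights on first run):
#   from transformers import pipeline
#   pipe = pipeline("translation", model="alpha2002/eng_alpha_darija",
#                   tokenizer="alpha2002/eng_alpha_darija")
#   for line in translate_batch(pipe, ["Hello, how are you?", "Thank you."]):
#       print(line)
```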
Training Details
Training Data
The model was trained on a custom dataset containing parallel English and Darija sentences. The dataset was preprocessed to include language tokens specific to mBART's requirements.
Training Procedure
Preprocessing
The English text was tokenized with the en_XX language token, and the Darija text with the ar_AR language token.
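To make the language tokens concrete: mBART-50 prefixes each sequence with its language code and closes it with an end-of-sequence token, giving the layout `[lang_code] X </s>`. The sketch below only illustrates that layout on whitespace-split words. The real tokenizer works on SentencePiece subwords, `add_language_token` is a hypothetical helper and not a Transformers function, and the Darija string is a placeholder rather than model output.

```python
EOS = "</s>"  # end-of-sequence token used by mBART


def add_language_token(text, lang_code):
    """Return a toy token list in the mBART-50 layout: [lang_code] X </s>."""
    return [lang_code] + text.split() + [EOS]


# Source side: English, tagged with en_XX.
source = add_language_token("Hello, how are you?", "en_XX")
print(source)  # ['en_XX', 'Hello,', 'how', 'are', 'you?', '</s>']

# Target side: Darija, tagged with ar_AR (placeholder text for illustration).
target = add_language_token("السلام", "ar_AR")
print(target)  # ['ar_AR', 'السلام', '</s>']
```

With the actual mBART-50 tokenizer, this is normally done by setting its `src_lang` and `tgt_lang` attributes before encoding, so the tokens do not have to be added by hand.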
Training Hyperparameters
- Training regime: FP16 mixed precision, used to reduce memory use and speed up training. Training ran on Google Colab on a subset of the data, with gradient accumulation to reach a larger effective batch size than fits in GPU memory at once.
Speeds, Sizes, Times
The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library.
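The card states the epoch count, batch size, and FP16 setting, but not the number of gradient accumulation steps or the exact number of training examples. The sketch below shows how those quantities combine into an effective batch size and an approximate number of optimizer steps. The accumulation steps (8) and example count (20,000) are assumptions chosen for illustration only.

```python
import math

# Values stated in this model card.
hparams = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 4,
    "fp16": True,
}

# ASSUMPTIONS for illustration; the card does not state these values.
hparams["gradient_accumulation_steps"] = 8
num_train_examples = 20_000


def effective_batch_size(hp, num_devices=1):
    """Examples that contribute to one optimizer update."""
    return (hp["per_device_train_batch_size"]
            * hp["gradient_accumulation_steps"]
            * num_devices)


def approx_optimizer_steps(hp, n_examples, num_devices=1):
    """Rough count of optimizer updates over all epochs."""
    steps_per_epoch = math.ceil(n_examples / effective_batch_size(hp, num_devices))
    return steps_per_epoch * hp["num_train_epochs"]


print(effective_batch_size(hparams))                        # 32
print(approx_optimizer_steps(hparams, num_train_examples))  # 1250
```

The dictionary keys follow the parameter names of `Seq2SeqTrainingArguments`, so the same values could be passed to it as keyword arguments together with an `output_dir`. Note that `fp16=True` requires a GPU.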
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a small set of held-out test sentences: 100 samples.
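The card does not say how the training subset or the 100 test sentences were drawn. One simple way to produce such a split is sketched below, assuming a random 50% subsample followed by a random hold-out of 100 pairs. The function name, the seed, and the synthetic sentence pairs are all hypothetical.

```python
import random


def subsample_and_split(pairs, fraction=0.5, n_test=100, seed=0):
    """Keep a random fraction of the pairs, then hold out n_test of them."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    kept = shuffled[: int(len(shuffled) * fraction)]
    if n_test >= len(kept):
        raise ValueError("n_test must be smaller than the number of kept pairs")
    test, train = kept[:n_test], kept[n_test:]
    return train, test


# Synthetic stand-in for the English-Darija sentence pairs.
pairs = [(f"english sentence {i}", f"darija sentence {i}") for i in range(1000)]
train, test = subsample_and_split(pairs)
print(len(train), len(test))  # 400 100
```

Holding the test pairs out before training matters here: with only 100 test sentences, any overlap with the training data would inflate the score noticeably.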
Metrics
BLEU score was used to measure translation accuracy.
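For readers unfamiliar with the metric, the sketch below computes a simplified corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, one reference per sentence, and no smoothing. The card does not state which BLEU implementation was used, and standard tools apply their own tokenization, so this will not reproduce the reported score exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Counter intersection clips each n-gram count to the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(score, 1))
```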
Results
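The model achieved a BLEU score of 11.6 on the 100-sentence test set. The author considers this a reasonable result given the difficulty of translating between languages with different scripts and linguistic structures, but with a test set this small the score should be read as a rough indication rather than a precise measure of translation quality.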
Environmental Impact
- Hardware Type: Google Colab GPU (NVIDIA Tesla K80)
- Hours used: Approximately 2 hours for training and 1 hour for testing.
Citation
BibTeX:
@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title  = {English-to-Darija Translation Model},
  year   = {2024},
  url    = {https://huggingface.co/alpha2002/eng_alpha_darija},
}
Model Card Authors
Lahnouki Aicha
Model Card Contact
email: [email protected]