Model Card for English-to-Darija Translation (mBART Fine-tuned Model)

Model Details

Model Description

This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt, adapted for translating English text into Moroccan Darija written in Arabic script. It was trained on a custom dataset of English-Darija sentence pairs, with the aim of capturing the nuances of the Moroccan dialect.

Developed by: Aicha Lahnouki
Finetuned from model: facebook/mbart-large-50-many-to-many-mmt
Model type: Sequence-to-Sequence Translation (mBART architecture)
Language(s) (NLP): English (en_XX), Darija (ar_AR)

Uses

Direct Use

This model is intended for translating English sentences into Moroccan Darija in Arabic script. It can be used in applications such as translation services, language learning tools, or chatbots.

Bias, Risks, and Limitations

This model was trained on 50% of the dataset provided by DODa, consisting of 45,000 rows, and tested on a sample of 100 sentences. Because of the reduced training data, the model may not capture the full linguistic diversity of Darija. The small test set may also not represent the model's performance across all inputs, so translations of unseen or diverse text may be biased or inaccurate.

How to Get Started with the Model

You can start using the model for English-to-Darija translation with the following code:

from transformers import pipeline

# Initialize the translation pipeline
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija")

# Translate English to Darija
input_text = "Hello, how are you?"
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR")

print("Translation:", translation[0]['translation_text'])
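For more control over generation than the pipeline offers, the model and tokenizer can also be called directly. The following is a minimal sketch, not the author's own code: the helper names load and translate and the max_new_tokens default are illustrative assumptions, and whether this checkpoint's configuration already forces the target-language token is not stated in this card, so it is set explicitly here.

```python
def load(model_id="alpha2002/eng_alpha_darija"):
    """Download the fine-tuned checkpoint and its tokenizer from the Hub."""
    # Imported lazily so translate() can be read and tested without
    # transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return model, tokenizer


def translate(text, model, tokenizer, src_lang="en_XX", tgt_lang="ar_AR",
              max_new_tokens=64):
    """Translate one sentence (helper name and defaults are illustrative)."""
    # mBART-50 tokenizers prepend the source language code to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target language code.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,  # assumption: not specified in this card
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Typical use would be model, tokenizer = load() followed by translate("Hello, how are you?", model, tokenizer). Loading the checkpoint requires network access and enough memory for a model of this size.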

Training Details

Training Data

The model was trained on a custom dataset containing parallel English and Darija sentences. The dataset was preprocessed to include language tokens specific to mBART's requirements.

Training Procedure

Preprocessing

The English text was tokenized with the en_XX language token, and the Darija text with the ar_AR language token.
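As an illustration of this step, the sketch below shows one common way to attach mBART language codes when tokenizing parallel data with a Transformers tokenizer. It is an assumption about how the preprocessing might look, not the author's script: the column names "english" and "darija" and the max_length value are hypothetical.

```python
def preprocess(batch, tokenizer, max_length=128):
    """Tokenize a batch of English-Darija pairs for mBART fine-tuning.

    `batch` is a dict with hypothetical columns "english" and "darija",
    each holding a list of strings.
    """
    # The tokenizer prepends these language codes to source and target.
    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "ar_AR"
    # text_target makes the tokenizer return the target ids under "labels".
    return tokenizer(
        batch["english"],
        text_target=batch["darija"],
        max_length=max_length,  # assumption: not stated in this card
        truncation=True,
    )
```

With a Hugging Face datasets object, this function could be applied through dataset.map(lambda b: preprocess(b, tokenizer), batched=True).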

Training Hyperparameters

Training regime: FP16 mixed precision was used during training to improve training speed and memory use. Training was run on Google Colab on a subset of the data, with gradient accumulation to reach a larger effective batch size than fits in GPU memory at once.

Speeds, Sizes, Times

The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library.
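To make the interaction between batch size and gradient accumulation concrete, here is a small worked calculation. The batch size and epoch count come from this card; the number of accumulation steps and the number of training examples are not stated and are assumptions chosen only for illustration.

```python
import math

per_device_batch_size = 4    # from this card
num_epochs = 2               # from this card
grad_accum_steps = 8         # assumption: not stated in this card
num_train_examples = 22_500  # assumption: illustrative dataset size

# Gradients from several small batches are summed before each optimizer step,
# so the optimizer effectively sees a larger batch.
effective_batch_size = per_device_batch_size * grad_accum_steps

# Approximate number of optimizer updates per epoch and in total.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print("effective batch size:", effective_batch_size)
print("optimizer steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
```

Changing grad_accum_steps trades fewer, larger updates against more, smaller ones without changing peak GPU memory use per forward pass.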

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on a small set of held-out test sentences: 100 samples.

Metrics

BLEU score was used to measure translation accuracy.

Results

The model achieved a BLEU score of 11.6 on the 100-sentence test set. Because the test set is small and English and Darija differ in script and linguistic structure, this score is best read as a rough indication of translation quality rather than a precise measure.
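For readers unfamiliar with the metric, the following is a simplified, self-contained sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams with a brevity penalty, whitespace tokenization, one reference per sentence, no smoothing). This card does not say which BLEU implementation produced the reported score, so numbers from this sketch should not be expected to match it; the example sentences are toy strings.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Toy example with placeholder sentences.
hyps = ["the cat sat on the mat", "he reads a book today"]
refs = ["the cat sat on the mat", "he reads the book today"]
print(round(corpus_bleu(hyps, refs), 1))
```

For reported results, an established implementation with standardized tokenization is generally preferable to a hand-written one.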

Environmental Impact

Hardware Type: Google Colab GPU (NVIDIA Tesla K80)
Hours used: Approximately 2 hours for training and 1 hour for testing.

Citation

BibTeX:

@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title = {English-to-Darija Translation Model},
  year = {2024},
  url = {https://huggingface.co/alpha2002/eng_alpha_darija}
}

Model Card Authors

Lahnouki Aicha

Model Card Contact

email: aichalahnouki@gmail.com

Model size: 0.6B params (Safetensors), tensor type F32
