🕌 English → Moroccan Darija Translator

This repository provides a machine translation model for translating English into Moroccan Darija (الدارجة المغربية).
The model is fine-tuned to handle conversational, cultural, and everyday expressions, producing natural Moroccan Darija output.


🚀 Model Details

  • Model ID: oddadmix/English-Moroccan-Darija-v1
  • Framework: Hugging Face transformers
  • Task: English → Moroccan Darija translation
  • Language Pair: English → Moroccan Darija
  • Context Window: 32K tokens

📖 Usage

Install the required libraries:

pip install -U transformers torch

Run the translation:

from transformers import pipeline

model_id = "oddadmix/English-Moroccan-Darija-v1"
translate = pipeline("text-generation", model=model_id)

messages = [
    {"role": "system", "content": "Translate to Moroccan Darija"},
    {"role": "user", "content": "How are you today?"}
]

translation = translate(
    messages,
    max_new_tokens=8000,
    do_sample=True,
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05
)

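# The pipeline returns the whole chat; the last message is the model's reply.
print(translation[0]["generated_text"][-1]["content"])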

Example Output:

كيف داير اليوم؟

⚠️ Important Note: The system prompt ({"role": "system", "content": "Translate to Moroccan Darija"}) is crucial.
Without it, the model will not perform at its best capacity.
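
One way to make sure the system prompt is never left out is to build the messages through a small helper. The sketch below is illustrative only: `SYSTEM_PROMPT`, `build_messages`, and `translate_text` are hypothetical names, not part of the model or of transformers, and `generator` is assumed to be a chat-capable `pipeline("text-generation", ...)` object like the one in the usage example.

```python
from typing import Callable, Dict, List

# The exact system prompt from the usage example.
SYSTEM_PROMPT = "Translate to Moroccan Darija"


def build_messages(text: str) -> List[Dict[str, str]]:
    """Return a chat with the required system prompt followed by the user text."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


def translate_text(generator: Callable, text: str, **generation_kwargs) -> str:
    """Translate one English string with a chat-style text-generation pipeline.

    Assumes `generator` returns the usual chat output of such a pipeline:
    a list with one dict whose "generated_text" is the full message list,
    with the assistant's reply as the last entry.
    """
    result = generator(build_messages(text), **generation_kwargs)
    return result[0]["generated_text"][-1]["content"].strip()
```

With the pipeline from the usage example, a call could look like `translate_text(translate, "How are you today?", max_new_tokens=256, do_sample=True, temperature=0.3, min_p=0.15, repetition_penalty=1.05)`.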


📊 Benchmark Results

The model was evaluated against several strong general-purpose LLMs on the English → Moroccan Darija task, used here as a proxy benchmark.
The evaluation set consists of 300 sentences manually translated by a Moroccan translator.

🧾 Evaluation Dataset Coverage

The dataset spans diverse domains, ensuring wide coverage:

  • Daily Life & Family: greetings, weather, school, transportation, family meals.
  • Food & Cooking: couscous, tagines, vegetables, desserts, cooking instructions.
  • Travel & Geography: Marrakech, Tangier, Casablanca, Rabat, Agadir, public transport.
  • Work & Business: meetings, HR, finance, reports, management.
  • Politics & Government: parliament debates, policies, laws, elections.
  • Arts & Culture: music, painting, poetry, theater, sculpture.
  • Education & Health: doctors, hospitals, lessons, assignments, public health.

This spread of topics is intended to make the benchmark representative of real-world translation scenarios.

| Model | BLEU | METEOR | chrF | Task |
|---|---|---|---|---|
| Claude-Sonnet-4 | 0.312 | 0.566 | 62.09 | English → Moroccan Darija |
| GPT-5-mini | 0.381 | 0.637 | 66.58 | English → Moroccan Darija |
| GPT-5 | 0.284 | 0.551 | 61.73 | English → Moroccan Darija |
| GPT-4.1 | 0.306 | 0.575 | 61.87 | English → Moroccan Darija |
| oddadmix/English-Moroccan-Darija-v1 | 0.423 | 0.644 | 67.31 | English → Moroccan Darija |

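➡️ On this benchmark our model scores highest on all three metrics (BLEU 0.423, METEOR 0.644, chrF 67.31), ahead of GPT-5-mini, the strongest baseline, while delivering specialized Moroccan Darija output across a wide variety of contexts.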
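
To put the reported scores in perspective, the relative gain over the best-scoring baseline can be computed directly from the table. This is a small arithmetic sketch, not part of the original evaluation; the function name `margin_over_best_baseline` is hypothetical.

```python
# Scores copied from the benchmark table (English to Moroccan Darija, 300 sentences).
scores = {
    "Claude-Sonnet-4": {"BLEU": 0.312, "METEOR": 0.566, "chrF": 62.09},
    "GPT-5-mini": {"BLEU": 0.381, "METEOR": 0.637, "chrF": 66.58},
    "GPT-5": {"BLEU": 0.284, "METEOR": 0.551, "chrF": 61.73},
    "GPT-4.1": {"BLEU": 0.306, "METEOR": 0.575, "chrF": 61.87},
    "oddadmix/English-Moroccan-Darija-v1": {"BLEU": 0.423, "METEOR": 0.644, "chrF": 67.31},
}
OURS = "oddadmix/English-Moroccan-Darija-v1"


def margin_over_best_baseline(metric: str) -> tuple:
    """Return (best baseline name, relative gain of our model in percent)."""
    baselines = {name: s[metric] for name, s in scores.items() if name != OURS}
    best = max(baselines, key=baselines.get)
    gain = (scores[OURS][metric] - baselines[best]) / baselines[best] * 100
    return best, round(gain, 1)


for metric in ("BLEU", "METEOR", "chrF"):
    best, gain = margin_over_best_baseline(metric)
    print(f"{metric}: +{gain}% over {best}")
```

The gap is widest on BLEU (about 11% relative to GPT-5-mini) and close to 1% on METEOR and chrF.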


🌍 Applications

  • Translating English educational material into Moroccan Darija.
  • Supporting Moroccan dialect localization for chatbots, apps, and websites.
  • Preserving cultural nuances in translations (not just MSA literalism).

⚠️ Notes

  • Output is optimized for natural conversational Moroccan Darija, not Modern Standard Arabic (MSA).
  • Since Darija is primarily a spoken dialect, spelling conventions may vary slightly.
  • The 32K-token context window lets the model take long documents and multi-turn conversations as input.
  • Always include the system prompt to get the model's best performance.
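
For inputs that may not fit in the context window, one option is to split the text on paragraph boundaries and translate each chunk separately. The sketch below is an assumption-laden illustration: `chunk_paragraphs` is a hypothetical helper, and the character budget is only a crude stand-in for a token count, since the real ratio depends on the model's tokenizer.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 4000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the budget in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own user message, always together with the system prompt.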

🔎 Limitations & Future Work

  • Spelling Variations: Moroccan Darija lacks standardized spelling. The model may generate slight differences (e.g., "بزاف" vs "بزّاف").
  • Code-Switching: Common Darija usage mixes in French and occasionally Spanish. The model currently prioritizes pure Darija but may benefit from code-switching support.
  • Niche Domains: Performance may vary for highly technical or domain-specific text. Future fine-tuning on specialized datasets could improve this.
  • Evaluation Scope: Current evaluation is based on 300 manually translated sentences across diverse fields. Expanding the dataset will strengthen benchmarks further.
  • Future Improvements:
    • Add multilingual code-switching training.
    • Expand context-specific datasets (education, health, e-commerce).
    • Release an interactive demo/Colab notebook for easier testing.
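
Because spelling is not standardized, exact string comparison between a model output and a reference can miss matches that differ only in diacritics, such as the "بزاف" vs "بزّاف" pair mentioned in the limitations. The sketch below shows one possible normalization step; it assumes that dropping Arabic short-vowel marks and the shadda is acceptable for your comparison, and nothing in this card indicates that the reported benchmark used it. `strip_diacritics` is a hypothetical helper name.

```python
import re

# Arabic harakat (fathatan through sukun, U+064B to U+0652, which includes
# the shadda U+0651) plus the superscript alef U+0670.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks while leaving the base letters untouched."""
    return _ARABIC_DIACRITICS.sub("", text)


def same_ignoring_diacritics(a: str, b: str) -> bool:
    """Compare two strings after removing diacritics from both."""
    return strip_diacritics(a) == strip_diacritics(b)
```

With this helper, `same_ignoring_diacritics("بزاف", "بزّاف")` treats the two spellings as equal, while genuinely different words still compare as different.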

📬 Contact

For feedback, contributions, or collaborations, please open an issue or reach out.
