metadata

datasets:
  - atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
  - ar
pipeline_tag: feature-extraction

Moroccan Darija Embedding Models

This repository contains word embedding models for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText-based embeddings trained on the curated AL-Atlas dataset of Moroccan Darija text (atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset).

Features

FastText embeddings: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
Efficient training pipeline: Code for training FastText embeddings on Moroccan Darija datasets.
Pre-trained models: Ready-to-use embeddings for downstream NLP tasks, available on the Hugging Face Hub.
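As a rough illustration of what training FastText embeddings on a Darija corpus can look like, here is a minimal sketch. It is not the repository's actual pipeline: the helper names (`write_corpus`, `train_cbow`), the whitespace-only cleaning step, and the hyperparameter values are all assumptions chosen for illustration. Only `fasttext.train_unsupervised` and `save_model` come from the fasttext Python package.

```python
def write_corpus(lines, path):
    """Write one whitespace-normalized, non-empty text per line (UTF-8).

    Hypothetical helper: fastText expects a plain-text file as input,
    so raw texts are flattened to one entry per line here.
    Returns the number of lines written.
    """
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            text = " ".join(line.split())  # collapse runs of whitespace
            if text:
                f.write(text + "\n")
                kept += 1
    return kept


def train_cbow(corpus_path, output_path, dim=100, epoch=5, min_count=5):
    """Train a CBOW FastText model and save it as a .bin file.

    The hyperparameter defaults are illustrative assumptions, not the
    values used for the released models.
    """
    import fasttext  # imported lazily so write_corpus works without it

    model = fasttext.train_unsupervised(
        str(corpus_path),
        model="cbow",        # "skipgram" is the other unsupervised option
        dim=dim,             # embedding dimension
        epoch=epoch,         # passes over the corpus
        minCount=min_count,  # ignore words seen fewer times than this
    )
    model.save_model(str(output_path))
    return model
```

A typical call would be `write_corpus(texts, "corpus.txt")` followed by `train_cbow("corpus.txt", "my_model.bin")`, where `texts` is any iterable of Darija strings and both file names are placeholders.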

Installation

Clone the GitHub repository and install the required dependencies:

git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt

Usage

Loading Pre-trained Embeddings

You can load the trained FastText model using the fasttext Python library:

import fasttext

# Download the model file from the Hub first:
# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")

# Look up the embedding of a single word (returns a NumPy array)
word_vector = model.get_word_vector("كلمة")
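
Once vectors are loaded, a common next step is comparing words by cosine similarity. The sketch below shows one way to do that; `cosine_similarity` and `word_similarity` are hypothetical helpers written for this example, not part of the repository or of fasttext. They only assume that the model object exposes `get_word_vector`, as in the snippet above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Similarity of two words under any model with get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )


# Usage, assuming the model file has been downloaded locally:
#   import fasttext
#   model = fasttext.load_model("fasttext_cbow_v0.bin")
#   score = word_similarity(model, "كلمة", "كلمات")
```

Because FastText composes vectors from character n-grams, `get_word_vector` also returns a vector for words not seen during training, so this comparison works for spelling variants as well. For neighbor lookups, the fasttext package also provides `model.get_nearest_neighbors(word, k)`.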

Roadmap

✅ FastText embeddings
⏳ Word2Vec and GloVe embeddings
⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
⏳ Sentence embeddings: Continue training the MoRdern-Bert model.

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.