metadata
datasets:
- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
- ar
pipeline_tag: feature-extraction
Moroccan Darija Embedding Models
This repository contains word embedding models for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText embeddings trained on the curated Al Atlas dataset of Moroccan Darija text (atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset).
Features
- FastText embeddings: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
- Efficient training pipeline: Code for training FastText embeddings on Moroccan Darija datasets.
- Pre-trained models: Ready-to-use embeddings for downstream NLP tasks are available on the Hugging Face Hub.
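To illustrate what "subword information" means in practice, here is a minimal sketch of how a word can be split into character n-grams with boundary markers, which is the idea FastText builds on. The function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters mirrors the FastText library defaults; the settings used for the models in this repository are not stated here, so treat them as an assumption. The real library also hashes n-grams into buckets, which this sketch omits.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word`, wrapped in '<' and '>' boundary markers.

    Simplified illustration only: FastText additionally hashes each n-gram
    into a fixed number of buckets and keeps the whole word as its own token.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, for a short English example.
print(char_ngrams("where", 3, 3))  # → ['<wh', 'whe', 'her', 'ere', 're>']

# The same function works on Arabic-script words such as the one used in the usage example.
print(len(char_ngrams("كلمة", 3, 3)))  # → 4
```

Because a word vector is assembled from its n-gram vectors, spelling variants that share most of their n-grams end up with similar embeddings, which is useful for a dialect without standardized orthography.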
Installation
Clone the GitHub repository and install the required dependencies:
git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt
Usage
Loading Pre-trained Embeddings
You can load the trained FastText model using the fasttext Python package:
import fasttext
# Download the model file from the Hub first: https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")
# Get the embedding vector for a single word
word_vector = model.get_word_vector("كلمة")
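Once vectors can be looked up, a common next step is comparing two words. The following is a minimal sketch under some assumptions: the helper names `cosine_similarity`, `word_similarity`, and `load_model_from_hub` are hypothetical, and it assumes the file `fasttext_cbow_v0.bin` sits at the root of the `atlasia/Moroccan-Darija-Embedding` Hub repository and that `huggingface_hub` is installed. Cosine similarity is a standard choice for word vectors, not necessarily the metric the authors use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Compare two words using any object that exposes get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )


def load_model_from_hub(
    repo_id="atlasia/Moroccan-Darija-Embedding",  # taken from the URL in the usage example
    filename="fasttext_cbow_v0.bin",              # assumed to be at the repository root
):
    """Download the model file from the Hugging Face Hub and load it with fasttext."""
    # Imported lazily so the similarity helpers work without these packages installed.
    import fasttext
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return fasttext.load_model(local_path)
```

With the model loaded via `model = load_model_from_hub()`, calling `word_similarity(model, "كلمة", other_word)` returns a score between -1 and 1, where higher values indicate more similar embeddings.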
Roadmap
- ✅ FastText embeddings
- ⏳ Word2Vec and GloVe embeddings
- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
- ⏳ Sentence embeddings: Continue training the MoRdern-Bert model.
Contributing
Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.