---
datasets:
- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
- ar
pipeline_tag: feature-extraction
---
|
# Moroccan Darija Embedding Models |
|
|
|
This repository contains word embedding models for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText embeddings trained on the curated [Al Atlas dataset](https://huggingface.co/datasets/atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset), a corpus of Moroccan Darija text.
|
|
|
## Features |
|
- **FastText embeddings**: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages. |
|
- **Efficient training pipeline**: Code for training FastText embeddings on Moroccan Darija datasets. |
|
- **Pre-trained models**: Ready-to-use embeddings for downstream NLP tasks are available on the [Hugging Face Hub](https://huggingface.co/atlasia/Moroccan-Darija-Embedding).
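
To make the subword point above concrete, here is a minimal sketch of how a word can be split into character n-grams with boundary markers, which is the idea FastText builds on. The function name `char_ngrams` is hypothetical and not part of the `fasttext` package. The defaults `minn=3` and `maxn=6` mirror FastText's documented defaults for unsupervised models, but the values used to train the models in this repository are an assumption, since they are not stated here.

```python
from typing import List


def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> List[str]:
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    Illustrative only: FastText additionally hashes n-grams into buckets and
    keeps a separate vector for the whole word.
    """
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
print(char_ngrams("where", minn=3, maxn=3))
print(char_ngrams("كلمة", minn=3, maxn=3))
```

Because an unseen or misspelled word still shares many n-grams with known words, a vector can be composed for it from those shared pieces, which is useful for a dialect with variable spelling.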
|
|
|
## Installation |
|
Clone the [GitHub repository](https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git) and install the required dependencies:
|
|
|
```bash
git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt
```
|
|
|
## Usage |
|
### Loading Pre-trained Embeddings |
|
You can load the trained FastText model with the `fasttext` Python package:
|
|
|
```python
import fasttext

# Download the model file from the Hub first:
# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")

# Look up the embedding of a single word (returned as a NumPy array)
word_vector = model.get_word_vector("كلمة")
```
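
Once vectors are loaded, a common next step is comparing two words by cosine similarity. The sketch below uses plain Python so it runs without the model file. The helper name `cosine_similarity` is hypothetical (it is not part of `fasttext`), and the toy vectors stand in for the arrays that `model.get_word_vector(...)` would return.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors; with a loaded model you would pass
# model.get_word_vector(word_1) and model.get_word_vector(word_2) instead.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.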
|
|
|
## Roadmap |
|
- ✅ FastText embeddings |
|
- ⏳ Word2Vec and GloVe embeddings |
|
- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa) |
|
- ⏳ Sentence embeddings: continue training the [MorDern-Bert](https://github.com/BounharAbdelaziz/MorDern-Bert) model.
|
|
|
## Contributing |
|
Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase. |