	---
	datasets:
	- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
	language:
	- ar
	pipeline_tag: feature-extraction
	---
	# Moroccan Darija Embedding Models

	This repository contains word embedding models for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText embeddings trained on the curated [Al Atlas dataset](https://huggingface.co/datasets/atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset), a corpus of Moroccan Darija text.

	## Features
	- FastText embeddings: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
	- Efficient training pipeline: Code for training FastText embeddings on Moroccan Darija datasets.
	- Pre-trained models: Ready-to-use embeddings for downstream NLP tasks are available on the [Hugging Face Hub](https://huggingface.co/atlasia/Moroccan-Darija-Embedding).
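
To illustrate what "subword information" means in the first feature above, here is a minimal sketch of FastText-style character n-gram extraction. It is not the library's implementation: the function name `char_ngrams` is hypothetical, the 3 to 6 range mirrors FastText's default `minn`/`maxn` settings, and the real library additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, FastText style (illustrative only)."""
    # FastText wraps each word in boundary markers so that prefixes and
    # suffixes produce n-grams distinct from word-internal ones.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams only, to keep the output short.
print(char_ngrams("كلمة", min_n=3, max_n=3))
```

A word vector is then built from the vectors of these n-grams together with the vector of the word itself, which is why a misspelled or unseen Darija word can still receive a meaningful embedding.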

	## Installation
	Clone the [GitHub repository](https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git) and install the required dependencies:

	```bash
	git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
	cd Moroccan-Darija-Embedding
	pip install -r requirements.txt
	```

	## Usage
	### Loading Pre-trained Embeddings
	You can load the trained FastText model using the `fasttext` Python package:

	```python
	import fasttext

	# Download fasttext_cbow_v0.bin from the hub first:
	# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
	model = fasttext.load_model("fasttext_cbow_v0.bin")

	# Look up the embedding of a single word (returns a NumPy array)
	word_vector = model.get_word_vector("كلمة")
	```
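
A common next step is comparing two words by the cosine similarity of their vectors. The sketch below is an illustration rather than part of this repository: `cosine_similarity` and `word_similarity` are hypothetical helper names, and the only thing assumed about `model` is that it exposes `get_word_vector`, as the loaded FastText model above does.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as dissimilar to everything instead of dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(model, first, second):
    """Compare two words using any model that provides get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(first),
        model.get_word_vector(second),
    )
```

With the model loaded as shown earlier, `word_similarity(model, word_a, word_b)` returns a value between -1 and 1, where higher values mean the two words appear in more similar contexts in the training corpus.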

	## Roadmap
	- ✅ FastText embeddings
	- ⏳ Word2Vec and GloVe embeddings
	- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
	- ⏳ Sentence embeddings: continue training the [MorDern-Bert](https://github.com/BounharAbdelaziz/MorDern-Bert) model.

	## Contributing
	Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.