---
library_name: transformers
datasets:
- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
- ar
---

## Description

This tokenizer is designed for Moroccan Darija, a dialectal variety of Arabic (ISO 639-3 code: `ary`). It was trained with the Byte Pair Encoding (BPE) algorithm on the [atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset](https://huggingface.co/datasets/atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset) dataset.

## Features

- Tokenizes Moroccan Darija text efficiently; see the Moroccan Darija tokenizers [leaderboard](https://huggingface.co/spaces/atlasia/darija-tokenizers-leaderboard) for comparisons.
- Handles dialectal variation and other features specific to Moroccan Darija.
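
## How BPE works (illustrative sketch)

The Description says the tokenizer was trained with Byte Pair Encoding. For readers unfamiliar with the algorithm, the following is a minimal, pure-Python sketch of the core idea: repeatedly merge the most frequent adjacent symbol pair. It is not the code used to train this tokenizer. The function names (`merge_pair`, `learn_bpe_merges`, `encode`), the toy corpus, and the tie-breaking rule are all assumptions made for illustration, and real implementations add pre-tokenization, byte-level fallback, and special tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first (an assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus in Latin script, for illustration only (not taken from the training dataset).
corpus = ["salam", "salam", "sala", "lam"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(encode("salam", merges))
```

Because merges are learned from corpus frequencies, a tokenizer trained on Darija text tends to keep common Darija character sequences together as single tokens, which is the sense in which such a tokenizer can be "efficient" for this language.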