library_name: transformers
datasets:
  - atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
  - ar

Description

This tokenizer is designed for Moroccan Darija, a dialectal variety of Arabic (ISO 639-3 code: ary).
It was trained with the Byte Pair Encoding (BPE) algorithm on the atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset dataset.
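
For readers unfamiliar with BPE, the following is a minimal sketch of how merge rules are learned and applied. It is an illustration on a toy corpus, not the training script used for this tokenizer; the function names `learn_bpe_merges` and `apply_merges` are hypothetical, and real BPE trainers add details such as pre-tokenization, special tokens, and a target vocabulary size.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus (an assumption for illustration, not drawn from the dataset).
merges = learn_bpe_merges(["abab", "abab", "abc"], num_merges=2)
print(merges)                       # [('a', 'b'), ('ab', 'ab')]
print(apply_merges("abc", merges))  # ['ab', 'c']
```

Frequent character sequences end up as single tokens, which is why a BPE vocabulary trained on Darija text tends to split Darija words into fewer pieces than a vocabulary trained on other data.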

Features

  • Tokenizes Moroccan Darija text efficiently (see the Moroccan Darija leaderboard).
  • Provides robust handling of dialectal variations and specific features of Moroccan Darija.
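
One common proxy for tokenization efficiency is fertility: the average number of tokens produced per whitespace-separated word, where lower values mean fewer tokens for the same text. The sketch below shows how such a figure could be computed. It is an assumption for illustration; this card does not state which metric the leaderboard uses. The `tokenize` argument is a hypothetical stand-in for any callable that maps a string to a list of tokens, such as the `tokenize` method of a tokenizer loaded with transformers.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens per whitespace-separated word over `texts`."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


# Stand-in tokenizer for demonstration: one token per non-space character.
def char_tokenize(text):
    return [ch for ch in text if not ch.isspace()]


print(tokens_per_word(["ab cd"], char_tokenize))  # 2.0
```

Running the same function over a held-out Darija sample with two different tokenizers gives a simple side-by-side comparison.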