atlasia/Atlaset
A collection of all available datasets for pretraining LLMs
Note A collection of Moroccan Darija texts (155M tokens) that can be used for pretraining Moroccan Darija language models.
Note A culturally aligned translation benchmark for evaluating machine translation for Moroccan Darija.
Note A collection of 12,743 parallel text and speech samples for Moroccan Darija, with transcriptions in both Latin and Arabic scripts and English translations.
Note A collection of 10,044 parallel text samples of Moroccan Darija sourced from Darija Wikipedia.
Note A collection of 551 parallel text and speech samples of Moroccan Darija sourced from Darija Wikipedia.