Papers
arxiv:2409.17912

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Published on Sep 26
· Submitted by guokan-shang on Oct 2
Authors:
,
,
,
,
,

Abstract

We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., achieving a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource language variants, which are often neglected in favor of data-rich languages by contemporary LLMs.

Community

Paper author Paper submitter
edited Oct 2

Atlas-Chat is the first collection of LLMs developed for Moroccan Arabic (Darija), and for low-resource dialectal Arabic in general.
https://huggingface.co/MBZUAI-Paris/Atlas-Chat-9B

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 12

Browse 12 models citing this paper

Datasets citing this paper 9

Browse 9 datasets citing this paper

Spaces citing this paper 3

Collections including this paper 4