Malaysian Llama-3.1-8B-Instruct

Continued finetuning of https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct on a highly curated 1.5B-token Malaysian instruction dataset.

Improvements

  1. Supports responding in Mandarin, Tamil, Jawi, Manglish, and the Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan, and Terengganu dialects.
  2. Able to write code when prompted in Mandarin, Tamil, Jawi, Manglish, or any of the dialects above.
  3. Handles multi-turn Malaysian context, such as Malaysian legislation, politics, religions, and languages.

Training session

Finetuned on mesolitica/Malaysian-SFT to make the model understand Malaysian context.

How we train

  1. LoRA on ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"].
  2. Rank 128 with alpha 256, i.e. an alpha-to-rank ratio of 2.0.
  3. Multipacking at 8192 context length with proper SDPA causal masking to prevent cross-document contamination, and with correct per-document position ids.
  4. Chunked CCE (cut cross-entropy) loss for LoRA.
  5. WandB at https://wandb.ai/huseinzol05/lora-embedding-128-llama3.1-8b-malaysian-8k?nw=nwuserhuseinzol05
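The multipacking step above can be sketched as follows. `build_packed_inputs` is a hypothetical helper (not the actual training code) showing how per-document position ids and a block-diagonal causal mask keep packed documents from attending to each other:

```python
def build_packed_inputs(doc_lengths, max_len=8192):
    """Pack documents into one sequence with per-document position ids
    and a block-diagonal causal attention mask (True = may attend)."""
    total = sum(doc_lengths)
    assert total <= max_len, "packed documents exceed context length"

    # Position ids restart at 0 for every packed document.
    position_ids = []
    doc_ids = []  # which document each token belongs to
    for d, n in enumerate(doc_lengths):
        position_ids.extend(range(n))
        doc_ids.extend([d] * n)

    # Token j may attend to token i only if same document and i <= j,
    # i.e. causal attention that never crosses a document boundary.
    mask = [[doc_ids[i] == doc_ids[j] and i <= j for i in range(total)]
            for j in range(total)]
    return position_ids, mask

pos, mask = build_packed_inputs([3, 2])
# position ids restart per document: [0, 1, 2, 0, 1]
```

A mask like this (rather than a single full-sequence causal mask) is what prevents the "document contamination" mentioned above: tokens of the second packed document cannot attend to tokens of the first.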

Source code at https://github.com/mesolitica/malaya/tree/master/session/llama3
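As a sanity check on the rank/alpha settings above: LoRA scales its weight update by alpha / rank, so rank 128 with alpha 256 gives the scaling factor of 2.0 mentioned above. A minimal sketch as a plain dictionary (mirroring the listed settings, not an actual peft config object):

```python
lora_config = {
    "r": 128,           # LoRA rank
    "lora_alpha": 256,  # effective update is scaled by lora_alpha / r
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
}

scaling = lora_config["lora_alpha"] / lora_config["r"]
print(scaling)  # 2.0
```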

Benchmark

MalayMMLU

Probability of next tokens

Based on the official 0-shot MalayMMLU first-token accuracy,

                             Model   Accuracy   shot by_letter        category
0  Malaysian-Llama-3.1-8B-Instruct  61.522718  0shot      True            STEM
1  Malaysian-Llama-3.1-8B-Instruct  61.784351  0shot      True        Language
2  Malaysian-Llama-3.1-8B-Instruct  60.610003  0shot      True  Social science
3  Malaysian-Llama-3.1-8B-Instruct  60.254258  0shot      True          Others
4  Malaysian-Llama-3.1-8B-Instruct  62.434585  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Malaysian-Llama-3.1-8B-Instruct
Metric : first
Shot : 0shot
average accuracy 61.276999958699875
accuracy for STEM 61.522717969709376
accuracy for Language 61.784351145038165
accuracy for Social science 60.61000289100896
accuracy for Others 60.254257615735185
accuracy for Humanities 62.43458475540387
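The reported average is the per-category accuracy weighted by the question counts shown above, and can be reproduced directly from the table:

```python
# Question counts and category accuracies copied from the results above.
counts = {"Social science": 6918, "Language": 6288, "Humanities": 4395,
          "Others": 4169, "STEM": 2443}
accuracy = {"STEM": 61.522718, "Language": 61.784351,
            "Social science": 60.610003, "Others": 60.254258,
            "Humanities": 62.434585}

total = sum(counts.values())
average = sum(accuracy[c] * counts[c] for c in counts) / total
print(round(average, 3))  # ~61.277, matching the reported average
```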

The original model, for comparison:

                   Model   Accuracy   shot by_letter        category
0  Llama-3.1-8B-Instruct  64.019648  0shot      True            STEM
1  Llama-3.1-8B-Instruct  65.505725  0shot      True        Language
2  Llama-3.1-8B-Instruct  62.604799  0shot      True  Social science
3  Llama-3.1-8B-Instruct  62.197170  0shot      True          Others
4  Llama-3.1-8B-Instruct  67.167235  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Llama-3.1-8B-Instruct
Metric : first
Shot : 0shot
average accuracy 64.25886920249452
accuracy for STEM 64.0196479738027
accuracy for Language 65.5057251908397
accuracy for Social science 62.60479907487713
accuracy for Others 62.197169585032384
accuracy for Humanities 67.16723549488054
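First-token accuracy scores a question by comparing the model's next-token probabilities over the option letters and taking the argmax. A minimal sketch with made-up logits (`first_token_prediction` is a hypothetical helper, not the benchmark code):

```python
import math

def first_token_prediction(letter_logits):
    """Pick the option whose letter has the highest probability as the
    first generated token, via a softmax over the option letters only."""
    m = max(letter_logits.values())  # subtract max for numerical stability
    exp = {k: math.exp(v - m) for k, v in letter_logits.items()}
    z = sum(exp.values())
    probs = {k: e / z for k, e in exp.items()}
    return max(probs, key=probs.get), probs

pred, probs = first_token_prediction({"A": 1.2, "B": 3.4, "C": 0.5, "D": -0.7})
# pred == "B"; the question counts as correct if "B" is the gold answer
```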

First token match using vLLM

Based on 0-shot exact first-token match using vLLM Guided Decoding,

                             Model   Accuracy  shot        category
0  Malaysian-Llama-3.1-8B-Instruct  58.616455     0            STEM
1  Malaysian-Llama-3.1-8B-Instruct  60.178117     0        Language
2  Malaysian-Llama-3.1-8B-Instruct  57.213067     0  Social science
3  Malaysian-Llama-3.1-8B-Instruct  56.896138     0          Others
4  Malaysian-Llama-3.1-8B-Instruct  59.704209     0      Humanities
Model : Malaysian-Llama-3.1-8B-Instruct
Metric : full
Shot : 0
average accuracy 58.5222814190724
accuracy for STEM 58.616455178059766
accuracy for Language 60.17811704834606
accuracy for Social science 57.213067360508816
accuracy for Others 56.89613816262893
accuracy for Humanities 59.70420932878271

The original model, for comparison:

                   Model   Accuracy  shot        category
0  Llama-3.1-8B-Instruct  58.739255     0            STEM
1  Llama-3.1-8B-Instruct  61.577608     0        Language
2  Llama-3.1-8B-Instruct  57.487713     0  Social science
3  Llama-3.1-8B-Instruct  56.872152     0          Others
4  Llama-3.1-8B-Instruct  63.890785     0      Humanities
Model : Llama-3.1-8B-Instruct
Metric : full
Shot : 0
average accuracy 59.73237517036303
accuracy for STEM 58.73925501432665
accuracy for Language 61.57760814249363
accuracy for Social science 57.487713211910965
accuracy for Others 56.872151595106736
accuracy for Humanities 63.89078498293516
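The exact first-token match metric above constrains generation to the option letters and counts a prediction as correct only when its first token equals the gold letter. A minimal sketch of the scoring (`exact_first_token_accuracy` is a hypothetical helper, not the benchmark code):

```python
def exact_first_token_accuracy(predictions, golds):
    """Exact first-token match: a prediction scores only if its first
    character equals the gold answer letter exactly."""
    assert len(predictions) == len(golds)
    hits = sum(p.strip()[:1] == g for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

acc = exact_first_token_accuracy(["B", "A. Kuala Lumpur", "C"], ["B", "A", "D"])
# 2 of 3 first tokens match the gold letters
```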

Acknowledgement

Special thanks to https://www.sns.com.my for the 8x H100 node!
