Leaderboard is Stuck

#19
by yellowtown - opened

Hello, I noticed that the pending and running tasks on the leaderboard have been stuck for several weeks. Could you please advise when this issue might be resolved? Thank you for your help!

Open Arabic LLM Leaderboard org

@yellowtown thanks for raising it! We have addressed the issue! Most of the models were actually evaluated, but the UI was not reflecting it.

Hello,
I just submitted an evaluation request for the Qwen3-32B model, but I received a warning: "Warning: 'Qwen/Qwen3-32B' with (rev='main', prec='bfloat16') is already in PENDING." However, I do not see this model in either the PENDING list or the RUNNING list. To avoid submitting duplicate requests, could you please confirm whether Qwen3-32B is currently in the queue?
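For reference, here is a rough sketch of how the request status in OALL/requests_v2 could be checked directly. I'm assuming the repo stores one `<org>/<model>_eval_request_*.json` file per submission with a `"status"` field; the actual layout may differ.

```python
# Rough sketch: find the request file(s) for Qwen/Qwen3-32B in the requests
# dataset and print their status. Assumes a <org>/<model>_eval_request_*.json
# layout with a "status" field, which may differ from the actual structure.
import json

from huggingface_hub import HfApi, hf_hub_download

files = HfApi().list_repo_files("OALL/requests_v2", repo_type="dataset")
matches = [f for f in files if f.startswith("Qwen/Qwen3-32B") and f.endswith(".json")]

for filename in matches:
    path = hf_hub_download("OALL/requests_v2", filename, repo_type="dataset")
    with open(path) as fh:
        print(filename, "->", json.load(fh).get("status"))
```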

Also, could you please let me know approximately when the results for this model might appear on the leaderboard?

Thank you.

Open Arabic LLM Leaderboard org

@yellowtown It is in the QUEUE. Should be in the leaderboard next week!

Hi,

I submitted a new model called deep-analysis-research/D2IL-Arabic-Qwen2.5-72B-Instruct-v0.1 five days ago (on June 29). I have noticed that during these five days, the model has remained in the PENDING queue and has not moved to RUNNING. Is there any issue with this model? If so, could you please let me know what the problem is? Thank you very much.

Open Arabic LLM Leaderboard org

Hi @yellowtown
It will be evaluated soon! It is in the queue! Just bear with us!
Thanks for your patience and for showing interest in the leaderboard!

Open Arabic LLM Leaderboard org

Hi @yellowtown
Your model has been added to the leaderboard.
Thanks for your patience!

amztheory changed discussion status to closed

Hi @amztheory

Thanks for your great work. I noticed that the leaderboard results are about 15 points lower across all datasets compared to my own local testing results.

| | Avg | AlGhafa | ArabicMMLU | EXAMS | MadinahQA | AraTrust | ALRAGE | ArbMMLU-HT |
|---|---|---|---|---|---|---|---|---|
| Leaderboard results | 60.24 | 67.1 | 59.31 | 42.83 | 59.44 | 57.92 | 77.65 | 57.44 |
| Local test | 75.94 | 78.55 | 75.21 | 59.22 | 76.9 | 89.86 | 77.81 | 74.09 |

This is very strange. The model is fine-tuned for Arabic from the open-source Qwen2.5-72B-Instruct, so there should not be such a large gap relative to the base model's leaderboard score (72.39).
The detailed report of my local evaluation results is uploaded to Hugging Face for your reference (https://huggingface.co/datasets/deep-analysis-research/details_D2IL-Arabic-Qwen2.5-72B-Instruct-v0.1).
I have run the lighteval evaluation with both the vllm and transformers backends, but neither matches the leaderboard results.
I have also tried downloading the model back from Hugging Face and rerunning the evaluation, but the results still differ from the leaderboard.

Delving into the details, the predictions recorded in the leaderboard results are inconsistent with my local testing, which may be the main reason for the observed ~15-point performance drop. For example, consider the following case:

السؤال التالي هو سؤال متعدد الإختيارات. اختر الإجابة الصحيحة:\n\nاألصول ذات الطبٌعة النقدٌة أو المنتظر تحولها إلى نقدٌة خالل سنة مالٌة\nأ. األصول الثابتة الملموسة\nب. األصول المتداولة\nج. األصول غٌر الملموسة\nد. الشىء مما سبق\nالإجابة:

When using either the transformers or vllm backend, the predicted logits for ['أ', 'ب', 'ج', 'د'] are [-11.76287841796875, -6.820357322692871, -11.485189437866211, -9.27004623413086], with the two frameworks differing by less than 0.1. However, the downloaded leaderboard parquet file reports logits of [-5.15625, -6.78125, -11.5625, -7.65625], which are significantly different. Interestingly, this discrepancy appears in roughly 20% of the cases, while the majority of the logits remain consistent between the two sources. This inconsistency is puzzling and may be a key factor in the performance degradation.
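For reference, this is roughly how I compared the per-choice log-likelihoods between my local run and the downloaded leaderboard details. The file paths and column names below are placeholders based on my assumptions about the details schema, not the exact layout.

```python
# Rough sketch: compare per-choice log-likelihoods between my local lighteval
# run (--save-details) and the parquet file downloaded from the leaderboard
# details. Paths and column names are placeholders, not the exact schema.
import pandas as pd

local = pd.read_parquet("local_details/arabic_mmlu_ht.parquet")
board = pd.read_parquet("leaderboard_details/arabic_mmlu_ht.parquet")

# Join on the full prompt text so the same question is compared on both sides.
merged = local.merge(board, on="full_prompt", suffixes=("_local", "_board"))

def max_choice_gap(row):
    # Largest absolute difference across the per-choice log-likelihoods.
    return max(abs(a - b) for a, b in zip(row["choice_logits_local"],
                                          row["choice_logits_board"]))

gaps = merged.apply(max_choice_gap, axis=1)
print(f"{(gaps > 1.0).mean():.1%} of prompts differ by more than 1.0 on at least one choice")
```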

Given these results, there must be some mismatch in the evaluation settings between my local test and the leaderboard's test.

  1. GPU/Code: The local evaluation runs on 4x A100 (80 GB) GPUs with the latest lighteval. The evaluation command is as follows:
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=$model_dir,dtype=bf16,gpu_memory_utilization=0.9,max_model_length=4096,tensor_parallel_size=4" \
    "examples/tasks/OALL_v2_tasks.txt" \
    --custom-tasks community_tasks/arabic_evals.py \
    --output-dir $output_dir \
    --save-details

Could you provide the specifics of the leaderboard's evaluation setup: hardware (GPU model), lighteval code version, library versions such as transformers (a pip list or Docker image), and the exact evaluation command? For comparison, the snippet after point 2 below shows how I capture my local environment.

  2. FAILED_BENCH? I noticed in the commit history of OALL/requests_v2 that deep-analysis-research/D2IL-Arabic-Qwen2.5-72B-Instruct-v0.1 was at one point marked FAILED_BENCH (https://huggingface.co/datasets/OALL/requests_v2/commit/06f005c85d262e93b1eb331ae1de82fcb93c31a4). Is it possible that there was an error or anomaly during the leaderboard evaluation process?
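For comparison, this is the small snippet I use to capture my local environment (a minimal sketch; it just prints the versions of the packages used in the command above plus the CUDA/GPU info):

```python
# Minimal sketch: print the package versions and GPU used for the local run.
from importlib.metadata import version

import torch

for pkg in ("lighteval", "vllm", "transformers", "torch"):
    print(pkg, version(pkg))

print("CUDA", torch.version.cuda)
print("GPU", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```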

I would really appreciate it if you could help me check and clarify the possible causes for this discrepancy.

Open Arabic LLM Leaderboard org

@yellowtown
Regarding point 1, I will investigate your concerns and update you.
For point 2, well spotted that your model was set to FAILED_BENCH for some time; in reality nothing related to your model's eval run actually failed, which is why it was eventually set to finished.
Thanks for raising your concerns; I will keep you updated.

Open Arabic LLM Leaderboard org

@yellowtown
Over the past week, I investigated the reported scores, and I agree that the scores for all benchmarks apart from ALRAGE were incorrect, which prompted us to hide your model's scores.

After modifying our backend code, we reran the evaluation on your model and are getting almost identical scores to what you reported in your previous message.
The backend issue we identified seems to affect only large-scale models; since we recently changed our evaluation environment, your model appears to be the first >70B model to run into it.
Really appreciate you pointing it out to avoid any future discrepancies!
Your model's scores have been updated on the leaderboard, and congratulations on obtaining the top position!
Keep up the good work @yellowtown
Thanks

@amztheory
Thank you so much for your support; this truly means a lot to me!
I am excited that my model is now ranked first on the leaderboard, as this validates the effectiveness of my technical approach. I will continue to investigate and hope to contribute more to Arabic LLMs.

Looking forward to making the Arabic LLM community more active and prosperous together!
