3LM: A Benchmark for Arabic LLMs in STEM and Code
Why 3LM?
Arabic Large Language Models (LLMs) have seen notable progress in recent years, yet existing benchmarks fall short when it comes to evaluating performance in high-value technical domains. Most evaluations to date have focused on general-purpose tasks like summarization, sentiment analysis, or generic question answering. However, scientific reasoning and programming are essential for a broad range of real-world applications, from education to technical problem-solving.
To address this gap, we introduce 3LM (علم), a multi-component benchmark tailored to evaluate Arabic LLMs on STEM (Science, Technology, Engineering, and Mathematics) subjects and code generation. 3LM is the first benchmark of its kind, designed specifically to test Arabic models on structured reasoning and formal logic, domains traditionally underrepresented in Arabic NLP.
What's in the Benchmark?
3LM is made up of three datasets, each targeting a specific evaluation axis: real-world multiple-choice STEM questions (MCQs), synthetic high-difficulty STEM questions, and translated code generation tasks.
1. Native STEM
The Native STEM benchmark consists of 865 MCQs extracted from authentic Arabic educational content, including textbooks, worksheets, and exam banks for grades 8 through 12. Questions span five core subjects: Physics, Chemistry, Biology, Mathematics, and Geography.
Each question is annotated with metadata including domain and difficulty (on a 1โ10 scale). The data was sourced using a pipeline that combined OCR (including LaTeX math parsing via Pix2Tex), LLM-assisted question-answer extraction, and manual review. This dataset provides a realistic testbed for evaluating factual and conceptual understanding in Arabic models using real educational materials.
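As a rough illustration of the formula-parsing step, the following sketch runs Pix2Tex's LatexOCR model on a pre-cropped equation image; the file name and the surrounding extraction logic are placeholders rather than the actual pipeline code.

# Minimal sketch of the formula-OCR step, assuming a pre-cropped image of a
# single equation. The real pipeline also runs plain-text OCR and LLM-assisted
# question extraction around this step.
from PIL import Image
from pix2tex.cli import LatexOCR

model = LatexOCR()                             # loads the pretrained image-to-LaTeX model
formula_img = Image.open("equation_crop.png")  # placeholder file name
latex_code = model(formula_img)                # returns the formula as a LaTeX string
print(latex_code)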
2. Synthetic STEM
To introduce greater challenge and diversity, we created a synthetic subset of 1,744 MCQs using the YourBench pipeline. This component draws from Arabic textbook text, which is chunked, summarized, and used as input to an LLM-driven question generation system. The result is a curated set of questions focused on mid-to-high difficulty reasoning, including conceptual, analytical, and application-based problems.
Synthetic STEM provides an important counterbalance to native MCQs by probing deeper reasoning skills and minimizing answer bias. All generated questions underwent filtering based on clarity, structure, and content validity, followed by quality assurance via manual review.
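The exact generation setup follows the YourBench pipeline described above; purely to illustrate the chunk-generate-filter flow, here is a simplified sketch in which generate_mcq and estimate_difficulty are hypothetical stand-ins for the LLM-driven steps.

# Simplified, illustrative sketch of the synthetic-question flow
# (chunk -> generate -> filter); not the actual YourBench code.
from typing import Callable, Dict, List, Optional

def chunk_text(text: str, max_chars: int = 2000) -> List[str]:
    # Split a textbook passage into roughly fixed-size chunks.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def build_synthetic_set(passages: List[str],
                        generate_mcq: Callable[[str], Optional[Dict]],
                        estimate_difficulty: Callable[[Dict], int],
                        min_difficulty: int = 5) -> List[Dict]:
    questions = []
    for passage in passages:
        for chunk in chunk_text(passage):
            mcq = generate_mcq(chunk)            # LLM-driven question generation
            if mcq is None:                      # generation failed or ambiguous
                continue
            mcq["difficulty"] = estimate_difficulty(mcq)
            if mcq["difficulty"] >= min_difficulty:  # keep mid-to-high difficulty only
                questions.append(mcq)
    return questions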
3. Arabic Code Benchmarks
The third component of 3LM targets code generation, a growing area of LLM evaluation. We translated and adapted the widely used HumanEval+ and MBPP+ benchmarks into Arabic, creating the first code-generation datasets that test Arabic LLMs on natural-language programming prompts.
We used GPT-4o for prompt translation and validated the results with a backtranslation pipeline, rejecting low-quality samples based on ROUGE-L F1 thresholds (< 0.8). Additional human filtering ensured prompt clarity and correctness. The code and test suites remain unchanged to preserve scoring fidelity. Evaluations use the EvalPlus framework for pass@1 and pass@1+ metrics.
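A minimal sketch of that filtering criterion, using the rouge_score package, is shown below; the 0.8 threshold mirrors the one above, while the function and variable names are illustrative.

# Back-translation quality check: an Arabic prompt is kept only if its
# back-translation scores at least 0.8 ROUGE-L F1 against the original
# English prompt.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def keep_translation(original_en: str, backtranslated_en: str,
                     threshold: float = 0.8) -> bool:
    score = scorer.score(original_en, backtranslated_en)["rougeL"].fmeasure
    return score >= threshold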
Building the Benchmark
Each dataset in 3LM went through a multi-stage development process to ensure data quality, fairness, and representativeness.
For Native STEM, we collected Arabic PDF sources and applied a dual OCR approach to recover both plain text and math formulas. Questions were extracted using LLM-based chunking and pattern recognition, followed by classification into MCQ format with randomized answer order. Final samples were reviewed by native Arabic speakers with STEM expertise to confirm answer validity and readability.
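To make the answer-shuffling step concrete, here is a small sketch that randomizes option order while tracking the gold answer; the record fields are assumptions about the data layout rather than the exact schema.

# Illustrative sketch of answer-order randomization for one extracted MCQ.
# Field names are assumptions; the released schema also carries domain and
# difficulty metadata.
import random

def shuffle_options(question: str, options: list, answer_idx: int,
                    seed: int = 0) -> dict:
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    return {
        "question": question,
        "options": [options[i] for i in order],
        "answer_idx": order.index(answer_idx),  # new position of the correct option
    }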
For Synthetic STEM, the YourBench pipeline was adapted for Arabic input. After ingestion, source documents were summarized, chunked, and then fed to a code-controlled generator for MCQ creation. We filtered out image-dependent or ambiguous content and retained only questions within the target difficulty range. The result is a set of clean, high-quality synthetic Arabic MCQs for STEM.
For the Code Benchmarks, our goal was to isolate language understanding while preserving code logic. Prompt translation was handled by GPT-4o with verification via reverse translation. Code and tests were untouched to allow evaluation parity with English versions. The result is a benchmark where Arabic prompts can be evaluated directly with the EvalPlus toolchain.
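As a sketch of how the Arabic prompts slot into the standard EvalPlus workflow, the snippet below pairs the original HumanEval+ tasks with translated prompts and writes completions in the samples format EvalPlus expects; load_arabic_prompts and generate are hypothetical stand-ins for the translated-prompt store and the model under test.

# Sketch: generate completions from Arabic prompts while reusing EvalPlus's
# original code and tests for scoring.
from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()  # original HumanEval+ tasks, code, and tests

def load_arabic_prompts() -> dict:
    # Hypothetical helper: in practice this would read the translated Arabic
    # prompts; here it simply falls back to the English prompts.
    return {task_id: task["prompt"] for task_id, task in problems.items()}

def generate(prompt: str) -> str:
    # Hypothetical stand-in for the model under test.
    return "    pass\n"

arabic_prompts = load_arabic_prompts()
samples = [
    {"task_id": task_id, "completion": generate(arabic_prompts[task_id])}
    for task_id in problems
]
write_jsonl("samples_ar.jsonl", samples)
# Scoring then uses the unchanged EvalPlus harness, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples_ar.jsonl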
Key Results
We evaluated over 40 LLMs, including Arabic-first, multilingual, and general-purpose base and instruction-tuned models. Evaluation was performed using both multiple-choice accuracy and generative completion metrics.
In the MCQ setting, Qwen2.5-72B-Instruct achieved top performance across both native (71.8%) and synthetic (67.0%) STEM subsets. For completion tasks, Gemma-3-27B showed the strongest results with 43.2% accuracy on STEM answers.
In code generation, GPT-4o demonstrated best-in-class performance on both HumanEval-ar (83.5% pass@1+) and MBPP-ar (63.6% pass@1+). Arabic and English pass@1 scores are strongly correlated (~0.97), suggesting that the careful prompt translation preserved task difficulty and that coding performance transfers across the two languages.
We also examined robustness under distractor perturbation, revealing that instruction-tuned models are significantly more stable than their base counterparts. Prompt engineering and zero-shot design were also shown to meaningfully affect Arabic STEM performance.
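The paper defines its own perturbation protocol; purely to illustrate the idea, the sketch below swaps one distractor for a distractor borrowed from another question and reports the fraction of items whose predicted answer is unchanged, with predict as a hypothetical stand-in for the model being probed.

# Illustrative-only sketch of distractor perturbation and a stability score;
# not the exact protocol used in the paper.
import random

def perturb(item: dict, donor: dict, rng: random.Random) -> dict:
    options = list(item["options"])
    wrong_positions = [i for i in range(len(options)) if i != item["answer_idx"]]
    donor_distractors = [o for j, o in enumerate(donor["options"])
                         if j != donor["answer_idx"]]
    options[rng.choice(wrong_positions)] = rng.choice(donor_distractors)
    return {**item, "options": options}

def stability(items: list, predict, seed: int = 0) -> float:
    rng = random.Random(seed)
    same = sum(predict(item) == predict(perturb(item, rng.choice(items), rng))
               for item in items)
    return same / len(items)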
Evaluation Tooling
We built the benchmark to be easily reproducible using standard tools:
lighteval handles both multiple-choice and open-ended question evaluation for the STEM datasets.
evalplus powers robust pass@1 and pass@1+ code scoring using function-level testing.
All scripts, configs, and evaluation pipelines are available in our GitHub repository, and can be adapted to evaluate any model compatible with HuggingFace Transformers or OpenAI APIs.
Access the Datasets
All three datasets are open-source and hosted on HuggingFace Datasets.
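Once the dataset IDs are known, loading them takes one call to the datasets library; the repository name below is a placeholder, not the actual Hub ID.

# Placeholder repository ID -- substitute the actual 3LM dataset name from the Hub.
from datasets import load_dataset

native_stem = load_dataset("your-org/3LM-native-stem")  # hypothetical ID
print(native_stem)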
Citation
If you use 3LM in your research, please cite us:
@article{boussaha2025threeLM,
title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
journal={arXiv preprint arXiv:2507.15850},
year={2025}
}