A new benchmark called 3LM has been introduced to evaluate large language models (LLMs) in Arabic on science, technology, engineering, and mathematics (STEM) as well as coding tasks. It aims to address the lack of standardized evaluation tools for Arabic LLMs in technical domains, providing a comprehensive suite of tests covering physics, chemistry, biology, mathematics, and programming. Early results indicate that while some models perform well on general Arabic tasks, they lag significantly in STEM and coding proficiency. The developers hope 3LM will drive improvements in Arabic AI models for education and professional applications.
New Benchmark 3LM Assesses Arabic LLMs on STEM and Code Tasks
AI
April 26, 2026 · 4:11 PM