
OALL Benchmark Analysis — Open Arabic LLM Leaderboard Methodology and Results

Deep analysis of the Open Arabic LLM Leaderboard — methodology, v2 improvements, 700+ model submissions, and what OALL scores mean for Arabic AI deployment decisions.


The Open Arabic LLM Leaderboard, launched in May 2024 by the Arabic AI Initiative, TII, and Hugging Face, provides the most widely referenced evaluation framework for Arabic language models. With over 700 models submitted from more than 180 organizations, OALL has become the de facto standard for comparing Arabic LLM quality.

OALL v2 represents a significant methodological improvement over v1. The original leaderboard included machine-translated evaluation tasks that introduced translation artifacts into scoring. OALL v2 removed all machine-translated tasks and replaced them with native Arabic benchmarks: ArabicMMLU, ALRAGE, AraTrust, and MadinahQA. This transition to fully native evaluation ensures that model scores reflect genuine Arabic language competence rather than the ability to process translated text.
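The composite described above can be sketched as a simple average over the four native benchmarks. This is an illustrative sketch, not OALL's actual scoring code, and the exact aggregation the leaderboard applies may weight components differently; the benchmark scores below are invented for demonstration.

```python
# Sketch: an OALL v2-style composite as an unweighted mean of the four
# native Arabic benchmarks. Illustrative only; the leaderboard's real
# aggregation may differ, and the example scores are invented.
def oall_v2_composite(scores: dict[str, float]) -> float:
    required = {"ArabicMMLU", "ALRAGE", "AraTrust", "MadinahQA"}
    missing = required - scores.keys()
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[b] for b in required) / len(required)

example = {"ArabicMMLU": 78.0, "ALRAGE": 72.5, "AraTrust": 80.0, "MadinahQA": 71.1}
print(round(oall_v2_composite(example), 2))  # 75.4
```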

The current leaderboard rankings show Falcon-H1 Arabic 34B in the lead position at 75.36 percent, demonstrating that the hybrid Mamba-Transformer architecture achieves superior Arabic performance at moderate parameter counts. The 7B and 3B Falcon-H1 Arabic models also outperform competitors in their respective size classes, with the 3B model scoring approximately 10 points higher than competing 4B systems.

V2 Benchmark Components

ArabicMMLU provides the knowledge evaluation foundation with 14,575 native Arabic multiple-choice questions curated by Koto et al. from educational exams administered across Arab countries. The benchmark covers all school levels through university across STEM, social sciences, humanities, and Arabic language understanding. Its sourcing from actual Arabic educational materials ensures that questions test genuine Arabic knowledge rather than the ability to process translated English content. Domain-specific performance variation — models excelling at STEM while struggling with Arabic linguistic concepts — reveals the depth and specificity of models’ Arabic training data.
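Scoring a multiple-choice knowledge benchmark of this kind reduces to exact-match accuracy over the model's selected options. The sketch below uses invented items, not actual ArabicMMLU questions:

```python
# Sketch: multiple-choice accuracy scoring of the kind knowledge
# benchmarks such as ArabicMMLU rely on. The predictions and gold
# answers here are invented for illustration.
def mcq_accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

preds = ["A", "C", "B", "D"]
answers = ["A", "C", "C", "D"]
print(mcq_accuracy(preds, answers))  # 0.75
```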

ALRAGE evaluates retrieval-augmented generation performance, testing models’ ability to retrieve relevant Arabic passages and generate accurate responses grounded in the retrieved content. This benchmark is particularly relevant for enterprise Arabic AI deployment, where RAG systems are essential for connecting general-purpose LLMs to organization-specific Arabic knowledge bases. Strong ALRAGE performance indicates that a model can serve as an effective generation backbone for Arabic RAG applications.
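The retrieval half of such a pipeline is commonly measured with recall@k: how many of the relevant Arabic passages appear in the top-k retrieved results. This is a generic sketch of that metric, not ALRAGE's actual protocol; the document IDs are invented.

```python
# Sketch: recall@k over retrieved document IDs -- the retrieval half of
# the RAG capability that ALRAGE-style evaluation probes. Not ALRAGE's
# actual protocol; the IDs below are invented.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["doc3", "doc7", "doc1", "doc9"]
relevant = {"doc1", "doc3"}
print(recall_at_k(retrieved, relevant, k=3))  # 1.0
```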

AraTrust assesses trustworthiness across eight dimensions: truthfulness, ethics, privacy, illegal activities, mental health, physical health, unfairness, and offensive language. Published at COLING 2025 in Abu Dhabi, AraTrust’s 522 human-written questions revealed that GPT-4 achieved the highest trustworthiness scores, while open-source Arabic models like AceGPT 7B and Jais 13B scored below 60 percent. The inclusion of AraTrust in OALL v2’s composite score means that models cannot achieve top rankings without demonstrating safety alongside capability.
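A per-dimension breakdown like AraTrust's can be aggregated as accuracy within each safety dimension. The record format and pass/fail scoring below are assumptions for illustration; AraTrust's real items are human-written multiple-choice questions.

```python
# Sketch: per-dimension trust scores, mirroring AraTrust's
# eight-dimension breakdown. The record format and sample data are
# invented for illustration.
from collections import defaultdict

def per_dimension_scores(results: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["dimension"]] += 1
        correct[r["dimension"]] += int(r["correct"])
    return {d: 100.0 * correct[d] / totals[d] for d in totals}

sample = [
    {"dimension": "privacy", "correct": True},
    {"dimension": "privacy", "correct": False},
    {"dimension": "ethics", "correct": True},
]
print(per_dimension_scores(sample))  # {'privacy': 50.0, 'ethics': 100.0}
```

A model's headline AraTrust score can hide exactly this kind of variation, which is why the dimension-level view matters for deployment.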

MadinahQA evaluates Islamic and cultural knowledge, testing understanding of content deeply rooted in Arabic-Islamic intellectual tradition. This benchmark fills a gap that even comprehensive academic evaluation (ArabicMMLU) may miss — knowledge of Islamic jurisprudence, Quranic interpretation, Hadith sciences, and Islamic history that is fundamental to Arabic-speaking societies but absent from Western educational frameworks.

Leaderboard Architecture and Submission Process

The OALL accepts model submissions through Hugging Face, providing standardized evaluation infrastructure that enables consistent comparison across models. The submission process includes automated benchmark execution, score computation, and leaderboard ranking — eliminating the inconsistency that manual self-reported benchmarks introduce.

The leaderboard architecture supports multiple model categories: base models (pre-trained without instruction tuning), chat models (instruction-tuned for conversational interaction), and specialized models (fine-tuned for specific tasks). This categorization enables fair comparison within model types — comparing a base Jais model against a base Falcon model rather than against an instruction-tuned ALLaM variant that benefits from additional training.

The ecosystem scope is reflected in the submission volume: over 700 models from more than 180 organizations across the global Arabic AI community. Submissions include models developed in the UAE, Saudi Arabia, Qatar, Egypt, Morocco, China, the United States, Europe, and other regions, providing a comprehensive snapshot of global Arabic AI capability. The diversity of submitting organizations — spanning government research institutes, universities, private companies, and individual researchers — confirms that the OALL serves the entire Arabic AI community rather than a narrow subset.

Score Interpretation and Deployment Implications

OALL composite scores require careful interpretation for deployment decisions. The composite averages across four very different evaluation dimensions — knowledge, retrieval-augmented generation, trustworthiness, and cultural knowledge — and a model’s composite score may mask significant variation across dimensions. An organization deploying Arabic AI for healthcare should weight AraTrust (safety) and ArabicMMLU STEM (knowledge accuracy) more heavily than MadinahQA (cultural knowledge). An organization deploying Arabic AI for educational content should prioritize ArabicMMLU humanities and Arabic language understanding scores.
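The deployment-specific weighting described above can be made concrete with a weighted composite. The weights and scores below are illustrative assumptions, not recommended values; the point is that a healthcare deployment might up-weight AraTrust and ArabicMMLU and down-weight MadinahQA.

```python
# Sketch: re-weighting OALL component scores for a specific deployment
# profile. Weights and scores are illustrative assumptions only.
def weighted_composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_w = sum(weights.values())
    return sum(scores[b] * w for b, w in weights.items()) / total_w

scores = {"ArabicMMLU": 78.0, "ALRAGE": 72.0, "AraTrust": 81.0, "MadinahQA": 70.0}
healthcare = {"ArabicMMLU": 0.35, "ALRAGE": 0.2, "AraTrust": 0.4, "MadinahQA": 0.05}
print(round(weighted_composite(scores, healthcare), 2))  # 77.6
```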

The OALL’s size-class stratification enables efficiency-aware evaluation. Falcon-H1 Arabic’s 3B model, at 61.87 percent, outperforms several 4B+ models, demonstrating that architectural innovation (the hybrid Mamba-Transformer design) can compensate for parameter count; the 7B model at 71.47 percent exceeds some 9-10B models, and the 34B flagship at 75.36 percent exceeds 70B+ pure-transformer models. These cross-size comparisons inform deployment decisions where computational cost matters: deploying a 7B model at roughly one-tenth the cost of a 70B model while achieving comparable performance represents a significant economic advantage.
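One crude way to make these cross-size comparisons quantitative is score per billion parameters, using the Falcon-H1 Arabic figures quoted above. This is a rough efficiency proxy only; latency, memory footprint, and serving cost matter at least as much in practice.

```python
# Sketch: OALL score per billion parameters as a crude efficiency proxy,
# using the Falcon-H1 Arabic figures cited in the text.
falcon_h1 = {"3B": (61.87, 3), "7B": (71.47, 7), "34B": (75.36, 34)}

for name, (score, params_b) in falcon_h1.items():
    print(f"{name}: {score / params_b:.2f} OALL points per billion params")
```

The smallest model is by far the most efficient per parameter, which is exactly the pattern that makes size-class stratification useful.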

Contamination risk affects score interpretation. Models showing significantly higher OALL scores than BALSAM scores (private test sets) likely benefit from benchmark data memorization. Organizations making deployment decisions should consider BALSAM and SILMA ABB scores alongside OALL scores to validate that performance reflects genuine capability.
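A simple screening heuristic for the public-versus-private gap described above can be sketched as follows. The 5-point threshold is an illustrative choice, not an established standard, and a large gap is a prompt for investigation rather than proof of memorization.

```python
# Sketch: flagging possible benchmark contamination by comparing a
# model's public-leaderboard score (e.g. OALL) against its score on a
# private test set (e.g. BALSAM). The threshold is illustrative.
def contamination_flag(public_score: float, private_score: float,
                       threshold: float = 5.0) -> bool:
    """True if the public score exceeds the private score suspiciously."""
    return (public_score - private_score) > threshold

print(contamination_flag(public_score=74.0, private_score=65.0))  # True
print(contamination_flag(public_score=74.0, private_score=72.5))  # False
```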

Future Evolution

The OALL’s trajectory suggests continued evolution toward more comprehensive and rigorous Arabic evaluation. The v1-to-v2 transition — removing translated benchmarks in favor of native Arabic evaluation — established a precedent for methodological refinement. Future versions may incorporate multi-modal evaluation (Arabic image understanding, Arabic video comprehension), Arabic code generation assessment, Arabic agent evaluation benchmarks, and dialect-specific performance stratification.

The Arabic evaluation ecosystem’s growth from a handful of translated benchmarks to over 40 native Arabic evaluation frameworks in under three years reflects the rapid maturation of the field. As Arabic AI deployment expands from research prototypes to production systems serving millions of users, evaluation rigor becomes increasingly critical — and the OALL’s role as the community standard for Arabic LLM comparison positions it as the foundation of this evaluation infrastructure.

OALL Version 2 Architecture and Evaluation Philosophy

The OALL’s transition from version 1 to version 2 represents a philosophical shift in Arabic AI evaluation. Version 1 included machine-translated benchmarks — English evaluation tasks translated to Arabic — that inadvertently advantaged models trained on translated content. These models could score well on translated benchmarks by recognizing translation artifacts rather than demonstrating genuine Arabic language understanding. Version 2 eliminated translated tasks entirely, replacing them with four native Arabic benchmarks that test Arabic language capability directly.

ArabicMMLU provides the knowledge evaluation dimension with 14,575 questions sourced from educational exams across Arab countries. The questions cover STEM, social sciences, humanities, and Arabic language at all school levels through university — testing the breadth of knowledge that Arabic speakers acquire through Arabic-medium education. Models performing well on ArabicMMLU demonstrate knowledge coverage that extends beyond the narrow domains that specialized training might produce.

ALRAGE evaluates retrieval-augmented generation — the ability to retrieve relevant Arabic information and integrate it into coherent responses. This benchmark directly assesses the RAG capability that enterprise Arabic AI deployments depend on, making ALRAGE performance a practical predictor of deployment quality for organizations implementing Arabic knowledge retrieval systems.

AraTrust’s 522 human-written questions evaluate trustworthiness across eight dimensions: truthfulness, ethics, privacy, illegal activities, mental health, physical health, unfairness, and offensive language. This evaluation separates accuracy from trustworthiness — revealing that some models that score well on knowledge benchmarks perform poorly on safety and cultural alignment dimensions. GPT-4 scored highest overall, while some open-source Arabic models scored below 60 percent.

MadinahQA evaluates Islamic and cultural knowledge — a dimension unique to Arabic AI evaluation that reflects the cultural context within which Arabic LLMs operate. This benchmark tests knowledge that is foundational to Arabic-speaking societies but absent from Western AI evaluation frameworks.

OALL Community Impact and Ecosystem Development

The OALL’s impact extends beyond evaluation to ecosystem development. The 700+ model submissions from 180+ organizations demonstrate community engagement at a scale that validates the open evaluation approach. Researchers submit fine-tuned variants, architectural experiments, and novel training approaches — each submission contributing to the community’s understanding of what drives Arabic AI capability.

The leaderboard creates competitive dynamics that accelerate model improvement. Organizations seeing competitors achieve higher OALL scores invest in training data quality, architectural innovation, and evaluation methodology to improve their own rankings. This competitive pressure benefits Arabic speakers by ensuring that multiple organizations invest in Arabic AI capability rather than treating Arabic as a secondary consideration.

The OALL’s open-sourced evaluation code enables reproducible assessment that prevents evaluation gaming. Organizations can verify their own results before submission and verify competitors’ results after publication. This transparency — enabled by the open-weight availability of participating models — provides evaluation integrity that proprietary model leaderboards cannot match.

For the MENA AI startup ecosystem — $858 million in AI VC during 2025, 664 AI companies in Saudi Arabia — the OALL provides standardized evaluation criteria that inform model selection decisions. Startups building Arabic AI applications can evaluate candidate foundation models against OALL benchmarks, selecting the model whose evaluation profile best matches their application requirements. This standardized evaluation reduces the model selection risk that early Arabic AI adopters faced when benchmarking infrastructure was limited.

OALL’s Role in Arabic AI Market Development

The OALL’s standardized evaluation framework serves a market development function beyond technical assessment. Investors evaluating Arabic AI startups use OALL scores as objective capability indicators. Enterprise customers selecting Arabic LLM providers reference OALL rankings as part of procurement decisions. Government agencies specifying Arabic AI requirements cite OALL benchmarks as minimum performance standards. This market infrastructure function — providing the common evaluation language that buyers, sellers, and investors use to communicate about Arabic AI capability — accelerates commercial Arabic AI adoption by reducing information asymmetry between model developers and model deployers.

The leaderboard’s evolution from version 1 (including translated tasks) to version 2 (exclusively native Arabic benchmarks) demonstrates that evaluation infrastructure must evolve with the field it evaluates. Future OALL versions may add multimodal evaluation, dialectal disaggregation (reporting per-dialect performance rather than aggregate scores), agentic capability assessment, and efficiency metrics (performance per parameter, per FLOP, per dollar). Each evolution will reshape Arabic AI competitive dynamics by defining what matters — ensuring that Arabic AI development optimizes for genuine Arabic language capability rather than benchmark-specific tricks.

Multi-Track Evaluation Beyond LLM Performance

The OALL evaluates Arabic AI across multiple tracks beyond the core LLM leaderboard. The embedding track evaluates Arabic text embedding quality — critical for RAG systems that depend on accurate semantic representation. The retrieval track evaluates document retrieval accuracy for Arabic queries. The RAG generation track evaluates end-to-end retrieval-augmented generation quality. The speech-to-text track evaluates Arabic ASR accuracy. The OCR track evaluates Arabic optical character recognition.

This multi-track structure reflects the reality that Arabic AI deployment involves multiple capabilities working together. An Arabic voice agent requires ASR accuracy (speech track), language understanding (LLM track), knowledge retrieval (retrieval/RAG tracks), and text representation (embedding track). Multi-track evaluation enables organizations to assess models across all relevant capabilities rather than optimizing solely for generation quality.
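A multi-track requirement check of the kind described above can be sketched as a per-track minimum-score filter. The track names follow the OALL tracks discussed here, but the candidate scores and thresholds are invented for illustration.

```python
# Sketch: checking a candidate model against per-track minimums for an
# Arabic voice-agent deployment. Scores and thresholds are invented;
# track names follow the OALL tracks described in the text.
def meets_requirements(track_scores: dict[str, float],
                       minimums: dict[str, float]) -> list[str]:
    """Return the tracks where the model falls short of the minimum."""
    return [t for t, floor in minimums.items()
            if track_scores.get(t, 0.0) < floor]

candidate = {"llm": 74.0, "speech_to_text": 62.0, "retrieval": 70.0, "embedding": 68.0}
voice_agent = {"llm": 70.0, "speech_to_text": 65.0, "retrieval": 65.0, "embedding": 60.0}
print(meets_requirements(candidate, voice_agent))  # ['speech_to_text']
```

An empty result would mean the candidate clears every track minimum; here the ASR track is the blocker, which no amount of LLM-track headroom compensates for.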

The 700+ submissions from 180+ organizations across all tracks create the most comprehensive public database of Arabic AI capability assessment available. This scale enables meta-analyses revealing ecosystem-wide trends — Arabic AI capability progression over time, architectural approach convergence, and the relationship between model scale and Arabic-specific performance.

Institutional Impact on Arabic AI Development

The OALL’s institutional impact extends beyond technical evaluation. By establishing a common measurement framework, the OALL enables meaningful competition between Arabic LLM developers — G42/MBZUAI with Jais, HUMAIN with ALLaM, and TII with Falcon — that drives rapid improvement. Each leaderboard update catalyzes investment in model development as organizations seek to maintain or improve their ranking. This competitive dynamic, enabled by standardized evaluation, has accelerated Arabic AI development faster than any single organization could achieve independently.
