
This dashboard tracks Arabic large language model performance across the major Arabic AI benchmarks, providing a consolidated view of model capabilities as evaluated by the Open Arabic LLM Leaderboard (OALL), ArabicMMLU, AraTrust, BALSAM, and SILMA ABB. Performance data is updated quarterly as new models are released and benchmark evaluations are published. All scores reference official OALL submissions or published benchmark results.

OALL Leaderboard — Top Performers (Q1 2026)

Rank | Model                | Params | OALL Score | Organization
-----|----------------------|--------|------------|-------------
1    | Falcon-H1 Arabic 34B | 34B    | 75.36%     | TII
2    | Falcon-H1 Arabic 7B  | 7B     | 71.47%     | TII
3    | Jais 2 70B           | 70B    | ~73%       | G42/MBZUAI
4    | ALLaM 34B            | 34B    | ~70%       | HUMAIN
5    | Falcon-H1 Arabic 3B  | 3B     | 61.87%     | TII

Scores marked with ~ are approximate; official OALL rankings for those models have not been published, and exact rankings may vary by evaluation date.
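
To make that caveat concrete, the table above can be held as a small data structure and re-sorted by raw score. This is an illustrative sketch only, not an official OALL export format; note how a naive score sort moves the approximate Jais 2 entry above the officially scored 7B model, which is why published ranks and score order can disagree.

```python
# Illustrative sketch: the leaderboard rows above as a small data structure.
# Scores and "approximate" flags mirror the table; not an official OALL format.
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    params_b: int       # parameter count, billions
    score: float        # OALL score, percent
    approximate: bool   # True where no official OALL ranking exists
    org: str

entries = [
    Entry("Falcon-H1 Arabic 34B", 34, 75.36, False, "TII"),
    Entry("Falcon-H1 Arabic 7B",   7, 71.47, False, "TII"),
    Entry("Jais 2 70B",           70, 73.0,  True,  "G42/MBZUAI"),
    Entry("ALLaM 34B",            34, 70.0,  True,  "HUMAIN"),
    Entry("Falcon-H1 Arabic 3B",   3, 61.87, False, "TII"),
]

# Sorting by raw score alone reorders the approximate entries relative
# to the published ranks.
for e in sorted(entries, key=lambda e: e.score, reverse=True):
    flag = "~" if e.approximate else " "
    print(f"{flag}{e.score:5.2f}%  {e.model}  ({e.org})")
```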

Key Metrics

Metric                    | Current    | Previous Quarter | Trend
--------------------------|------------|------------------|----------
OALL Models Submitted     | 700+       | 500+             | Growing
Organizations             | 180+       | 140+             | Growing
Top Score (34B class)     | 75.36%     | ~70%             | Improving
Arabic-English Gap (MMLU) | ~12 points | ~15 points       | Narrowing

Model Architecture Comparison

The performance data reveals architectural patterns that inform model selection decisions. Falcon-H1 Arabic’s hybrid Mamba-Transformer design achieves the highest scores at both 34B and 7B parameter counts, demonstrating that the state-space hybrid architecture provides efficiency advantages for Arabic text processing. The 7B variant’s 71.47 percent score, within roughly two points of the 70B-parameter Jais 2, supports the architectural thesis that SSM-transformer hybrids can substitute for pure parameter scaling.

Jais 2 at 70B parameters with 600 billion Arabic training tokens represents the maximum capability approach — the largest Arabic-first training dataset applied to the largest parameter count among Arabic LLMs. The model’s strength lies in dialect coverage (17 regional varieties plus Arabizi) rather than pure benchmark score, reflecting its design goal of serving 400 million Arabic speakers across diverse language varieties.

ALLaM 34B demonstrates competitive performance built on sovereign Saudi training data from 16 public entities, 300 Arabic books, and 400 subject matter experts. The model’s benchmark position reflects its focus on Saudi-specific applications and regulatory compliance rather than maximum benchmark optimization.

Performance by Task Category

Arabic LLM performance varies significantly across task categories, revealing capability patterns that aggregate scores conceal.

Factual Knowledge: Arabic-native models perform strongest on MENA-relevant factual knowledge, Arabic history, Islamic studies, and Arabic language questions. Performance gaps appear on specialized domains (advanced science, engineering, medicine) where Arabic training data is limited relative to English.

Reasoning: Mathematical and logical reasoning remains the weakest category for Arabic LLMs relative to English models. BALSAM’s 78 tasks include dedicated reasoning evaluation that quantifies this gap. Chain-of-thought prompting in Arabic improves reasoning scores but does not close the gap entirely.
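
In practice, Arabic chain-of-thought prompting amounts to appending an Arabic "think step by step" cue to the question. A minimal sketch follows, assuming a Hugging Face text-generation pipeline; the model ID is a placeholder, not an official repository name.

```python
# Minimal sketch of Arabic chain-of-thought prompting.
# The model ID is a placeholder assumption; substitute the Arabic
# instruction-tuned model you actually serve.
from transformers import pipeline

generator = pipeline("text-generation", model="your-org/arabic-instruct-7b")

# "If Fatima has 12 apples and gives a third of them to her brother,
# how many apples does she have left?"
question = "إذا كان لدى فاطمة ١٢ تفاحة وأعطت ثلثها لأخيها، فكم تفاحة بقيت لديها؟"

# "فكر خطوة بخطوة" is the Arabic equivalent of the English
# "think step by step" chain-of-thought cue.
prompt = f"{question}\nفكر خطوة بخطوة ثم اذكر الإجابة النهائية."

print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```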

Trustworthiness: AraTrust evaluation across eight dimensions shows that trustworthiness and accuracy are distinct capabilities. Models can answer Arabic questions correctly while generating culturally inappropriate content. GPT-4 scores highest on Arabic trustworthiness despite not being Arabic-native, suggesting that alignment methodology maturity matters more than Arabic-specific training for safety dimensions.

Generation Quality: Arabic-native models consistently outperform multilingual models on Arabic generation quality as rated by native speakers. AceGPT outperformed ChatGPT on the Vicuna-80 Arabic evaluation despite being dramatically smaller, confirming that Arabic-specific fine-tuning with cultural alignment produces more natural Arabic output than general multilingual training.

Parameter Efficiency Analysis

The OALL data enables parameter efficiency analysis that informs deployment decisions. Falcon-H1 Arabic 7B achieves 71.47 percent — 94.8 percent of the 34B model’s score at 20.6 percent of the parameters. This efficiency ratio makes the 7B model the optimal choice for organizations seeking production-grade Arabic AI at accessible compute costs. A single NVIDIA A100 (80GB) or consumer RTX 4090 can serve the 7B model with quantization, while the 34B model requires 2-4 A100 GPUs.
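
Single-GPU serving of the 7B model typically relies on 4-bit quantization. A minimal sketch using the transformers BitsAndBytesConfig API is below; the model ID is a placeholder assumption, and actual memory use also depends on sequence length and KV cache.

```python
# Sketch: loading a 7B Arabic model in 4-bit on a single GPU via bitsandbytes.
# The model ID is a placeholder, not an official repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/arabic-instruct-7b"  # placeholder assumption

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # 4-bit weights, bf16 compute
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)
# At 4 bits per weight, 7B parameters occupy roughly 4 GB, comfortably
# within a single RTX 4090 or A100 (80GB) before activation memory.
```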

Falcon-H1 Arabic 3B at 61.87 percent provides 82.1 percent of the 34B model’s performance at 8.8 percent of the parameters — suitable for edge deployment, mobile applications, and high-throughput serving where latency matters more than maximum quality.
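
The efficiency ratios quoted in the two paragraphs above follow directly from the leaderboard scores and parameter counts; a quick check:

```python
# Reproducing the parameter-efficiency ratios from the OALL scores above.
scores = {"34B": 75.36, "7B": 71.47, "3B": 61.87}  # OALL score, percent
params = {"34B": 34.0,  "7B": 7.0,   "3B": 3.0}    # parameters, billions

for size in ("7B", "3B"):
    score_ratio = scores[size] / scores["34B"] * 100
    param_ratio = params[size] / params["34B"] * 100
    print(f"{size}: {score_ratio:.1f}% of the 34B score "
          f"at {param_ratio:.1f}% of the parameters")
# -> 7B: 94.8% of the 34B score at 20.6% of the parameters
# -> 3B: 82.1% of the 34B score at 8.8% of the parameters
```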

Benchmark Evolution Tracking

The OALL’s evolution from version 1 to version 2 represents a significant methodological improvement that affects score comparability across quarters. Version 1 included machine-translated evaluation tasks that inflated scores for models with English-translation capabilities. Version 2 uses exclusively native Arabic benchmarks — ArabicMMLU, ALRAGE, AraTrust, and MadinahQA — providing more accurate measurement of genuine Arabic language capability.

Quarter-over-quarter comparisons should account for this methodological change. Models evaluated under OALL v2 may show nominally lower scores than their v1 evaluations despite improved actual capability, because the native Arabic benchmarks are more demanding than translated tasks. The narrowing Arabic-English gap (from approximately 15 points to approximately 12 points on MMLU) reflects genuine capability improvement rather than benchmark inflation.

Future Tracking Dimensions

Future dashboard updates will incorporate additional dimensions as evaluation infrastructure expands. Dialectal performance disaggregation (reporting per-dialect scores rather than aggregate Arabic scores) will provide more actionable deployment guidance. Inference efficiency metrics (performance per FLOP, per dollar of compute) will enable cost-benefit analysis. Multimodal evaluation scores will track Arabic visual-language model development. Agent-level evaluation will assess models’ ability to function as autonomous Arabic AI agents rather than single-turn question answerers.
