BALSAM provides the most comprehensive evaluation framework for Arabic AI, with 52,000 samples spanning 78 distinct tasks. The benchmark’s design addresses a critical weakness of public benchmarks: contamination. By maintaining private test sets, BALSAM prevents models from achieving inflated scores through memorization of benchmark data encountered during training.
The 78 tasks cover the full spectrum of Arabic language capabilities: reading comprehension, question answering, summarization, translation, sentiment analysis, named entity recognition, text classification, grammatical error detection, and many others. This breadth ensures that model evaluation captures diverse capabilities rather than rewarding optimization for a narrow task set.
Both closed-source and open-source models are evaluated, enabling fair comparison across the full Arabic AI landscape. The private test set methodology means that BALSAM scores are more trustworthy indicators of genuine model capability than public benchmarks where training data contamination is a documented concern.
Task Coverage and Design
BALSAM’s 78 tasks span the full spectrum of Arabic NLP capabilities. Reading comprehension tasks test whether models can extract specific information from Arabic passages, handle coreference resolution across pro-drop constructions (where Arabic omits explicit subjects), and draw inferences from culturally situated Arabic text. Question answering tasks evaluate knowledge retrieval across domains including science, history, geography, Islamic studies, and Arabic literature — testing both factual recall and reasoning ability.
Summarization tasks assess the ability to condense Arabic text while preserving key information, a capability complicated by Arabic’s morphological density (a single Arabic word often carries information that English distributes across multiple words). Translation tasks evaluate bidirectional Arabic-English translation quality, testing both fluency and accuracy in both directions. Sentiment analysis tasks measure the ability to detect sentiment in Arabic text, including dialectal text where sentiment markers differ from MSA conventions.
Named entity recognition tasks test the identification of persons, organizations, locations, and other entities in Arabic text — a task complicated by Arabic’s lack of capitalization (which English NER systems rely on as a primary signal), the ambiguity created by absent short vowels, and the diversity of transliteration conventions for foreign names in Arabic. Text classification tasks evaluate categorization accuracy across topics, genres, and registers.
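These ambiguities are why Arabic NER pipelines typically normalize text before comparing entity mentions. The sketch below is a minimal illustration of that idea, not BALSAM’s scoring code: it strips optional diacritics and unifies commonly interchanged letter forms so that mentions differing only in vowel marks compare as equal.

```python
import re

# Arabic diacritic marks (fathatan through sukun): U+064B-U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text: str) -> str:
    """Normalize Arabic text for entity matching: drop optional
    short-vowel marks and unify commonly interchanged letter forms."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> ya
    return text

def entities_match(predicted: str, gold: str) -> bool:
    """Diacritic-insensitive comparison of two entity mentions."""
    return normalize_arabic(predicted) == normalize_arabic(gold)

# The same name with and without vowel marks should match.
assert entities_match("مُحَمَّد", "محمد")
```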
Grammatical error detection and correction tasks test understanding of Arabic grammar at a level that requires genuine linguistic competence. Arabic grammatical rules — morphological agreement in gender, number, and case across verbs, subjects, and adjectives — create error patterns distinct from English grammatical errors. Models must internalize these rules to detect and correct errors accurately.
The Contamination Problem in Arabic AI
Benchmark contamination — models achieving inflated scores by memorizing benchmark data encountered during training — is a documented and growing concern in Arabic AI evaluation. The problem is particularly acute for Arabic because the total volume of Arabic evaluation data is much smaller than English evaluation data, making it statistically more likely that benchmark questions appear in training corpora assembled from web crawls.
Several factors amplify contamination risk for Arabic benchmarks. ArabicMMLU’s 14,575 questions, while substantial, are publicly available on Hugging Face and can be inadvertently included in training data through web crawls that index Hugging Face content. AraTrust’s 522 questions are similarly accessible. SILMA AI’s Arabic Broad Benchmark, comprising 470 questions from 64 publicly available Arabic datasets, draws from sources that training data pipelines routinely index.
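A common screen for this kind of exposure is verbatim n-gram overlap between benchmark items and training text. The sketch below illustrates the idea in a few lines; it is not BALSAM’s methodology, and the n-gram length and placeholder data are illustrative only.

```python
def word_ngrams(text: str, n: int):
    """Set of consecutive word n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items, training_docs, n=8):
    """Return benchmark items sharing any verbatim word n-gram with
    the training text. A crude screen: verbatim overlap is strong
    evidence of exposure, but paraphrased or translated leakage
    slips through undetected."""
    train = set()
    for doc in training_docs:
        train |= word_ngrams(doc, n)
    return [item for item in benchmark_items if word_ngrams(item, n) & train]

# Hypothetical usage with placeholder strings.
suspects = flag_contaminated(
    benchmark_items=["..."],   # e.g., public benchmark questions
    training_docs=["..."],     # e.g., a web-crawl shard
    n=8,
)
```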
BALSAM’s private test set methodology directly addresses this vulnerability. By maintaining test data that is not publicly accessible, BALSAM ensures that high scores reflect genuine model capability rather than data memorization. The benchmark’s design allows comparison between public and private test set performance on the same tasks: a model that scores markedly lower on BALSAM’s private evaluation than on an equivalent public benchmark most likely benefited from contamination on the public one.
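That comparison reduces to simple arithmetic over paired scores. A minimal sketch, with hypothetical model names, scores, and threshold:

```python
# Hypothetical public-vs-private scores on equivalent tasks
# (percentage accuracy; all values illustrative).
scores = {
    "model-a": {"public": 71.2, "private": 70.5},
    "model-b": {"public": 78.9, "private": 61.3},
}

for name, s in scores.items():
    gap = s["public"] - s["private"]
    # A large positive gap suggests the public score was inflated by
    # contamination; a small gap suggests capability that transfers
    # to unseen questions. The 5-point threshold is arbitrary.
    verdict = "likely contamination benefit" if gap > 5.0 else "consistent"
    print(f"{name}: gap = {gap:+.1f} points ({verdict})")
```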
Research has noted that many high-scoring models on the Open Arabic LLM Leaderboard achieve results through surface-level pattern recognition rather than true linguistic understanding. BALSAM provides the evaluation methodology needed to distinguish genuine capability from memorization, making it an essential complement to public benchmarks for organizations making deployment decisions based on evaluation scores.
Cross-Benchmark Validation
BALSAM’s value increases when results are compared against other Arabic evaluation frameworks. A model scoring consistently across ArabicMMLU (public), BALSAM (private), SILMA ABB (human-validated), and AraTrust (trustworthiness) demonstrates robust Arabic capability across evaluation methodologies. A model showing significant score variation across these benchmarks reveals capability patterns that any single benchmark would miss.
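One way to quantify that consistency is the spread of a model’s scores across benchmarks. A minimal sketch with hypothetical numbers, using the coefficient of variation as a rough consistency signal:

```python
from statistics import mean, pstdev

# Hypothetical scores (0-100) for one model across four benchmarks.
scores = {
    "ArabicMMLU": 72.4,   # public, knowledge
    "BALSAM": 68.9,       # private, contamination-resistant
    "SILMA ABB": 70.1,    # human-validated breadth
    "AraTrust": 74.3,     # trustworthiness
}

avg = mean(scores.values())
spread = pstdev(scores.values())
# Low coefficient of variation: capability holds up across evaluation
# methodologies. High values flag benchmark-specific behavior worth
# investigating (contamination being one candidate explanation).
cv = spread / avg
print(f"mean={avg:.1f}, stdev={spread:.1f}, cv={cv:.3f}")
```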
Falcon-H1 Arabic’s OALL-leading performance can be validated against BALSAM’s private evaluation to confirm that the hybrid Mamba-Transformer architecture’s advantages persist when contamination is controlled. Jais 2’s strong performance across public benchmarks can be stress-tested against BALSAM’s unseen tasks to verify that the 600B+ Arabic training tokens produce genuine knowledge rather than comprehensive coverage of public benchmark questions. ALLaM 34B’s sovereign data advantage — training on government content unavailable in public corpora — should manifest as strong BALSAM performance on tasks requiring institutional knowledge that publicly trained models lack.
Evaluation Infrastructure Maturity
BALSAM’s existence alongside ArabicMMLU, AraTrust, SILMA ABB, the OALL, and over 40 other Arabic benchmarks demonstrates the maturation of Arabic AI evaluation infrastructure. Five years ago, Arabic LLM evaluation relied primarily on machine-translated English benchmarks that failed to capture genuine Arabic capability. The current evaluation ecosystem — with native benchmarks covering knowledge, trustworthiness, cultural alignment, and contamination-resistant assessment — provides the multi-dimensional evaluation needed for responsible Arabic AI deployment.
The evaluation ecosystem now covers LLM performance, multimodality and vision, embedding quality, retrieval effectiveness, RAG generation accuracy, speech recognition, and OCR — spanning the full range of Arabic AI applications. Organizations deploying Arabic AI can evaluate candidate models across the specific capability dimensions relevant to their use case, using a benchmark battery that captures both capability and safety.
The 700+ models submitted to the OALL from more than 180 organizations confirm that this evaluation infrastructure serves a large and active community. As new Arabic AI models emerge — from Gulf state research institutions, academic collaborations, and international adaptations — BALSAM’s private test sets provide the contamination-resistant evaluation that maintains evaluation integrity as the community grows.
BALSAM’s Private Test Set Innovation
BALSAM’s most distinctive contribution to Arabic AI evaluation is the use of private test sets that prevent data contamination. Public benchmark datasets face a fundamental evaluation integrity challenge: once test questions become publicly available, they can appear in model training data — either intentionally or through web crawl inclusion. Models trained on contaminated data achieve artificially high scores that do not reflect genuine capability, misleading organizations that rely on benchmark scores for model selection.
BALSAM’s 52,000 samples across 78 tasks are divided into public development sets (for model tuning and validation) and private test sets (for official evaluation). Only the public sets are downloadable; private sets are evaluated through a submission system that prevents test data exposure. This architecture ensures that BALSAM scores reflect genuine model capability rather than memorization of previously encountered evaluation questions.
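BALSAM’s actual submission interface is not documented here, and the service may run submitted models directly rather than accept prediction files. The sketch below, assuming a hypothetical JSONL predictions layout, only illustrates the general shape of the workflow: tune against the public development set locally, then upload packaged predictions.

```python
import json

def write_predictions(examples, predict_fn, out_path):
    """Write one JSON object per line: {"id": ..., "prediction": ...}.
    The field names here are hypothetical, not BALSAM's actual schema."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {"id": ex["id"], "prediction": predict_fn(ex["input"])}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Iterate locally on the public development set; only packaged
# predictions ever leave the development environment.
dev_set = [{"id": "task01-0001", "input": "..."}]
write_predictions(dev_set, predict_fn=lambda text: "...", out_path="dev_predictions.jsonl")
```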
The contamination prevention design is particularly important for Arabic AI evaluation because the Arabic benchmark ecosystem is smaller than its English equivalent. A model that has memorized the entirety of ArabicMMLU’s 14,575 questions can achieve a near-perfect score on that benchmark while revealing nothing about its ability to generalize to novel Arabic questions. BALSAM’s private test sets close off this memorization shortcut, providing evaluation results that organizations can trust for deployment decisions.
BALSAM’s 78 Tasks and Arabic Capability Coverage
BALSAM’s 78 tasks cover Arabic language capability across multiple dimensions: reading comprehension, question answering, text classification, sentiment analysis, named entity recognition, machine translation, summarization, and reasoning. This breadth ensures that BALSAM scores reflect comprehensive Arabic AI capability rather than specialization on a narrow task subset.
The task design reflects Arabic-specific evaluation considerations. Reading comprehension tasks use authentic Arabic text rather than translated passages, testing understanding of Arabic discourse patterns and cultural references. Question answering tasks require knowledge of Arabic history, geography, culture, and current affairs. Text classification tasks evaluate performance on Arabic genre, topic, and sentiment categories. Named entity recognition tasks test entity extraction from Arabic text with its morphological complexity.
For the three leading Arabic LLMs — Jais 2 (70B parameters, 600B+ Arabic tokens), ALLaM 34B (sovereign Saudi training data), and Falcon-H1 Arabic (hybrid Mamba-Transformer, 75.36% OALL) — BALSAM provides evaluation rigor that public benchmarks cannot match. Models with extensively curated training data tend to show smaller score drops between BALSAM and public benchmarks, confirming that training data quality investments compound across evaluation scenarios.
The $858 million in MENA AI venture capital raised during 2025 reflects investor confidence in Arabic AI companies, confidence that standardized evaluation through BALSAM and other benchmarks helps justify. When startups can demonstrate their Arabic AI products’ performance on contamination-resistant benchmarks, investors gain assurance that claimed capabilities reflect genuine model quality rather than benchmark gaming.
BALSAM’s Position in the Arabic Evaluation Ecosystem
BALSAM’s position within the broader Arabic evaluation ecosystem is complementary rather than competitive. ArabicMMLU provides knowledge evaluation, AraTrust provides trustworthiness evaluation, SILMA AI’s ABB provides breadth evaluation, and BALSAM provides contamination-resistant evaluation. Organizations conducting comprehensive Arabic LLM assessment use multiple benchmarks, with BALSAM’s private test sets providing the integrity assurance that validates scores achieved on public benchmarks.
The relationship between public and private benchmark performance reveals model quality dimensions that individual benchmarks cannot capture. Models that achieve high scores on public benchmarks but significantly lower scores on BALSAM’s private tests demonstrate that their public benchmark performance reflects data contamination rather than genuine capability. Conversely, models with consistent performance across public and private evaluations demonstrate robust Arabic language capability that transfers to novel evaluation scenarios.
For the Arabic AI ecosystem — $858 million in MENA AI VC, 664 Saudi AI companies, three competing Arabic LLM platforms — BALSAM’s contamination resistance provides the evaluation integrity that commercial deployments require. Organizations cannot risk deploying Arabic AI systems whose benchmark performance does not predict production performance — BALSAM’s private test sets provide the assurance that public benchmarks alone cannot offer.
Task Design and Arabic Linguistic Coverage
BALSAM’s 78 tasks are organized to cover Arabic linguistic phenomena that generic multilingual benchmarks overlook. Tasks involving morphological analysis test whether models understand Arabic’s root-pattern system — evaluating whether a model correctly handles the distinction between “kataba” (he wrote) and “kutiba” (it was written) in context. Diacritization tasks evaluate whether models can restore missing vowel marks to ambiguous Arabic text. Dialectal understanding tasks test performance across MSA and regional varieties, exposing the dialect gap that aggregate scores conceal.
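Diacritization tasks of this kind are typically scored with a diacritic error rate: the fraction of base letters whose attached marks differ from the reference. A minimal sketch, not BALSAM’s scoring code, assuming reference and hypothesis share the same letter skeleton; the example reuses the kataba/kutiba pair from above.

```python
import re

# Arabic diacritic marks: U+064B-U+0652
DIACRITIC = re.compile(r"[\u064B-\u0652]")

def split_by_letter(text):
    """Pair each base letter with the diacritics that follow it."""
    pairs, current = [], None
    for ch in text:
        if DIACRITIC.match(ch):
            if current is not None:
                current[1] += ch
        else:
            current = [ch, ""]
            pairs.append(current)
    return pairs

def diacritic_error_rate(reference, hypothesis):
    """Fraction of letters whose diacritics differ from the reference."""
    ref, hyp = split_by_letter(reference), split_by_letter(hypothesis)
    assert [r[0] for r in ref] == [h[0] for h in hyp], "skeletons must match"
    errors = sum(r[1] != h[1] for r, h in zip(ref, hyp))
    return errors / len(ref)

# "kutiba" (it was written) vs. "kataba" (he wrote): same letters,
# different diacritics, different meaning.
print(diacritic_error_rate("كُتِبَ", "كَتَبَ"))  # 2 of 3 letters differ
```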
The 52,000 samples provide statistical power for fine-grained performance analysis. Rather than reporting a single aggregate score, BALSAM enables per-task, per-domain, and per-difficulty breakdowns that reveal specific Arabic capabilities and weaknesses. A model might excel at Arabic sentiment analysis while struggling with Arabic grammatical error detection — insight that BALSAM’s task granularity exposes but aggregate benchmarks like ArabicMMLU cannot provide.
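A per-task breakdown of this kind is straightforward to compute from per-example results. A minimal sketch with hypothetical task names and outcomes, adding a simple bootstrap confidence interval so that small per-task samples are not over-interpreted:

```python
import random
from collections import defaultdict

def per_task_report(results, n_boot=1000, seed=0):
    """results: list of (task_name, correct: bool) per evaluated example.
    Returns per-task accuracy with a bootstrap 95% confidence interval."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, correct in results:
        by_task[task].append(correct)
    report = {}
    for task, outcomes in by_task.items():
        acc = sum(outcomes) / len(outcomes)
        boots = sorted(
            sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
            for _ in range(n_boot)
        )
        lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
        report[task] = (acc, lo, hi)
    return report

# Hypothetical: strong on sentiment, weaker on error detection.
results = [("sentiment", True)] * 90 + [("sentiment", False)] * 10 \
        + [("grammar-errors", True)] * 55 + [("grammar-errors", False)] * 45
for task, (acc, lo, hi) in per_task_report(results).items():
    print(f"{task}: {acc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```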
Al-Matham et al. (2025) designed BALSAM’s private test methodology to address the contamination problem systematically. Each of the 78 tasks maintains a public development set for model development and a private test set for official evaluation. The public sets enable researchers to develop and debug their approaches. The private sets — never released publicly — ensure that official scores reflect genuine model capability rather than training data memorization. This public-private split provides both usability (researchers can iterate on their approaches) and integrity (official scores resist gaming).
Private Test Set Methodology
BALSAM’s private test set methodology addresses a growing concern across AI evaluation: benchmark contamination. As training corpora expand to include ever-larger portions of publicly available text, the probability that benchmark questions appear in training data increases. Models that have seen evaluation questions during training achieve artificially inflated scores that do not predict real-world performance. BALSAM’s response, maintaining test sets that are never publicly released, is among the most reliable defenses against contamination available to Arabic AI evaluation. The dual public-private approach (public development sets for research, private test sets for official scoring) balances accessibility with integrity.
Related Coverage
- Arabic AI Benchmarks — Full benchmark coverage
- Arabic LLMs — Model performance context
- OALL Benchmark Analysis — Leaderboard methodology
- ArabicMMLU Results — Public benchmark complement
- AraTrust Evaluation — Trustworthiness assessment
- SILMA Arabic Broad Benchmark — Human-validated evaluation
- Arabic AI Datasets — Training and evaluation data
- Arabic LLM Training Data — Training corpus comparison