The Arabic Broad Benchmark (ABB) from SILMA.AI combines human-validated questions with a hybrid rule-based and LLM-as-Judge scoring methodology. The benchmark comprises 470 high-quality questions sampled from 64 Arabic benchmarking datasets, covering 22 skill categories. The sampling methodology ensures representation across the breadth of Arabic AI capabilities while maintaining a manageable evaluation size.
The benchmarking script scores each response using a mix of 20+ manual rules and LLM-as-Judge variations customized for the specific skill being assessed. This hybrid evaluation approach combines the reliability of rule-based assessment for objective tasks with the nuanced judgment that LLM-based evaluation provides for subjective quality dimensions like fluency, cultural appropriateness, and response helpfulness.
SILMA.AI has also published a comprehensive community article cataloging all benchmarks and leaderboards within the Arabic AI ecosystem, providing a centralized resource covering LLM performance, multimodality, vision, embedding, retrieval, RAG generation, STT (speech-to-text), and OCR evaluation frameworks. The catalog documents over 40 distinct Arabic benchmarks, confirming the evaluation ecosystem’s rapid maturation from a handful of translated assessments to comprehensive native Arabic evaluation infrastructure.
22-Category Evaluation Framework
The Arabic Broad Benchmark’s 22 categories span the breadth of Arabic AI capabilities. Language understanding categories evaluate comprehension of Modern Standard Arabic (MSA), classical Arabic, and dialectal text. Knowledge categories test factual recall across Islamic studies, Arab history, Arabic literature, geography, and science — domains where Arabic-specific knowledge cannot be transferred from English training data. Reasoning categories assess logical inference, mathematical problem-solving, and analytical thinking presented in Arabic.
Generation categories evaluate the quality of Arabic text production across registers — formal MSA document generation, conversational dialectal responses, and creative Arabic writing. Cultural categories assess understanding of Arabic social norms, religious concepts, and regional customs. Safety categories evaluate trustworthiness dimensions parallel to AraTrust’s framework — truthfulness, ethics, and offensive content detection in Arabic contexts.
The 470 questions, sampled from 64 distinct Arabic datasets, provide breadth that any single source cannot match. The sampling methodology ensures representation across the full capability spectrum while maintaining a manageable evaluation size that enables rapid assessment of new models. Each question has been human-validated by Arabic-speaking evaluators, confirming that the questions are linguistically correct, culturally appropriate, and have unambiguous correct answers.
LLM-as-Judge Methodology
SILMA ABB’s hybrid evaluation approach — combining manual rules with LLM-as-Judge variations — addresses the limitations of both purely automated and purely human evaluation. For objective tasks with clear correct answers (factual questions, mathematical problems), manual rules provide reliable, reproducible scoring. For subjective quality dimensions (fluency, cultural appropriateness, response helpfulness), LLM-as-Judge provides nuanced evaluation that captures quality distinctions that simple rules miss.
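To make the division of labor concrete, here is a minimal sketch of how such a hybrid scorer might dispatch between the two evaluation modes. The category names, item fields, and normalization step are illustrative assumptions, not SILMA's actual implementation.

```python
import re

# Assumed category names for illustration; ABB's real taxonomy differs.
OBJECTIVE_SKILLS = {"math", "factual_qa", "multiple_choice"}

# Arabic diacritics (tashkeel); stripping them avoids penalizing
# superficial orthographic variation in otherwise correct answers.
TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652]")

def normalize(text: str) -> str:
    return TASHKEEL.sub("", text.strip())

def score_by_rule(item: dict, answer: str) -> float:
    """Deterministic scoring: normalized gold answer must appear in the output."""
    return 1.0 if normalize(item["gold_answer"]) in normalize(answer) else 0.0

def score_by_judge(item: dict, answer: str, judge) -> float:
    """Subjective scoring: ask a judge model for a 0-10 rating, scaled to 0-1."""
    prompt = (
        "Rate the following Arabic answer from 0 to 10 for correctness, "
        "fluency, and cultural appropriateness. Reply with a number only.\n"
        f"Question: {item['question']}\nAnswer: {answer}"
    )
    return float(judge(prompt)) / 10.0

def score(item: dict, answer: str, judge) -> float:
    """Route objective skills to rules and subjective skills to the judge."""
    if item["skill"] in OBJECTIVE_SKILLS:
        return score_by_rule(item, answer)
    return score_by_judge(item, answer, judge)
```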
The LLM-as-Judge variations customized for each specific skill enable skill-appropriate evaluation criteria. A fluency evaluation uses different criteria than a factual accuracy evaluation, which differs from a cultural sensitivity evaluation. By tailoring the judge model’s evaluation prompt to each skill category, SILMA ABB achieves evaluation specificity that generic evaluation prompts cannot match.
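A plausible way to implement this tailoring is a table of per-skill judge prompts. The skill names and criteria below are assumptions for illustration, not ABB's actual prompt text.

```python
# Hypothetical per-skill judge prompts: each skill carries its own criteria.
JUDGE_PROMPTS = {
    "fluency": (
        "Score 0-10 how naturally this Arabic text reads. Penalize "
        "translation artifacts and register mixing.\nText: {answer}"
    ),
    "factual_accuracy": (
        "Score 0-10 how well the answer matches the reference.\n"
        "Reference: {gold}\nAnswer: {answer}"
    ),
    "cultural_sensitivity": (
        "Score 0-10 whether the answer respects the cultural and religious "
        "norms relevant to the question.\n"
        "Question: {question}\nAnswer: {answer}"
    ),
}

def build_judge_prompt(skill: str, **fields: str) -> str:
    """Select the skill-specific template and fill in the item's fields."""
    return JUDGE_PROMPTS[skill].format(**fields)
```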
This methodology introduces a dependency on the judge model’s Arabic capability. If the LLM judge has weak Arabic understanding, its evaluations may be unreliable — particularly for dialectal text, culturally nuanced content, and domain-specific Arabic knowledge. SILMA ABB mitigates this risk through the 20+ manual rules that provide ground-truth anchoring independent of the judge model’s quality.
SILMA Model Ecosystem
SILMA.AI develops adapted Arabic models alongside its benchmarking work, creating a feedback loop between model development and evaluation. SILMA’s models use continued pretraining and fine-tuning approaches to extend existing architectures with Arabic capability — positioning them in the adapted model category alongside AceGPT, distinct from Arabic-native models like Jais, ALLaM, and Falcon Arabic.
The distinction between adapted and native Arabic models is methodologically significant for benchmarking. Adapted models inherit their base model’s strengths (reasoning capability, world knowledge from English training) while adding Arabic competence through additional training. Native models build Arabic capability from the ground up, optimizing tokenization, attention patterns, and knowledge representation for Arabic. SILMA ABB evaluates both categories on the same questions, enabling direct comparison that reveals where adapted models’ inherited capabilities compensate for their Arabic-specific limitations and where native models’ purpose-built designs provide measurable advantages.
Cross-Benchmark Ecosystem Integration
SILMA ABB occupies a specific niche within the Arabic evaluation ecosystem. ArabicMMLU provides deep evaluation of academic knowledge through 14,575 questions. AraTrust focuses specifically on trustworthiness across eight safety dimensions. BALSAM provides contamination-resistant evaluation through 78 tasks with private test sets. The Open Arabic LLM Leaderboard (OALL) aggregates multiple benchmarks into composite scores for leaderboard ranking. SILMA ABB complements these through its breadth (22 categories), human validation quality, and hybrid evaluation methodology.
The integration pattern for comprehensive Arabic model evaluation uses all available benchmarks: OALL for competitive ranking, ArabicMMLU for academic knowledge depth, AraTrust for safety assessment, BALSAM for contamination-resistant validation, and SILMA ABB for broad skill coverage with human-validated quality. Organizations deploying Arabic AI should evaluate candidate models across this full benchmark battery, weighting results according to their application’s specific capability and safety requirements.
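As a sketch of that weighting step, the snippet below computes an application-specific composite from the battery's scores. The scores and weights are placeholders for illustration, not reported results.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, normalized by total weight."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Placeholder scores for a hypothetical candidate model (not real results).
model = {"OALL": 0.62, "ArabicMMLU": 0.58, "AraTrust": 0.91,
         "BALSAM": 0.55, "SILMA_ABB": 0.60}

# A safety-critical deployment might up-weight AraTrust and SILMA ABB.
safety_weights = {"OALL": 1.0, "ArabicMMLU": 1.0, "AraTrust": 3.0,
                  "BALSAM": 1.0, "SILMA_ABB": 2.0}

print(f"{composite_score(model, safety_weights):.3f}")
```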
The ecosystem’s coverage extends beyond text-based LLM evaluation. Arabic benchmarks now exist for multimodality (Arabic image understanding), vision (Arabic OCR evaluation), embedding (Arabic MTEB), retrieval (Arabic information retrieval benchmarks), RAG generation (ALRAGE), speech (Open Universal Arabic ASR Leaderboard), and document understanding. SILMA.AI’s comprehensive catalog of these benchmarks provides a centralized reference for organizations navigating the expanding Arabic evaluation landscape.
Implications for Model Selection
SILMA ABB’s 22-category scores enable granular model selection based on specific capability requirements. An organization deploying Arabic AI for customer service should prioritize categories evaluating conversational quality, dialect understanding, and response helpfulness. An organization deploying Arabic AI for document analysis should prioritize comprehension, summarization, and factual accuracy categories. An organization deploying Arabic AI for educational applications should prioritize knowledge, reasoning, and Arabic language understanding categories.
Models showing even performance across all 22 SILMA ABB categories demonstrate broad Arabic capability suitable for general-purpose deployment. Models showing peaked performance on specific categories with weakness elsewhere are better suited for specialized applications aligned with their strengths. This diagnostic capability — identifying not just overall quality but specific strengths and weaknesses — makes SILMA ABB a valuable tool for deployment planning alongside the more widely referenced OALL composite scores.
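One simple way to operationalize the even-versus-peaked distinction is to look at the spread of a model's category scores. The threshold below is an arbitrary illustration rather than an ABB convention.

```python
from statistics import mean, stdev

def capability_profile(category_scores: dict[str, float],
                       spread_threshold: float = 0.15) -> tuple[str, list, list]:
    """Classify a model as generalist or specialist from its score spread."""
    values = list(category_scores.values())
    mu, sigma = mean(values), stdev(values)
    kind = "generalist" if sigma < spread_threshold else "specialist"
    # Categories more than one standard deviation from the mean flag
    # the model's specific strengths and weaknesses.
    strengths = [c for c, s in category_scores.items() if s >= mu + sigma]
    weaknesses = [c for c, s in category_scores.items() if s <= mu - sigma]
    return kind, strengths, weaknesses
```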
SILMA AI’s Evaluation Methodology
SILMA AI’s Arabic Broad Benchmark distinguishes itself through its evaluation methodology as much as its question content. The 22 evaluated skills are assessed using a mix of more than 20 manual rules and LLM-as-Judge variations customized per skill, creating a multi-perspective evaluation that captures capability nuances that single-metric evaluation misses.
The manual rules provide deterministic evaluation criteria: specific patterns in model output that indicate correct or incorrect handling of Arabic linguistic features. An Arabic grammar evaluation rule might check whether the model correctly applies verb-subject gender agreement in Arabic’s default VSO word order. A cultural knowledge rule might verify that the model provides accurate information about Islamic practices or Arab cultural conventions. These deterministic rules provide reproducible evaluation that does not vary across evaluation runs.
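As a concrete, hypothetical instance of such a rule (the item fields and Arabic forms are illustrative, not drawn from ABB): for a grammar item testing gender agreement in a VSO sentence, a rule can accept only answers containing the agreeing verb form.

```python
def vso_agreement_rule(item: dict, answer: str) -> bool:
    """Pass only if the answer uses an agreeing verb form and avoids the
    non-agreeing one. Example: for a feminine subject such as 'البنت',
    item["accepted_forms"] = ["ذهبت"] (feminine verb) and
    item["rejected_forms"] = ["ذهب"] (masculine, non-agreeing)."""
    has_gold = any(form in answer for form in item["accepted_forms"])
    has_bad = any(form in answer for form in item["rejected_forms"])
    return has_gold and not has_bad
```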
The LLM-as-Judge component complements manual rules with evaluation perspectives that cannot be reduced to deterministic patterns. Language quality assessment — whether generated Arabic reads naturally or exhibits translation artifacts — requires the kind of holistic judgment that LLM evaluators provide. Cultural appropriateness assessment — whether model outputs align with Arabic cultural norms — similarly benefits from the contextual understanding that LLM judges bring to evaluation.
The combination of manual rules and LLM judges provides evaluation robustness that either approach alone cannot achieve. Manual rules prevent the inconsistency that plagues pure LLM-as-Judge evaluation. LLM judges capture the nuanced quality dimensions that manual rules cannot encode. This hybrid evaluation methodology — applied across 470 human-validated questions from 64 Arabic datasets — produces the most comprehensive single-benchmark assessment of Arabic AI capability currently available.
ABB’s 22 Skill Categories
The 22 skill categories in SILMA AI’s Arabic Broad Benchmark span the full range of Arabic language tasks: reading comprehension, question answering, summarization, text generation, translation, grammar, morphology, sentiment analysis, named entity recognition, topic classification, dialogue generation, cultural knowledge, mathematical reasoning, logical inference, code generation, and more. This breadth tests models across diverse Arabic capability dimensions rather than optimizing for a narrow task set.
The 64 source datasets underlying the 470 evaluation questions provide dataset diversity that prevents models from achieving high scores through specialization on a single data distribution. Questions drawn from educational exams, news articles, academic papers, social media text, and cultural content test Arabic capability across the register spectrum — from formal MSA to informal dialectal varieties. This register diversity is essential for evaluating models intended for deployment across diverse Arabic use cases.
For the Arabic AI ecosystem — including the $858 million in MENA AI VC, 664 Saudi AI companies, and the three competing Arabic LLM platforms — ABB provides evaluation infrastructure that enables meaningful model comparison across the breadth of Arabic language capability. Model developers use ABB results to identify specific capability gaps (e.g., weak mathematical reasoning despite strong reading comprehension) and direct training improvements accordingly.
ABB’s Contribution to Arabic AI Standardization
SILMA AI’s Arabic Broad Benchmark contributes to the standardization of Arabic AI evaluation by establishing comprehensive skill coverage that individual benchmarks cannot achieve. While ArabicMMLU focuses on knowledge, AraTrust on trustworthiness, and BALSAM on contamination resistance, ABB evaluates across 22 skill categories that collectively capture the breadth of Arabic language capability needed for production deployment.
The 64 source datasets underlying ABB’s 470 questions provide dataset diversity that prevents models from gaming evaluation through specialization. Each dataset contributes questions from a different Arabic text source, register, or domain — ensuring that high ABB scores reflect genuine Arabic capability across diverse contexts rather than optimization for a specific data distribution.
ABB’s LLM-as-Judge evaluation methodology contributes innovation beyond the Arabic AI context. Combining the 20+ manual rules with skill-specific LLM-as-Judge variations creates an evaluation approach that balances deterministic reproducibility with nuanced quality assessment. This methodology, applicable to any language’s AI evaluation, demonstrates that Arabic AI research contributes evaluation innovation alongside model development to the global AI community.
For the MENA AI market, ABB provides the comprehensive skill assessment that enterprise customers need for deployment decisions. A model scoring well on knowledge benchmarks but poorly on ABB’s dialogue generation or code-related categories may not suit customer service or developer tool applications. ABB’s breadth enables application-specific model selection that narrower benchmarks cannot support.
22 Skill Categories in Detail
ABB’s 22 categories span the complete Arabic AI capability spectrum. Language understanding categories evaluate reading comprehension, text entailment, and semantic similarity for Arabic text. Knowledge categories evaluate factual recall across Arabic-relevant domains including Islamic studies, Arab history, and science. Reasoning categories evaluate logical inference, mathematical computation, and causal reasoning in Arabic. Generation categories evaluate text production quality including summarization, paraphrase, dialogue, and creative writing. Safety categories evaluate harmful content detection, bias identification, and cultural sensitivity — dimensions where cultural norms vary across 22 Arabic-speaking countries.
The 64 source datasets underlying the 470 evaluation questions ensure that no single data source dominates the evaluation, preventing models from gaming through familiarity with a specific dataset’s distribution. SILMA.AI’s human validation of all 470 questions adds quality assurance that distinguishes ABB from automatically generated benchmarks where question quality is inconsistent.
ABB in Enterprise Model Selection
For enterprise customers evaluating Arabic LLMs, ABB’s 22-category granularity enables application-specific model selection that aggregate benchmarks cannot support. A customer service deployment weights dialogue generation, cultural sensitivity, and Arabic language understanding categories. A document processing deployment weights reading comprehension, summarization, and knowledge recall categories. A financial analysis deployment weights reasoning, mathematical computation, and Arabic terminology categories. ABB’s category-level scores enable these weighted assessments that match model strengths to application requirements.
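A minimal sketch of that weighted assessment, assuming illustrative category names and weights: score each candidate with application-specific weights over its ABB category results, then pick the best.

```python
# Hypothetical weights for a customer service deployment; categories
# not listed here default to a weight of 1.0.
CUSTOMER_SERVICE_WEIGHTS = {
    "dialogue_generation": 3.0,
    "cultural_sensitivity": 2.0,
    "language_understanding": 2.0,
}

def application_score(abb_scores: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted average over all 22 category scores."""
    weighted = total = 0.0
    for category, score in abb_scores.items():
        w = weights.get(category, 1.0)
        weighted += score * w
        total += w
    return weighted / total

def pick_model(candidates: dict[str, dict[str, float]],
               weights: dict[str, float]) -> str:
    """Return the candidate whose weighted ABB profile scores highest."""
    return max(candidates,
               key=lambda name: application_score(candidates[name], weights))
```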
Related Coverage
- Arabic AI Benchmarks — Full benchmark coverage
- Arabic LLMs — Model performance context
- OALL Benchmark Analysis — Leaderboard methodology
- ArabicMMLU Results — Academic knowledge evaluation
- AraTrust Evaluation — Trustworthiness assessment
- BALSAM Benchmark — Contamination-resistant evaluation
- Arabic AI Datasets — Training and evaluation data
- AceGPT Cultural Alignment — Adapted model methodology