AraTrust establishes the first systematic framework for evaluating the trustworthiness of Arabic language models across dimensions that matter for responsible deployment. Published at COLING 2025 in Abu Dhabi, the benchmark comprises 522 human-written multiple-choice questions addressing truthfulness, ethics, privacy, illegal activities, mental health, physical health, unfairness, and offensive language.
The results reveal concerning patterns. GPT-4 achieved the highest trustworthiness score, demonstrating that proprietary models with extensive safety training maintain their advantage in Arabic trustworthiness despite weaker Arabic language performance. Among open-source Arabic models, AceGPT 7B and Jais 13B scored below 60 percent, a finding that underscores the gap between linguistic competence and trustworthy behavior in Arabic AI.
AraTrust’s significance extends beyond individual model evaluation. The benchmark provides a framework for systematic trustworthiness improvement through targeted training. By identifying specific dimensions where Arabic models score poorly — privacy awareness, mental health sensitivity, unfairness detection — developers can construct focused training datasets that address these gaps without compromising overall language performance.
Eight Trustworthiness Dimensions
Truthfulness evaluates whether models generate factually accurate information and avoid fabricating content. For Arabic LLMs, truthfulness testing includes Arabic-specific knowledge domains — Islamic history, Arabic literary traditions, regional geography — where training data quality directly determines factual accuracy. Models with smaller Arabic training corpora are more likely to confabulate Arabic-specific facts, generating plausible-sounding but incorrect information about Arabic cultural or historical topics.
Ethics assesses model behavior when confronted with ethically complex scenarios. The ethical dimension for Arabic AI carries cultural specificity: ethical norms in Arabic-speaking societies reflect Islamic jurisprudence, regional legal frameworks, and social customs that differ from the predominantly Western ethical frameworks encoded in multilingual models. A model trained primarily on English ethical reasoning may apply inappropriate ethical frameworks to Arabic-context dilemmas.
Privacy tests whether models respect privacy boundaries in Arabic-language interactions. Saudi Arabia’s Personal Data Protection Law (PDPL), UAE data governance regulations, and similar frameworks across the Gulf establish legal requirements for privacy protection that Arabic AI systems must respect. Models that inadvertently reveal training data containing personal information, or that assist users in privacy-violating activities, score poorly on this dimension.
Illegal activities evaluates whether models refuse requests to assist with illegal actions under Arabic-speaking countries’ legal systems. The diversity of legal systems across 22 Arabic-speaking countries — from Saudi Arabia’s Sharia-based legal framework to Egypt’s civil law system to Lebanon’s confessional legal structure — makes this dimension particularly complex. A model must navigate legal variation without assuming a single legal framework applies universally.
Mental health and physical health dimensions assess model safety when users express distress or seek health information. Arabic stigma around mental health creates situations where users may approach AI systems with concerns they are unwilling to discuss with human professionals. Model behavior in these interactions — providing appropriate resources, avoiding harmful advice, recognizing crisis indicators — carries safety implications that AraTrust explicitly evaluates.
Unfairness tests for bias against protected groups, including gender, religious, ethnic, and national origin bias. Arabic social contexts include specific bias risks — sectarian bias, tribal prejudice, nationality-based discrimination — that differ from the bias categories typically evaluated in English-language safety benchmarks.
Offensive language evaluates whether models generate content that is offensive in Arabic cultural contexts. What constitutes offensive language varies across Arabic-speaking societies — content acceptable in one cultural context may be deeply offensive in another. AraTrust’s evaluation covers this cultural variation, testing whether models can navigate the sensitivity spectrum across Arabic-speaking societies.
Model Performance Patterns
GPT-4’s top AraTrust ranking demonstrates that massive investment in safety training — OpenAI’s RLHF and red-teaming processes — transfers across languages. Despite GPT-4’s Arabic training data representing a fraction of its English training, the model’s safety behaviors generalize to Arabic contexts sufficiently to outperform Arabic-first models on trustworthiness.
The sub-60 percent scores of AceGPT 7B and Jais 13B on AraTrust highlight a systematic challenge: open-source Arabic models have invested heavily in language capability while underinvesting in safety and alignment. AceGPT’s RLAIF methodology pioneered cultural alignment for Arabic, but the reward model’s training data may not have sufficiently covered safety-critical scenarios. Jais 13B, as an earlier release in the Jais series, lacked the comprehensive safety framework that Jais 2 subsequently developed with input from Arabic-speaking experts across multiple countries.
Larger model sizes generally correlate with better AraTrust scores, reflecting the additional capacity available for encoding safety knowledge alongside language capability. The 70B Jais 2 and 34B ALLaM — both developed with explicit safety frameworks — are expected to score significantly higher than their smaller predecessors, though comprehensive AraTrust evaluations of these newer models await publication.
Integration with OALL v2 and Broader Evaluation
AraTrust serves as one of four OALL v2 benchmarks alongside ArabicMMLU, ALRAGE, and MadinahQA. Its inclusion in the composite OALL score means that models cannot achieve top leaderboard positions without demonstrating trustworthiness — a design decision that incentivizes safety investment across the Arabic AI community.
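To make the composite-score design concrete, here is a minimal sketch of how a leaderboard score spanning the four OALL v2 benchmarks might be aggregated. The actual OALL weighting scheme is not specified in this article, so the unweighted mean below is an assumption, and the example scores are hypothetical.

```python
# Sketch of a composite leaderboard score in the spirit of OALL v2.
# ASSUMPTION: an unweighted mean over the four benchmarks (0-100 scale);
# the real OALL aggregation may weight benchmarks differently.

def oall_v2_composite(scores: dict[str, float]) -> float:
    """Average a model's scores across the four OALL v2 benchmarks."""
    benchmarks = ("ArabicMMLU", "AraTrust", "ALRAGE", "MadinahQA")
    missing = [b for b in benchmarks if b not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[b] for b in benchmarks) / len(benchmarks)

# Hypothetical model scores: a weak AraTrust result drags down the
# composite, so trustworthiness cannot be ignored.
model_scores = {"ArabicMMLU": 72.4, "AraTrust": 58.0,
                "ALRAGE": 65.1, "MadinahQA": 60.3}
print(round(oall_v2_composite(model_scores), 2))  # → 63.95
```

Because the composite averages all four benchmarks, roughly a quarter of a model's leaderboard standing under this scheme depends directly on its AraTrust result.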
The interaction between AraTrust and ArabicMMLU scores provides diagnostic insight. Models with high ArabicMMLU and low AraTrust scores are knowledgeable but potentially dangerous — capable of providing accurate Arabic information while lacking the safety boundaries that prevent harmful applications. Models with high AraTrust and moderate ArabicMMLU scores are safe but limited — useful for low-risk applications but insufficient for knowledge-intensive tasks.
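The diagnostic reading of the two scores can be sketched as a simple quadrant classification. The 60 and 70 percent cut-offs below are illustrative assumptions, not thresholds defined by either benchmark.

```python
# Sketch of the knowledge-vs-trustworthiness diagnostic described above.
# ASSUMPTION: the 60/70 cut-offs are illustrative, not benchmark-defined.

def diagnose(arabic_mmlu: float, aratrust: float,
             knowledge_cut: float = 60.0, trust_cut: float = 70.0) -> str:
    """Classify a model by its ArabicMMLU and AraTrust scores."""
    knowledgeable = arabic_mmlu >= knowledge_cut
    trustworthy = aratrust >= trust_cut
    if knowledgeable and trustworthy:
        return "capable and safe"
    if knowledgeable:
        return "knowledgeable but potentially dangerous"
    if trustworthy:
        return "safe but limited"
    return "needs improvement on both axes"

print(diagnose(75.0, 55.0))  # → knowledgeable but potentially dangerous
print(diagnose(52.0, 82.0))  # → safe but limited
```

The quadrant labels mirror the article's two failure modes: high knowledge with low trustworthiness, and the reverse.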
BALSAM’s private test sets complement AraTrust by preventing safety benchmark contamination. Models that achieve high AraTrust scores by memorizing specific safety questions during training — rather than internalizing genuine safety reasoning — would score lower on BALSAM’s private safety evaluation tasks. This cross-validation between public and private benchmarks strengthens confidence in safety evaluations.
Implications for Arabic AI Regulation
AraTrust’s systematic evaluation framework has implications for emerging Arabic AI regulation. Saudi Arabia’s SDAIA has established AI governance frameworks, and the UAE’s AI governance guidelines recommend transparency and accountability in AI systems. AraTrust provides a concrete evaluation methodology that regulators can reference when establishing trustworthiness requirements for Arabic AI deployment.
Organizations deploying Arabic AI in customer-facing, healthcare, government, or financial services applications should evaluate their models against AraTrust’s eight dimensions before production deployment. A model scoring below 70 percent on any individual AraTrust dimension merits additional safety mitigation — guardrails, output filtering, human review — before deployment in that dimension’s risk category.
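The 70 percent rule of thumb above translates directly into a pre-deployment gate. A minimal sketch follows; the dimension names match AraTrust's eight dimensions, while the example scores are hypothetical.

```python
# Minimal pre-deployment gate based on the 70 percent rule of thumb above.
# Dimension names follow AraTrust; the example scores are hypothetical.

ARATRUST_DIMENSIONS = (
    "truthfulness", "ethics", "privacy", "illegal_activities",
    "mental_health", "physical_health", "unfairness", "offensive_language",
)

def dimensions_needing_mitigation(scores: dict[str, float],
                                  threshold: float = 70.0) -> list[str]:
    """Return the AraTrust dimensions scoring below the mitigation threshold.

    Missing dimensions are treated as unevaluated (score 0), so they are
    flagged rather than silently passed.
    """
    return [d for d in ARATRUST_DIMENSIONS if scores.get(d, 0.0) < threshold]

example = {"truthfulness": 81.0, "ethics": 74.5, "privacy": 62.0,
           "illegal_activities": 88.0, "mental_health": 66.0,
           "physical_health": 79.0, "unfairness": 71.0,
           "offensive_language": 69.5}
print(dimensions_needing_mitigation(example))
# → ['privacy', 'mental_health', 'offensive_language']
```

Each flagged dimension would then map to the mitigations the article lists: guardrails, output filtering, or human review for that risk category.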
The benchmark’s publication at COLING 2025 in Abu Dhabi positions it within the Arabic NLP research community’s institutional framework, ensuring continued development, validation, and expansion. Future AraTrust versions may add evaluation dimensions specific to emerging Arabic AI applications — deepfake detection, synthetic content identification, and multi-modal safety — as the Arabic AI capability frontier expands.
AraTrust’s Eight Trustworthiness Dimensions
AraTrust’s evaluation framework defines trustworthiness through eight specific dimensions, each addressed by human-written evaluation questions designed to probe model behavior on sensitive Arabic content.
Truthfulness evaluates whether models provide factually accurate information in Arabic, detecting hallucination patterns where models generate plausible but incorrect Arabic text. Ethics assesses whether model outputs align with ethical principles relevant to Arabic-speaking societies — including principles derived from Islamic ethics that Western AI evaluation frameworks do not address. Privacy tests whether models respect personal data boundaries in Arabic conversations, refusing to generate or reveal personal information when prompted. Illegal activities evaluates whether models refuse to provide information that could facilitate illegal activities under Arabic legal systems.
Mental health assesses model behavior when users express psychological distress in Arabic, testing whether models provide appropriate supportive responses rather than harmful advice. Physical health evaluates health-related advice in Arabic medical contexts, where incorrect health information could lead to physical harm. Unfairness tests for bias in model outputs across Arabic demographic categories — including gender, nationality, sect, and social class dimensions specific to Arabic-speaking societies. Offensive language evaluates whether models generate or respond appropriately to offensive content in Arabic, including culturally offensive material that Western toxicity classifiers would not flag.
AraTrust Results and Model Comparative Analysis
AraTrust evaluation revealed significant performance variation across Arabic LLMs. GPT-4 achieved the highest overall trustworthiness score, demonstrating that massive investment in safety and alignment can achieve results that smaller Arabic-specific models have not yet matched. However, GPT-4’s trustworthiness advantage must be weighed against its limitations for Arabic deployment: proprietary access only, limited dialect support, cultural alignment derived from English-language safety training rather than Arabic-specific evaluation.
Among open-source Arabic models, trustworthiness performance varied dramatically. AceGPT 7B and earlier Jais 13B scored below 60 percent on AraTrust, revealing that accuracy and trustworthiness are distinct capabilities that do not necessarily co-occur. A model can answer Arabic knowledge questions accurately while generating culturally inappropriate, ethically problematic, or privacy-violating content. This finding has influenced subsequent model development: Jais 2’s comprehensive safety framework, ALLaM 34B’s expert-validated cultural alignment, and Falcon-H1 Arabic’s safety refinement across three Falcon generations all address the trustworthiness gaps that AraTrust identified.
The OALL’s inclusion of AraTrust as a version 2 benchmark ensures that trustworthiness evaluation is not optional for Arabic LLMs seeking competitive positioning. The 700+ models evaluated against AraTrust through the OALL create a comprehensive trustworthiness database that tracks the Arabic AI ecosystem’s progress on safety and alignment. For the investors behind $858 million in MENA AI venture capital and the 664 AI companies operating in Saudi Arabia, AraTrust provides assurance that the foundation models these investments build upon meet minimum trustworthiness standards.
AraTrust’s Influence on Arabic LLM Development Practices
AraTrust’s publication of trustworthiness evaluation results has directly influenced Arabic LLM development practices. The finding that accuracy and trustworthiness are distinct capabilities — models can answer Arabic questions correctly while generating culturally inappropriate content — prompted development teams to invest in safety frameworks alongside capability improvements.
Jais 2’s development explicitly incorporated trustworthiness evaluation informed by AraTrust’s dimensions. The safety framework was developed in consultation with Arabic-speaking experts from multiple countries, addressing the cultural diversity that AraTrust’s evaluation highlights. ALLaM 34B’s engagement of 400 subject matter experts for model testing included trustworthiness evaluation alongside accuracy assessment, reflecting AraTrust’s influence on development methodology. Falcon-H1 Arabic’s safety framework, refined across three Falcon generations, demonstrates institutional learning about Arabic AI trustworthiness that AraTrust’s evaluation framework catalyzed.
The industry-wide impact extends beyond the three major Arabic LLMs. The OALL’s inclusion of AraTrust as a version 2 benchmark means that all 700+ submitted models are evaluated against trustworthiness criteria. This universal evaluation creates competitive incentive for trustworthiness improvement across the entire Arabic AI ecosystem, benefiting Arabic speakers who interact with Arabic AI systems regardless of which specific model powers their experience.
Eight Trustworthiness Dimensions Explained
AraTrust’s eight dimensions provide comprehensive Arabic AI trustworthiness coverage. Truthfulness evaluates factual accuracy, testing for hallucination in Arabic-specific knowledge domains. Ethics evaluates alignment with ethical principles in Arabic cultural contexts. Privacy tests information boundary respect and compliance with regulations like Saudi PDPL. Illegal activities evaluates refusal to assist with illegal actions. Mental health evaluates safe responses to Arabic mental health queries. Physical health evaluates medical information accuracy with appropriate disclaimers. Unfairness evaluates biases along gender, ethnic, national, or sectarian lines in Arabic contexts. Offensive language evaluates avoidance of culturally offensive content — particularly challenging because offensiveness varies across Arabic-speaking countries.
The 522 human-written MCQs enable standardized, reproducible evaluation. Each question presents a scenario with multiple response options, the correct answer representing the most trustworthy response. This format supports automated scoring at scale while retaining the nuance that human-written questions provide over machine-generated alternatives, a combination that helped establish AraTrust as the reference Arabic AI safety benchmark and a core component of OALL v2.
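The MCQ scoring described above reduces to per-dimension accuracy. Here is a minimal sketch; the question fields (`id`, `dimension`, `answer`) and the sample data are illustrative assumptions, not the benchmark's actual schema.

```python
# Sketch of MCQ accuracy scoring in the style AraTrust describes: each
# question has one option representing the most trustworthy response.
# ASSUMPTION: the field names and sample data are illustrative, not the
# benchmark's actual schema.

from collections import defaultdict

def score_by_dimension(questions, predictions):
    """Compute per-dimension accuracy as a percentage.

    questions:   list of dicts with 'id', 'dimension', and 'answer' keys
    predictions: dict mapping question id -> the model's chosen option
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["dimension"]] += 1
        if predictions.get(q["id"]) == q["answer"]:
            correct[q["dimension"]] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}

questions = [
    {"id": 1, "dimension": "privacy", "answer": "B"},
    {"id": 2, "dimension": "privacy", "answer": "A"},
    {"id": 3, "dimension": "ethics", "answer": "C"},
]
predictions = {1: "B", 2: "C", 3: "C"}
print(score_by_dimension(questions, predictions))
# → {'privacy': 50.0, 'ethics': 100.0}
```

Grouping accuracy by dimension is what lets the benchmark surface the targeted gaps, such as privacy awareness or mental health sensitivity, that the article describes developers using to build focused training datasets.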
AraTrust and Arabic AI Governance
AraTrust contributes to the emerging field of Arabic AI governance by establishing measurable trustworthiness standards that regulators, enterprise customers, and civil society can reference. Saudi Arabia’s PDPL and the UAE’s data protection frameworks establish legal requirements for AI systems, but these legal frameworks need technical evaluation tools to verify compliance. AraTrust provides the technical evaluation methodology that translates legal trustworthiness requirements into measurable model assessments, bridging the gap between regulatory intent and technical verification.
Related Coverage
- Arabic AI Benchmarks — Full benchmark coverage
- Arabic LLMs — Model performance context
- OALL Benchmark Analysis — Leaderboard methodology
- ArabicMMLU Results — Knowledge evaluation complement
- AceGPT — Cultural Alignment — RLAIF methodology context
- Jais — Arabic LLM — Safety framework evolution
- SDAIA Strategy — AI governance frameworks
- RLHF and RLAIF — Alignment methodology reference