The availability and quality of Arabic AI datasets determine the ceiling of Arabic model performance. This survey catalogs the major public datasets available for Arabic AI development, evaluating their coverage, quality, and suitability for different applications.
Training Corpora
Arabic training data for LLM development draws from multiple source categories. Web crawl data provides the largest volumes but requires extensive quality filtering. News corpora (OSIAN, AraNews) provide high-quality MSA but limited dialectal coverage. The GUMAR corpus from CAMeL Lab contains 100 million words of Gulf Arabic. The MADAR corpus provides parallel sentences across 25 city dialects. The Camel Treebank (CAMeLTB) offers 188,000 words of syntactically annotated Arabic.
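The quality filtering that web-crawl Arabic requires typically starts with script-ratio heuristics and deduplication. A minimal sketch of that idea (the thresholds and the exact-hash dedup strategy are illustrative assumptions, not taken from any specific pipeline):

```python
import hashlib
import re

# Core Arabic-script block (U+0600-U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Arabic-script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_CHAR.match(c)) / len(chars)

def filter_corpus(docs, min_ratio=0.7, min_chars=50):
    """Keep documents that are mostly Arabic, long enough, and not exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        if len(doc) < min_chars or arabic_ratio(doc) < min_ratio:
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:  # exact-duplicate removal via content hashing
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines layer perplexity filtering and near-duplicate detection (e.g. MinHash) on top of exact-match heuristics like these.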
Post-Training Datasets
Arabic post-training datasets span 12 task categories mapped to four dimensions: LLM Capabilities (Q&A, translation, reasoning, summarization, dialogue, code generation, function calling), Steerability, Alignment (cultural alignment, safety, ethics), and Robustness. Research has highlighted limitations in current Arabic post-training datasets, including limited dialectal representation, insufficient coverage of safety-critical scenarios, and quality inconsistencies.
Synthetic Data
Synthetic data generation for Arabic AI is an emerging approach that addresses the scarcity of labeled Arabic data. Researchers have generated datasets by prompting GPT-4 directly in Arabic, producing 43,316 multi-turn conversations across 93 topics. The quality of synthetic Arabic data depends critically on the generating model’s Arabic competence and the prompting strategy’s cultural awareness.
AceGPT’s training methodology illustrates the role of synthetic data in Arabic LLM development. The model’s supervised fine-tuning stage used GPT-4-generated responses to Arabic instructions, creating high-quality Arabic instruction-following examples at scale. This synthetic approach bypasses the scarcity of manually annotated Arabic instruction data while leveraging GPT-4’s language generation capability. However, synthetic Arabic data carries risks: the generating model’s potential cultural biases, English-mediated artifacts in Arabic generation, and coverage gaps in dialectal Arabic where GPT-4’s own training data is limited.
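One common mitigation for these risks is a post-generation quality gate that rejects synthetic examples with obvious artifacts such as Latin-script leakage. A hedged sketch of such a gate (the thresholds `min_len` and `max_latin_ratio` are illustrative assumptions, not values from the AceGPT pipeline):

```python
import re

ARABIC = re.compile(r"[\u0600-\u06FF]")
LATIN = re.compile(r"[A-Za-z]")

def passes_quality_gate(response: str, min_len=20, max_latin_ratio=0.05):
    """Heuristic filter for synthetic Arabic SFT examples.

    Rejects responses that are too short, not predominantly Arabic-script,
    or that leak Latin-script text (a common English-mediated artifact).
    """
    letters = [c for c in response if c.isalpha()]
    if len(letters) < min_len:
        return False
    arabic = sum(1 for c in letters if ARABIC.match(c))
    latin = sum(1 for c in letters if LATIN.match(c))
    if arabic / len(letters) < 0.8:
        return False
    if latin / len(letters) > max_latin_ratio:
        return False
    return True
```

A gate like this catches only surface artifacts; cultural bias and dialectal coverage gaps require human review or model-based evaluation.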
Dialect-Specific Datasets
Dialect-specific datasets address the critical gap between MSA-dominated corpora and the dialectal Arabic that characterizes real-world communication. The GUMAR corpus from CAMeL Lab provides 100 million words of Gulf Arabic, enabling focused evaluation and training for Gulf dialect applications. The MADAR corpus offers parallel sentences across 25 city dialects plus English, French, and MSA, making it the most geographically diverse Arabic parallel corpus available.
The SADA (Saudi Audio Dataset for Arabic) corpus provides 668 hours of Arabic speech from Saudi television shows, covering multiple dialects and acoustic environments. The best model trained on SADA — MMS 1B fine-tuned with a 4-gram language model — achieves 40.9 percent WER and 17.6 percent CER, establishing baselines for Saudi-dialect speech recognition.
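WER and CER are both edit-distance ratios, computed over words and characters respectively, which is why CER is typically much lower than WER for morphologically rich Arabic. A self-contained sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words, or strings)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,             # deletion
                        dp[j - 1] + 1,         # insertion
                        prev_diag + (r != h))  # substitution (free on match)
            prev_diag = cur
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)
```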
The NADI (Nuanced Arabic Dialect Identification) shared task series provides standardized evaluation for dialect identification systems. NADI tasks range from country-level dialect identification to fine-grained city-level classification, pushing the boundaries of how precisely NLP systems can identify Arabic dialectal variation from text alone.
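Character n-gram features are a standard baseline for dialect identification tasks of the NADI kind, since dialects differ in characteristic function words and affixes. A toy nearest-profile classifier (the training sentences below are invented examples, not NADI data):

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts, padded so word boundaries are captured."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class NgramDialectID:
    """Toy nearest-profile classifier: one pooled n-gram profile per dialect."""
    def __init__(self):
        self.profiles = {}

    def fit(self, labeled):
        for label, text in labeled:
            self.profiles.setdefault(label, Counter()).update(char_ngrams(text))

    def predict(self, text):
        grams = char_ngrams(text)
        return max(self.profiles, key=lambda lab: cosine(grams, self.profiles[lab]))
```

Competitive NADI systems use fine-tuned transformer encoders, but this feature family remains a surprisingly strong reference point.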
Mozilla Common Voice’s Arabic component provides crowd-sourced speech data covering various Arabic dialects, serving as training data for fine-tuned Whisper and other ASR models. The crowd-sourced nature introduces dialect diversity that studio-recorded corpora lack, though at the cost of variable recording quality.
Morphological and Linguistic Datasets
Arabic morphological datasets provide the linguistic annotations essential for NLP tool development and evaluation. The Camel Treebank (CAMeLTB), spanning 188,000 words from pre-Islamic poetry to social media, provides dependency-parsed Arabic text that follows the CATiB annotation scheme. This temporal breadth enables linguistic analysis across Arabic’s full historical range.
The QALB (Qatar Arabic Language Bank) corpus contains 2 million manually corrected Arabic words, providing gold-standard error correction annotations for Arabic grammatical error detection and correction systems. The SAMER lexicon offers 26,000 lemmas annotated for readability in MSA, supporting text complexity assessment for educational applications.
CALIMA Star and BAMA/SAMA provide morphological analysis databases that enumerate the legal morphological analyses for Arabic word forms. These databases underpin morphological analysis tools like MADAMIRA and CAMeL Tools, enabling disambiguation over the 300,000-plus possible morphological tags that full Arabic analysis involves.
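The analyze-then-disambiguate pattern these databases support can be illustrated with a toy lookup (the mini database and tag priors below are invented for illustration; real databases like CALIMA Star store far richer analysis records, and real disambiguators score analyses in context):

```python
# Hypothetical miniature database: each surface form maps to the set of
# legal (lemma, tag) analyses an analyzer would enumerate for it.
MINI_DB = {
    "كتب": [("كِتاب", "NOUN_PL"), ("كَتَبَ", "VERB_PERF")],
    "ذهب": [("ذَهَب", "NOUN"), ("ذَهَبَ", "VERB_PERF")],
}

# Illustrative tag priors standing in for a trained disambiguation model.
TAG_PRIOR = {"VERB_PERF": 0.45, "NOUN": 0.35, "NOUN_PL": 0.20}

def analyze(word):
    """Enumerate all legal analyses for a surface form (empty if out of vocabulary)."""
    return MINI_DB.get(word, [])

def disambiguate(word):
    """Pick the highest-prior analysis; real systems condition on sentence context."""
    analyses = analyze(word)
    if not analyses:
        return None
    return max(analyses, key=lambda a: TAG_PRIOR.get(a[1], 0.0))
```

The key structural point is the two-stage split: the database constrains the hypothesis space to legal analyses, and a statistical model chooses among them.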
Evaluation Benchmark Datasets
The Arabic evaluation landscape encompasses over 40 distinct benchmarks across multiple AI capabilities. ArabicMMLU’s 14,575 questions from educational exams provide academic knowledge evaluation. AraTrust’s 522 human-written questions assess trustworthiness across eight dimensions. BALSAM’s 78 tasks with 52,000 samples and private test sets prevent contamination. SILMA AI’s Arabic Broad Benchmark covers 22 categories with 470 human-validated questions from 64 Arabic datasets.
The ACVA benchmark, introduced by the AceGPT research team, evaluates Arabic cultural and value alignment — a dimension absent from translated benchmarks. The AceGPT benchmark suite comprising 58 datasets was adopted by the Open Arabic LLM Leaderboard, establishing community-standard evaluation tools. MadinahQA evaluates Islamic and cultural knowledge, testing model understanding of content deeply rooted in Arabic-Islamic intellectual tradition.
The Open Universal Arabic ASR Leaderboard evaluates speech recognition models on Arabic data, with top performers including Nvidia Conformer-CTC-Large, Whisper Large variants, and SeamlessM4T. The Arabic MTEB benchmark evaluates embedding models across retrieval, semantic similarity, classification, clustering, and other tasks, providing criteria for selecting embedding models appropriate for Arabic RAG applications.
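Retrieval benchmarks of this kind typically rank embedding models with metrics such as recall@1 over cosine similarity. A minimal sketch with toy vectors (illustrative only; MTEB-style suites use larger corpora and additional metrics like nDCG):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_1(query_vecs, doc_vecs, gold_ids):
    """Fraction of queries whose top-ranked document is the labeled gold one."""
    hits = 0
    for q, gold in zip(query_vecs, gold_ids):
        best = max(range(len(doc_vecs)), key=lambda i: cosine(q, doc_vecs[i]))
        hits += (best == gold)
    return hits / len(query_vecs)
```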
Data Quality and Contamination Concerns
Data quality remains the most significant challenge in Arabic AI dataset development. The distinction between native Arabic content and machine-translated material has a measurable impact on model quality that early benchmarks failed to capture. The OALL’s transition from v1 (including machine-translated tasks) to v2 (native Arabic only) was driven by the recognition that translated evaluation systematically inflated scores for models trained on translated content.
Benchmark contamination — where models achieve high scores through memorization of benchmark data encountered during training — is a documented concern for public Arabic benchmarks. BALSAM’s private test set methodology addresses this directly, but most Arabic evaluation datasets remain publicly available on Hugging Face and other platforms. Research has noted that many high-scoring models on the OALL achieve results through surface-level pattern recognition rather than true linguistic understanding — a finding that underscores the importance of multi-benchmark evaluation using diverse assessment methodologies.
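A common contamination screen checks whether benchmark items share long word n-grams with the training corpus. A simplified sketch (the 8-gram default is a typical choice in contamination studies, not a standard mandated by any Arabic benchmark):

```python
def ngram_set(text: str, n: int = 8):
    """Set of word n-grams; long n-grams rarely recur by chance."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_text, n=8):
    """Fraction of benchmark items sharing at least one word n-gram with training text."""
    train = ngram_set(training_text, n)
    flagged = sum(1 for item in benchmark_items if ngram_set(item, n) & train)
    return flagged / len(benchmark_items)
```

Exact n-gram matching misses paraphrased leakage, which is one reason private test sets like BALSAM’s remain the stronger safeguard.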
The scarcity of labeled Arabic data relative to English creates pressure to expand datasets through annotation, synthetic generation, and cross-lingual transfer. Each expansion approach introduces quality tradeoffs: human annotation is expensive and slow but produces the highest quality; synthetic generation is scalable but introduces model biases; cross-lingual transfer preserves source language artifacts. Organizations building Arabic AI systems must evaluate these tradeoffs against their specific quality requirements and resource constraints.
Core Arabic NLP Datasets and Resources
The Arabic AI dataset ecosystem encompasses training datasets, evaluation benchmarks, and linguistic resources that collectively support Arabic language model development and evaluation. Training datasets provide the raw material for model learning. Evaluation benchmarks measure model performance. Linguistic resources (morphological analyzers, treebanks, lexicons) provide the structured linguistic knowledge that complements statistical learning.
Major Arabic training datasets include the Arabic components of Common Crawl and OSCAR (web-scale Arabic text), the Arabic Gigaword corpus (news text from multiple Arabic sources), and specialized corpora like GUMAR (100 million words of Gulf Arabic) and the MADAR corpus (parallel sentences in 25 city dialects). These resources serve as building blocks for Arabic LLM training — Jais 2’s 600+ billion Arabic tokens, ALLaM’s 500 billion Arabic tokens, and Falcon Arabic’s 600 billion tokens all draw from combinations of these sources, supplemented by proprietary data collection.
Evaluation Dataset Proliferation
The Arabic evaluation dataset landscape has expanded from a handful of benchmarks to over 40 distinct evaluations covering LLM performance, multimodality, embedding, retrieval, RAG generation, speech, and OCR. This proliferation reflects the Arabic AI community’s recognition that single-benchmark evaluation cannot capture the multi-dimensional nature of Arabic language capability.
Key evaluation datasets include ArabicMMLU (14,575 native Arabic MCQs from educational exams), AraTrust (522 human-written trustworthiness questions), BALSAM (78 tasks, 52,000 samples with private test sets), and SILMA AI’s ABB (470 human-validated questions from 64 datasets). Speech evaluation includes the SADA corpus (668 hours of Saudi Arabic audio) and the Open Universal Arabic ASR Leaderboard. Dialectal evaluation includes the NADI shared task series and the MADAR parallel dialect corpus.
Dataset Quality and Contamination Concerns
Arabic AI dataset quality has emerged as a critical concern as the evaluation ecosystem matures. Machine-translated datasets — where English evaluation questions were translated to Arabic — inflate scores for models trained on translated content, producing misleading evaluation results. The OALL version 2’s removal of translated tasks represents the community’s response to this concern.
Data contamination — where evaluation questions appear in training data — poses another integrity challenge. BALSAM’s private test sets address this concern for evaluation data. For training data, the quality challenge involves ensuring that Arabic text is native rather than machine-translated, orthographically consistent, and representative of the dialectal and register diversity that production Arabic AI applications encounter. The CAMeL Lab’s CODA orthographic standard, morphological analysis tools, and curated dialect corpora provide the quality infrastructure that Arabic dataset curation requires.
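Orthographic consistency checks often begin with normalization rules like the following (these are common Arabic preprocessing steps offered as illustration; CODA itself is a much richer convention for writing dialectal Arabic, not just a character-mapping table):

```python
import re

# Common Arabic text normalizations used before training or matching.
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"                           # decorative elongation character

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)      # strip short-vowel and gemination marks
    text = text.replace(TATWEEL, "")     # drop elongation
    text = re.sub("[إأآ]", "ا", text)    # unify hamza-carrying alef variants
    text = text.replace("ى", "ي")        # map alef maqsura to ya
    return text
```

Which of these mappings is appropriate depends on the task: error-correction corpora like QALB need the original orthography preserved, while retrieval and deduplication usually benefit from aggressive normalization.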
The investment trajectory in MENA AI — $858 million in AI VC during 2025, Saudi Arabia’s $9.1 billion in AI funding, Project Transcendence’s $100 billion — ensures continued funding for Arabic dataset development. As commercial deployment generates user interaction data (properly anonymized), this production data supplements pre-deployment datasets with real-world Arabic usage patterns that curated datasets cannot replicate.
Arabic Dataset Governance and Future Requirements
Arabic dataset governance — ensuring that datasets used for training and evaluation comply with copyright, privacy, and cultural sensitivity requirements — is emerging as a critical concern. Saudi Arabia’s PDPL imposes requirements on personal data within training datasets. Copyright frameworks across Arabic-speaking countries affect the legality of including published Arabic content in training corpora. Cultural sensitivity requirements — particularly for religious content, political speech, and social commentary — add governance dimensions absent from English dataset governance.
Future Arabic dataset development will need to address consent frameworks for user-generated content, compensation models for Arabic content creators whose work appears in training data, and quality verification processes that ensure dataset accuracy across the diversity of Arabic varieties and knowledge domains. The MENA AI ecosystem’s investment trajectory — $858 million in AI VC, $9.1 billion in Saudi AI funding, the $100 billion Project Transcendence allocation — provides financial resources for dataset development at scale, but the governance infrastructure for responsible Arabic dataset curation requires institutional development alongside financial investment.
Key Arabic Datasets by Domain
Understanding the Arabic dataset landscape requires mapping available resources to their intended applications. For LLM pre-training, the critical datasets include web-crawled Arabic text (filtered and deduplicated), Arabic Wikipedia (high quality but limited volume), Arabic news archives (formal MSA with temporal breadth), and Arabic book digitization projects. Jais 2 assembled 600 billion Arabic tokens from these sources — the richest Arabic-first training dataset at release. ALLaM used 500 billion Arabic tokens incorporating data from 16 public entities and 300 Arabic books.
For evaluation, native Arabic benchmarks have replaced machine-translated alternatives. ArabicMMLU provides 14,575 educational exam questions. AraTrust provides 522 trustworthiness evaluation questions. BALSAM provides 52,000 samples across 78 tasks with private test sets. SILMA ABB provides 470 human-validated questions from 64 source datasets across 22 categories.
For dialectal research, the CAMeL Lab corpora remain foundational. MADAR provides parallel sentences across 25 city dialects plus English, French, and MSA. GUMAR provides 100 million words of Gulf Arabic. CAMeLTB provides 188,000 words of dependency-annotated Arabic spanning pre-Islamic poetry to social media. QALB provides 2 million words of manually error-corrected Arabic. SAMER provides a 26,000-lemma readability lexicon for MSA.
For speech, the SADA corpus provides 668 hours of Saudi television audio, and Mozilla Common Voice provides crowd-sourced Arabic read speech for ASR model training and evaluation.
Related Coverage
- Arabic AI Benchmarks — Full benchmark coverage
- Arabic LLMs — Model performance context
- OALL Benchmark Analysis — Leaderboard methodology
- Arabic LLM Training Data — Cross-model corpus comparison
- CAMeL Tools — NLP toolkit using these datasets
- Arabic Dialect Coverage — Dialect representation in data
- Arabic Speech Recognition — ASR dataset usage
- ArabicMMLU Results — Educational exam benchmark