Arabic Dialect Coverage — MSA and Dialectal Performance Across Major Arabic LLMs
Comparative analysis of dialect coverage across Jais, ALLaM, Falcon Arabic, and AceGPT — performance on MSA versus regional varieties including Gulf, Egyptian, Levantine, and Maghrebi Arabic.
Arabic is not one language but a continuum of varieties spanning 22 countries and over 400 million speakers. Modern Standard Arabic — the formal register used in news, education, government, and inter-regional communication — coexists with dozens of regional dialects that differ from MSA and from each other as significantly as Romance languages differ from Latin. For Arabic large language models, dialect coverage is not a feature — it is a fundamental requirement for real-world utility.
The Dialect Challenge
The challenge of Arabic dialect coverage in LLMs begins with data representation. Modern Standard Arabic dominates Arabic digital content because it is the standard for news organizations, academic publications, government documents, and formal web content. Dialectal Arabic, despite being the primary language of daily communication for virtually all Arabic speakers, is underrepresented in written digital form. When dialects appear in text, they are concentrated in social media, messaging platforms, and informal online forums — sources that are voluminous but noisy.
This data imbalance means that LLMs trained on Arabic web crawl data without dialect-aware curation develop strong MSA capabilities but weak dialectal performance. Users interacting in their native dialect — Egyptian Arabic for a Cairo customer service agent, Gulf Arabic for a Riyadh social media analyst, Levantine Arabic for a Beirut content creator — encounter a model that responds in formal MSA even when addressed in dialect. This register mismatch undermines user trust and limits practical deployment.
Model-by-Model Dialect Coverage
Jais 2 provides the broadest explicit dialect coverage among current Arabic LLMs, training on 17 identified regional varieties. The dialect inventory covers six Gulf varieties (UAE, Saudi, Kuwaiti, Bahraini, Qatari, Omani), Egyptian Arabic, four Levantine varieties (Palestinian, Jordanian, Lebanese, Syrian), Iraqi Arabic, four Maghrebi varieties (Moroccan, Algerian, Tunisian, Libyan), and Sudanese Arabic. This coverage is supplemented by Arabizi support, enabling the model to process dialectal Arabic written in Latin characters.
ALLaM benefits from Saudi-centric dialect coverage, with particular strength in Gulf Arabic varieties. The training data assembled from 16 Saudi government entities provides deep coverage of Saudi administrative and professional Arabic, while the model’s broader Arabic training ensures MSA competence. ALLaM’s dialect capabilities are strongest for Saudi and Gulf varieties and progressively weaker for geographically distant dialects.
Falcon Arabic emphasizes native dialect training data, with Falcon-H1 Arabic improving dialect coverage through expanded training corpora. The Open Arabic LLM Leaderboard results confirm strong performance across MSA and major dialects, though the model’s performance varies by dialect in patterns consistent with training data representation.
AceGPT approaches dialect coverage through cultural alignment rather than raw data volume. The RLAIF methodology enables the model to adapt its outputs based on cultural context, partially compensating for limited dialectal training data. However, the model’s reliance on Llama 2’s tokenizer limits its efficiency when processing dialectal Arabic with non-standard orthography.
Performance Patterns
Across all models, a consistent pattern emerges: MSA performance significantly exceeds dialectal performance. The Open Universal Arabic ASR Leaderboard confirms this pattern for speech recognition, and analogous gaps exist in text generation and understanding tasks. Egyptian Arabic typically shows the strongest dialectal performance after MSA, likely because Egypt’s media industry generates the largest volume of dialectal Arabic content available for training.
Gulf Arabic varieties show strong performance in models with UAE or Saudi development connections — Jais excels at Emirati and Gulf varieties, ALLaM at Saudi Arabic. Levantine and Maghrebi varieties consistently show weaker performance across all models, reflecting their lower representation in training corpora and the greater linguistic distance from MSA.
Arabizi — The Fifth Register
Beyond MSA and regional dialects, Arabizi represents a distinct register that Arabic LLMs must process. Arabizi — Arabic written using Latin characters, often with numerals substituting for Arabic letters lacking Latin equivalents (2 for hamza, 3 for ain, 7 for ha) — is the dominant mode of informal digital communication among younger Arabic speakers across messaging platforms, social media comments, and online forums.
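The numeral substitutions described above can be sketched as a small transliteration table. This is a deliberately simplified illustration, not a production normalizer: the mapping, function name, and digit choices are assumptions for the example, and real Arabizi is far more variable (digraphs like "3'" for ghain, region-dependent conventions, inconsistent vowel spellings).

```python
# Minimal sketch of common Arabizi digit-to-Arabic-letter substitutions.
# Only the widely agreed digits are included; conventions vary by region.
ARABIZI_DIGITS = {
    "2": "ء",  # hamza
    "3": "ع",  # ain
    "5": "خ",  # kha (also written "7'" or "kh")
    "7": "ح",  # ha
}

def normalize_arabizi_digits(text: str) -> str:
    """Replace Arabizi digit conventions with Arabic letters,
    leaving all other characters (including Latin letters) untouched."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)

print(normalize_arabizi_digits("3arabi"))  # → "عarabi"
```

A real pipeline would also handle digraphs and context (a "3" inside a phone number must not be transliterated), which is part of why Arabizi is hard for models trained mainly on Arabic script.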
Jais 2 incorporates substantial Arabizi training data, recognizing that this register is essential for applications targeting younger demographics — social media analysis, customer service chatbots, educational platforms, and marketing analytics. ALLaM’s training data, drawn from government and institutional sources, naturally underrepresents Arabizi, creating a capability gap for informal digital communication use cases. Falcon Arabic’s emphasis on native Arabic training data includes Arabizi representation, though the specific proportion is not publicly documented. AceGPT holds an incidental advantage for Arabizi processing: because Arabizi is written in Latin characters, it tokenizes naturally under Llama 2’s English-optimized tokenizer, avoiding the character-level fragmentation that Arabic script incurs.
The NADI (Nuanced Arabic Dialect Identification) shared task series provides ongoing evaluation of dialect identification capabilities, including Arabizi recognition. Models that perform well on formal written dialects often struggle with Arabizi because the Latin orthography strips away diacritical marks, letter shape distinctions, and other cues that Arabic script provides for dialect identification.
Benchmark Evaluation of Dialect Performance
The Open Arabic LLM Leaderboard’s version 2 benchmarks provide indirect assessment of dialect capabilities through native Arabic evaluations. ArabicMMLU’s 14,575 questions from educational exams across Arab countries include regional educational content that reflects different Arabic-speaking countries’ linguistic conventions. AraTrust’s 522 human-written questions assess cultural sensitivity that varies by dialect region. However, no single benchmark systematically evaluates performance across all 17 dialect varieties that Jais 2 claims to support.
BALSAM’s 78 tasks and 52,000 samples include dialectal Arabic evaluation, but the private test sets that prevent contamination also prevent detailed analysis of per-dialect performance. SILMA AI’s Arabic Broad Benchmark, with 470 questions from 64 Arabic datasets across 22 categories, provides additional dialectal coverage but does not isolate dialect-specific performance metrics.
The gap between aggregate benchmark scores and dialect-specific performance remains a significant challenge for Arabic AI evaluation. A model scoring 75 percent on the OALL might achieve 85 percent on MSA tasks and 55 percent on Maghrebi Arabic tasks — yet the aggregate score conceals this variance. Organizations deploying Arabic LLMs for dialect-specific use cases must conduct their own dialect-specific evaluations rather than relying on aggregate metrics.
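The arithmetic behind this concealment is straightforward. The sketch below uses invented per-dialect scores and task-mix weights — they are not published OALL figures — to show how an MSA-heavy task mix can yield a headline score near 75 percent while Maghrebi performance sits at 55 percent.

```python
# Hypothetical illustration of how an aggregate benchmark score can
# conceal per-dialect variance. All numbers are invented for the example.
per_dialect_accuracy = {"MSA": 0.85, "Egyptian": 0.72, "Gulf": 0.70, "Maghrebi": 0.55}
task_share = {"MSA": 0.50, "Egyptian": 0.15, "Gulf": 0.15, "Maghrebi": 0.20}  # MSA-heavy mix

aggregate = sum(per_dialect_accuracy[d] * task_share[d] for d in per_dialect_accuracy)
print(f"aggregate: {aggregate:.2f}")  # → aggregate: 0.75, despite 0.55 on Maghrebi
```

Because the weights reflect benchmark composition rather than a deployment's traffic mix, the same model can look adequate on a leaderboard and inadequate for a Maghrebi-facing product.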
Implications for Deployment
The dialect performance gap has direct implications for commercial deployment. Customer service chatbots deployed in Egypt must handle Egyptian Arabic fluently, not respond in MSA when addressed in colloquial Egyptian. Social media analysis tools must correctly interpret dialect-specific vocabulary and sentiment markers. Healthcare applications must understand patients who describe symptoms in their local dialect rather than medical MSA.
Arabic chatbot platforms have developed practical approaches to dialect handling. Arabot uses a proprietary private LLM with Arabic dialect understanding capabilities, supplemented by public LLM integration with ChatGPT and Gemini for broader knowledge access. Maqsam employs a dual-model architecture combining text and audio processing with multi-dialect reasoning, operating across Saudi Arabia, Egypt, Jordan, UAE, and Qatar. YourGPT provides specific support for Gulf, Egyptian, and Levantine dialects alongside Turkish, Hebrew, and Kurdish. Thinkstack covers Gulf, Egyptian, Levantine, and Maghrebi dialects with local slang adaptation.
These platform approaches demonstrate that production dialect handling often requires combining LLM capabilities with dialect-specific routing, user dialect detection, and fallback mechanisms — engineering complexity that benchmark evaluations do not capture but on which deployment success depends.
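The routing-plus-fallback pattern can be sketched as follows. Every name here — the marker-based `detect_dialect`, the model registry, `make_router` — is hypothetical and stands in for components that, in production, would be a trained classifier (e.g. built on MADAR- or NADI-style data) and actual model endpoints; it is not any vendor's API.

```python
# Sketch of the pattern described above: detect the user's dialect,
# route to a dialect-capable model when one exists, and fall back to
# a general MSA model otherwise. All names are illustrative.
from typing import Callable

DialectModel = Callable[[str], str]

def detect_dialect(text: str) -> str:
    """Toy placeholder: a real system would use a trained dialect
    identifier, not a handful of lexical markers."""
    egyptian_markers = ("مش", "ازاي", "عايز")
    return "egyptian" if any(m in text for m in egyptian_markers) else "msa"

def make_router(models: dict[str, DialectModel], fallback: DialectModel) -> DialectModel:
    def respond(user_text: str) -> str:
        dialect = detect_dialect(user_text)
        model = models.get(dialect, fallback)  # fall back when no dialect model exists
        return model(user_text)
    return respond

# Toy usage: each "model" is just a tagger here.
router = make_router(
    models={"egyptian": lambda t: f"[egyptian-model] {t}"},
    fallback=lambda t: f"[msa-model] {t}",
)
print(router("عايز أعرف مواعيد الشغل"))  # routed to the Egyptian model
```

The design point is that dialect handling lives partly outside the LLM: misrouting degrades gracefully to MSA rather than failing, which is exactly the behavior aggregate benchmarks never exercise.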
Morphological Complexity and Dialect Processing
Arabic’s morphological complexity creates unique challenges for dialect processing in LLMs. The CAMeL Lab at NYU Abu Dhabi has documented over 300,000 possible POS tags for Arabic versus approximately 50 for English, with an average of 12 morphological analyses per word. Dialectal Arabic adds further complexity because dialects often employ morphological patterns that differ from MSA — Gulf Arabic uses different pronominal suffix forms, Egyptian Arabic has distinct negation patterns, and Maghrebi Arabic employs verb conjugation forms absent from MSA.
Morphological analysis tools designed for MSA — including MADAMIRA, CALIMA Star, and YAMAMA from the CAMeL Lab — require adaptation for dialectal processing. YAMAMA, designed specifically as a multi-dialect Arabic morphological analyzer running 5x faster than MADAMIRA, represents progress toward dialect-aware morphological analysis. The MADAR corpus, containing parallel sentences in 25 city dialects plus English, French, and MSA, provides training data for dialect identification and processing.
The GUMAR corpus — 100 million words of Gulf Arabic — and similar dialect-specific corpora enable focused evaluation of dialect processing quality. Models demonstrating strong performance on these targeted corpora typically deliver better real-world results for dialect-specific applications than models that score well only on aggregate benchmarks.
For dialect-sensitive applications, evaluation against the specific target varieties — using corpora such as GUMAR where they exist — is therefore a better predictor of deployment quality than aggregate benchmark scores that overweight MSA.
Future Directions in Dialect Coverage
The Arabic AI ecosystem’s investment trajectory suggests that dialect coverage will continue to expand. Saudi Arabia’s $9.1 billion in AI funding during 2025 and the UAE’s AI market projected to reach $4.25 billion by 2033 provide the financial foundation for sustained data collection and model development. HUMAIN’s $10 billion venture fund and the $1 billion GAIA Accelerator can finance dialect-specific data collection campaigns that produce the labeled corpora needed for underrepresented varieties.
The NADI shared task series continues to advance dialect identification capabilities, with each iteration introducing more fine-grained dialect classification challenges. Recent NADI tasks have moved beyond country-level identification toward city-level dialect classification — distinguishing Cairene from Alexandrian Egyptian Arabic, or Riyadhi from Jeddawi Saudi Arabic. Models that achieve high accuracy on these fine-grained tasks demonstrate dialect awareness beyond what aggregate benchmarks measure.
Speech recognition advances contribute to dialect coverage through transcription. The SADA corpus — 668 hours of Saudi Arabic audio from television shows — and similar dialect-specific speech datasets enable models to learn from spoken dialectal content that has no written equivalent. The Open Universal Arabic ASR Leaderboard tracks dialect-specific recognition accuracy, with top models including NVIDIA Conformer-CTC-Large and Whisper Large series showing measurable performance variation across dialects that mirrors the text-based dialect coverage patterns.
Cross-dialect transfer learning represents a promising research direction. Models trained on well-resourced dialects (Egyptian, Gulf) may transfer partially to underresourced dialects (Sudanese, Libyan) through shared Arabic linguistic features. Research at MBZUAI and KAUST is exploring whether dialect-aware fine-tuning strategies can improve performance on low-resource dialects without proportional data collection investments — a capability that would accelerate dialect coverage expansion across the Arabic-speaking world’s full linguistic diversity.
Commercial Implications of Dialect Performance Gaps
The commercial value of dialect coverage varies dramatically by market segment. Customer service deployments — where Arabot, Maqsam, YourGPT, and Thinkstack compete — require production-quality dialect handling that directly affects customer satisfaction and retention. A chatbot that responds in MSA when addressed in Egyptian Arabic creates the same disconnect as a customer service agent who insists on speaking in formal English when called by someone speaking casual American English. The $1.3 trillion annual cost reduction potential that conversational AI represents across global businesses depends on natural, dialect-appropriate interaction that current models achieve unevenly.
The healthcare sector presents particularly high-stakes dialect requirements. Patients describing symptoms in their local dialect — using Gulf Arabic terms for pain, Levantine Arabic descriptions of symptoms, or Egyptian Arabic explanations of medical history — need models that understand dialectal medical vocabulary, not just MSA clinical terminology. Misunderstanding dialect-specific symptom descriptions carries clinical risk that aggregate benchmark scores cannot quantify. HUMAIN’s engagement of medical experts during ALLaM’s training addressed this for Saudi medical Arabic specifically, but equivalent dialect-specific medical validation has not been conducted for most other varieties.
The education sector similarly requires dialect awareness. Saudi Arabia’s designation of 2026 as the Year of AI, with educational AI deployments across the kingdom’s school system, demands models that understand how Saudi students actually communicate — mixing Gulf Arabic dialect with MSA in ways that pure-MSA models misinterpret. The 664 AI companies operating in Saudi Arabia include educational technology firms building Arabic tutoring systems, adaptive learning platforms, and assessment tools that all depend on accurate dialect comprehension for the specific student populations they serve.
Related Coverage
- Arabic LLM Training Data — How training data composition affects dialect performance
- Arabic Chatbots — Dialect requirements for conversational AI
- Arabic Speech Recognition — ASR dialect challenges and solutions
- MSA vs Dialects Encyclopedia — Linguistic classification and variation
- Arabic Morphological Analysis — Computational morphology approaches
- CAMeL Tools — NYU Abu Dhabi NLP toolkit
- Jais — Arabic LLM — Broadest dialect coverage (17 varieties)
- ALLaM — Saudi National Model — Gulf Arabic specialization