CODA — Conventional Orthography for Dialectal Arabic
Analysis of CODA, the computational orthography standard for Arabic dialects developed by CAMeL Lab researchers — covering 28 city dialects and enabling consistent dialectal text processing.
CODA (Conventional Orthography for Dialectal Arabic) addresses a fundamental challenge in Arabic NLP: Arabic dialects have no standard written form. While Modern Standard Arabic has well-established orthographic conventions, dialectal Arabic is written inconsistently — the same word may be spelled differently by different speakers, in different contexts, or even by the same speaker at different times. This orthographic variation creates severe challenges for NLP systems that rely on consistent text representations for accurate processing.
Developed by CAMeL Lab researchers at NYU Abu Dhabi under the direction of Dr. Nizar Habash, CODA establishes computational writing standards for Arabic dialects. Earlier versions targeted specific dialects — Egyptian, Palestinian, Tunisian, Algerian, and Gulf Arabic — while the most recent iteration, CODA-Star, covers 28 city dialects across the Arabic-speaking world. The star variant provides a unified framework that captures both the commonalities shared across dialects and the specific orthographic conventions needed for each variety.
The Orthographic Challenge
The absence of standardized dialectal orthography creates measurable problems for Arabic AI systems. Consider how Egyptian Arabic speakers write the word for “now” — it may appear as “dilwaqti,” “delwa’ti,” “dlw2ty,” or a dozen other variations across social media, messaging platforms, and online forums. Each variation represents the same word, but NLP systems processing these texts encounter them as distinct tokens, fragmenting what should be a single linguistic concept across multiple surface representations.
This fragmentation affects every downstream NLP task. Sentiment analysis systems training on dialectal Arabic encounter artificially sparse feature distributions because the same word appears in multiple orthographic forms. Named entity recognition fails when entity names are spelled inconsistently across training and test data. Text classification accuracy degrades because morphologically and semantically identical words appear as different features due to orthographic variation. Machine translation systems learn fragmented translation tables where a single source concept maps to multiple target representations.
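The fragmentation effect is easy to see in miniature. The sketch below uses a tiny hypothetical variant map (the spellings from the "now" example above, as illustrative ASCII strings) to show how consolidating orthographic variants collapses several surface tokens into a single feature — the mechanism by which standardization densifies feature distributions:

```python
from collections import Counter

# Hypothetical variant map: all spellings of the Egyptian Arabic word
# for "now" collapse to one conventional form (illustrative only).
VARIANT_MAP = {
    "delwa'ti": "dilwaqti",
    "dlw2ty": "dilwaqti",
}

def normalize(token: str) -> str:
    """Map a surface token to its conventional form if known."""
    return VARIANT_MAP.get(token, token)

raw_tokens = ["dlw2ty", "delwa'ti", "dilwaqti", "dlw2ty"]

raw_counts = Counter(raw_tokens)                          # 3 distinct features
norm_counts = Counter(normalize(t) for t in raw_tokens)   # 1 feature, count 4

print(len(raw_counts), len(norm_counts))  # 3 1
```

A classifier seeing the raw tokens must learn from three sparse features; after normalization, all four occurrences contribute evidence to one.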
The scale of the problem is substantial. Arabic has over 30 regional dialects spoken across 22 countries by more than 400 million speakers. Dialectal Arabic dominates informal digital communication — social media posts, messaging conversations, online forum discussions, and user reviews — the exact content categories that commercial Arabic AI applications must process. Without orthographic standardization, every dialectal NLP system must independently learn to map orthographic variants to underlying concepts, duplicating effort and reducing accuracy.
Design Principles
CODA is designed as a computational standard rather than a prescriptive language authority standard. Its goal is not to tell Arabic speakers how they should write their dialect but to provide a consistent representation that NLP systems can process reliably. The standard makes principled decisions about several categories of orthographic ambiguity.
Clitic attachment conventions specify how prefixed particles (prepositions, conjunctions, definite articles) attach to base words. CODA follows MSA conventions for clitic attachment where dialectal writing practice varies, providing consistency that enables morphological analysis tools designed for MSA to partially process dialectal text. The definite article ‘al-’ is written as a prefix following MSA convention, even in dialects where speakers may write it separately.
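A minimal sketch of one such convention, not the official CODA rule set: rejoining a definite article "ال" that a writer has separated from its host word, so downstream tools see the MSA-style attached form:

```python
import re

# Toy normalizer for one clitic convention: a standalone definite
# article "ال" followed by whitespace is reattached to the next word.
# Real CODA attachment rules cover many more clitics and exceptions.
AL_SPLIT = re.compile(r"\bال\s+(?=\w)")

def attach_definite_article(text: str) -> str:
    return AL_SPLIT.sub("ال", text)

print(attach_definite_article("ال بيت كبير"))  # البيت كبير
```

The word boundary keeps the pattern from firing inside words that merely contain the sequence, and already-attached articles pass through unchanged.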
Phonemic representation addresses sounds that exist in dialects but have no standard MSA representation. Gulf Arabic’s emphatic consonants, Egyptian Arabic’s hard ‘g’ sound (replacing MSA ‘j’), and Maghrebi Arabic’s vowel reductions all require orthographic decisions that CODA standardizes. The standard selects Arabic characters that best represent dialectal phonemes while maintaining readability for Arabic speakers familiar with MSA orthographic conventions.
Morphological representation standardizes how dialect-specific grammatical patterns are written. Egyptian Arabic’s negation circumfix (ma-…-sh), Gulf Arabic’s pronominal suffix variations, and Levantine Arabic’s progressive aspect markers each receive consistent orthographic treatment. This morphological standardization enables morphological analysis tools to process dialectal text with improved accuracy, since the orthographic input follows predictable patterns rather than varying by writer.
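Because CODA makes such patterns orthographically predictable, simple analyzers can segment them. The sketch below detects the Egyptian negation circumfix with a deliberately simplified regex (real morphological analysis handles far more stem shapes and spelling variants):

```python
import re

# Illustrative segmentation of the Egyptian Arabic negation circumfix
# ma-...-sh, written in Arabic script as prefixed م and suffixed ش
# (e.g. مكتبش "he didn't write"). The stem pattern is a simplification.
NEG_CIRCUMFIX = re.compile(r"^م(?:ا)?(\w{2,})ش$")

def split_negation(token: str):
    m = NEG_CIRCUMFIX.match(token)
    if m:
        return ("ما", m.group(1), "ش")  # (negator, stem, suffix)
    return None

print(split_negation("مكتبش"))  # ('ما', 'كتب', 'ش')
print(split_negation("مدرسة"))  # None — not a negated form
```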
Root preservation maintains the connection between dialectal word forms and their Arabic roots, enabling root-based NLP techniques (information retrieval, semantic analysis) to operate across dialectal text. When a dialect pronounces a word differently from MSA, CODA preserves the etymological spelling where practical, maintaining the root-pattern transparency that Arabic morphological analysis depends on.
CODA-Star: The Unified Framework
CODA-Star represents the most comprehensive iteration of the CODA framework, extending coverage to 28 city dialects across the Arabic-speaking world. The unified framework architecture establishes core conventions shared across all dialects alongside dialect-specific rules that capture each variety’s unique orthographic requirements.
The 28-city coverage spans the full geographic range of Arabic dialectal variation: from Rabat and Casablanca in Morocco through Tunis, Tripoli, and Cairo in North Africa; through Beirut, Damascus, Amman, and Jerusalem in the Levant; through Baghdad and Basra in Iraq; to Riyadh, Jeddah, Dubai, Abu Dhabi, Doha, and Muscat in the Gulf. This geographic comprehensiveness ensures that CODA-Star provides standardization for every major Arabic dialectal variety, enabling NLP systems to process text from any Arabic-speaking region under a consistent orthographic framework.
The MADAR corpus, also from NYU Abu Dhabi’s CAMeL Lab, provides parallel sentences across 25 city dialects plus English, French, and MSA — directly aligned with CODA-Star’s city-level dialect standardization. The parallel structure enables systematic evaluation of CODA’s impact on NLP accuracy: processing the same semantic content across different dialects with and without CODA standardization quantifies the improvement that consistent orthography provides.
Applications in Arabic AI Systems
CODA-standardized text enables measurable improvements across multiple Arabic NLP tasks. Dialect identification accuracy improves when training data uses consistent orthography, since the classifier learns to distinguish dialects based on linguistic features rather than orthographic artifacts. Machine translation quality between dialects and MSA improves when source text follows standardized conventions, reducing the noise that inconsistent spelling introduces into translation model training. Sentiment analysis on dialectal text achieves higher accuracy when the same sentiment-carrying words are consistently spelled, enabling the classifier to learn from aggregated evidence rather than fragmented orthographic variants.
For Arabic LLM training data curation, CODA provides a framework for normalizing dialectal content before inclusion in training corpora. Jais 2’s training on 17 identified regional dialects, ALLaM’s processing of informal Arabic content, and Falcon Arabic’s native training data emphasis all benefit from orthographic normalization that reduces the vocabulary size by consolidating orthographic variants into standardized forms. Smaller effective vocabularies improve model training efficiency and downstream task performance, since the model can allocate its capacity to learning linguistic patterns rather than memorizing orthographic variants.
Arabic chatbot platforms — Arabot, Maqsam, YourGPT, Thinkstack — that process dialectal customer input benefit from CODA-based preprocessing. Normalizing incoming dialectal text to CODA conventions before intent classification and response generation improves both classification accuracy and response appropriateness. The preprocessing step is computationally lightweight compared to the downstream LLM inference, making it a high-value, low-cost improvement to chatbot pipeline quality.
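The shape of that preprocessing hook can be sketched as follows. Both `coda_normalize` and `classify_intent` here are hypothetical placeholders, not APIs of the platforms named above; the point is only where normalization sits in the pipeline:

```python
# Placeholder normalizer: collapses one illustrative variant (final
# alef maksura ى written for ya ي). A real CODA normalizer is richer.
def coda_normalize(text: str) -> str:
    variants = {"دلوقتى": "دلوقتي"}
    return " ".join(variants.get(t, t) for t in text.split())

# Placeholder keyword classifier standing in for a trained model.
def classify_intent(text: str) -> str:
    return "ask_time" if "دلوقتي" in text else "unknown"

def handle_message(raw: str) -> str:
    normalized = coda_normalize(raw)    # lightweight, runs in microseconds
    return classify_intent(normalized)  # downstream model sees one form

print(handle_message("دلوقتى"))  # ask_time
```

Without the normalization step, the variant spelling would miss the classifier's learned feature entirely.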
CODA and Arabic RAG Systems
Retrieval-augmented generation systems operating on dialectal Arabic document collections face retrieval challenges that CODA directly addresses. When documents in a knowledge base use inconsistent dialectal spelling and user queries use yet another orthographic variant, the embedding-based retrieval system may fail to match semantically identical content. CODA normalization at both indexing time (standardizing knowledge base documents) and query time (standardizing user queries) reduces these retrieval failures by ensuring that the same concept produces consistent embeddings regardless of the original orthographic form.
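The key design point — the same normalizer applied on both the indexing path and the query path — can be sketched with a toy retriever. `normalize` stands in for a real CODA normalizer, and scoring is token overlap rather than embeddings:

```python
# Toy variant map; a real CODA normalizer replaces this lookup.
def normalize(text: str) -> str:
    variants = {"هاذا": "هذا", "دلوقتى": "دلوقتي"}
    return " ".join(variants.get(t, t) for t in text.split())

def build_index(docs):
    # Store normalized token sets so indexing and querying agree.
    return [(doc, set(normalize(doc).split())) for doc in docs]

def retrieve(index, query):
    q = set(normalize(query).split())
    scored = [(len(q & toks), doc) for doc, toks in index]
    return max(scored)[1]

index = build_index(["هاذا الكتاب مفيد", "الطقس حار"])
print(retrieve(index, "هذا الكتاب"))  # matches despite spelling mismatch
```

Normalizing only one side would leave the mismatch in place: a query spelled one way would still fail against documents spelled another.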
The morphological variant matching challenge in Arabic RAG compounds with orthographic variation in dialectal text. Without CODA, a retrieval system must handle both morphological variants (different inflected forms of the same root) and orthographic variants (different spellings of the same dialectal word) simultaneously. CODA eliminates the orthographic dimension, reducing the retrieval challenge to morphological variant handling alone — a problem that Arabic NLP tools like CAMeL Tools, MADAMIRA, and YAMAMA are specifically designed to address.
Arabizi and CODA Interaction
Arabizi — Arabic written in Latin characters — represents a separate orthographic challenge that interacts with CODA’s standardization goals. Arabizi text must first be transliterated to Arabic script before CODA conventions can be applied, and the transliteration process itself introduces ambiguity because Arabizi lacks the one-to-one character correspondence that would enable deterministic conversion. The numeral substitutions used in Arabizi (2 for hamza, 3 for ain, 7 for ha) provide partial disambiguation, but significant ambiguity remains.
Jais 2’s training on substantial Arabizi content and ALLaM’s processing of informal digital communication both encounter this interaction. CODA provides the target orthographic standard for Arabizi-to-Arabic transliteration — the question is not just “what Arabic characters do these Latin characters represent?” but “what is the standardized dialectal spelling of the resulting Arabic word?” CODA answers the second question, enabling a two-stage pipeline (Arabizi transliteration followed by CODA normalization) that produces consistently processable Arabic text from the most informal and orthographically variable input.
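The two-stage pipeline can be sketched as below. The character table is a toy subset (real Arabizi transliteration is ambiguous and needs context models, as noted above), and the stage-two lexicon is an illustrative stand-in for CODA normalization — here restoring the etymological qaf that Egyptian speakers render as a glottal stop, per the root-preservation principle:

```python
# Stage 1: toy Arabizi-to-Arabic character table. The numerals follow
# the conventions named above (2 for hamza, 3 for ain, 7 for ha).
ARABIZI_CHARS = {
    "2": "ء", "3": "ع", "7": "ح",
    "d": "د", "l": "ل", "w": "و", "t": "ت", "y": "ي",
}

def transliterate(arabizi: str) -> str:
    # Characters outside the toy table (e.g. short vowels) are dropped.
    return "".join(ARABIZI_CHARS.get(c, "") for c in arabizi.lower())

# Stage 2: illustrative lexicon lookup standing in for CODA rules;
# maps the phonetic rendering دلوءتي to the conventional دلوقتي,
# restoring qaf where the dialect has a glottal stop.
def coda_normalize(word: str) -> str:
    lexicon = {"دلوءتي": "دلوقتي"}
    return lexicon.get(word, word)

raw = transliterate("dlw2ty")
print(raw, "→", coda_normalize(raw))
```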
Research Impact and Community Adoption
CODA’s influence on Arabic NLP research extends beyond its direct application in text processing. The standard has shaped how researchers annotate Arabic dialectal corpora, how they evaluate dialectal NLP systems, and how they think about the relationship between orthographic representation and linguistic analysis. Publications referencing CODA number in the hundreds, with citations spanning Arabic morphological analysis, dialect identification, machine translation, sentiment analysis, and information retrieval — confirming the standard’s cross-cutting impact on the field.
The NADI shared task series for Arabic dialect identification uses CODA-informed evaluation criteria, linking the orthographic standard to the community evaluation campaigns that drive dialect processing improvement. Models performing well on NADI tasks increasingly incorporate CODA normalization as a preprocessing step, validating the standard’s practical value for competition-level Arabic dialect processing.
The Arabic AI ecosystem’s investment trajectory — $858 million in AI VC during 2025, Saudi Arabia’s $9.1 billion in AI funding, the UAE AI market growing at 22 percent CAGR — creates commercial demand for Arabic NLP infrastructure that handles dialectal text reliably. CODA provides the orthographic foundation that this commercial infrastructure requires, translating decades of Arabic computational linguistics research into practical tools that production Arabic AI systems depend on.
Limitations and Open Challenges
Despite its comprehensive coverage and principled design, CODA faces inherent limitations that affect its practical application. The standard necessarily makes choices that favor some dialectal writing traditions over others — a CODA convention that aligns with Egyptian Arabic writing habits may feel unfamiliar to Maghrebi Arabic writers, potentially reducing adoption in communities that perceive the standard as privileging certain dialects. The CAMeL Lab team addresses this concern through the city-level granularity of CODA-Star, which provides dialect-specific conventions rather than imposing a single standard across all varieties.
Automatic CODA normalization — converting free-form dialectal Arabic into CODA-compliant text — is itself an NLP task with imperfect accuracy. Normalization errors introduced during CODA preprocessing can propagate through downstream NLP pipelines, potentially degrading rather than improving system accuracy for text that the normalizer handles poorly. The interaction between CODA normalization errors and downstream model performance requires careful evaluation in production deployments, with monitoring systems that detect normalization failures and route affected inputs to alternative processing paths.
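One common shape for that guard — accept the normalizer's output only when a confidence score clears a threshold, otherwise pass the raw text through for alternative handling — can be sketched as follows. The normalizer and its confidence scores here are hypothetical placeholders:

```python
# Placeholder normalizer returning (candidate, confidence). A real
# system would get the score from the normalization model itself.
def normalize_with_confidence(text: str):
    known = {"دلوقتى": ("دلوقتي", 0.95)}
    return known.get(text, (text, 0.2))

def guarded_normalize(text: str, threshold: float = 0.8) -> str:
    candidate, conf = normalize_with_confidence(text)
    if conf >= threshold:
        return candidate
    # Fallback path: route the raw input downstream unchanged
    # (and, in production, flag it for monitoring/review).
    return text

print(guarded_normalize("دلوقتى"))      # confident → normalized form
print(guarded_normalize("كلمة نادرة"))  # low confidence → unchanged
```

The fallback keeps normalization errors from silently propagating: text the normalizer handles poorly reaches the downstream model in its original form rather than a corrupted one.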
The relationship between CODA and evolving Arabic digital communication practices introduces a temporal challenge. As Arabic speakers develop new dialectal writing conventions — driven by platform-specific constraints, generational shifts, and cross-dialectal influence through social media — CODA must adapt to remain relevant. The standard’s computational orientation enables periodic updates, but the pace of dialectal orthographic evolution in informal digital communication may outrun the standard’s revision cycle. Future CODA iterations will need to balance stability (maintaining consistent NLP performance) with currency (reflecting actual dialectal writing practice).
CODA in Production Arabic AI Systems
Production Arabic AI systems increasingly incorporate CODA normalization as a standard preprocessing step alongside morphological analysis, diacritization, and dialect identification. The computational cost of CODA normalization is negligible relative to downstream LLM inference — adding milliseconds of preprocessing time to avoid the accuracy degradation that orthographic variation causes throughout the NLP pipeline. For Arabic RAG systems, CODA normalization at both indexing and query time is particularly valuable, ensuring consistent vector representations that maximize retrieval accuracy across dialectal documents. The investment in CODA preprocessing pays compound returns across every downstream task that benefits from orthographic consistency.
Related Coverage
- CAMeL Tools — Comprehensive Arabic NLP toolkit
- Arabic Morphological Analysis — Root extraction and POS tagging
- MSA vs Dialects — Linguistic classification
- Arabic Dialect Coverage — LLM dialect performance
- Arabic Chatbots — Dialectal chatbot deployment
- RAG for Arabic — Retrieval-augmented generation
- Arabic Tokenization — Token design for Arabic
- Arabic LLM Training Data — Corpus composition