CAMeL Tools — NYU Abu Dhabi's Comprehensive Arabic NLP Toolkit

Profile of CAMeL Tools, the open-source Arabic NLP suite from NYU Abu Dhabi's CAMeL Lab — covering morphological analysis, diacritization, dialect identification, and integration with Arabic AI pipelines.

CAMeL Tools is among the most comprehensive open-source toolkits for Arabic natural language processing, developed by the Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi. The lab is directed by Dr. Nizar Habash, one of the world’s foremost experts on Arabic computational linguistics. The toolkit provides production-grade components for morphological analysis, diacritization, dialect identification, sentiment analysis, and other core NLP tasks that Arabic AI applications require.

The camel-tools Python package fits into modern Arabic AI pipelines in three roles: as a preprocessing layer that prepares Arabic text for large language models, as a postprocessing layer that validates and refines LLM outputs, and as a standalone toolkit for Arabic text mining and information extraction.

Core Components

Morphological Analysis and Disambiguation: CAMeL Tools provides state-of-the-art Arabic morphological analysis through CALIMA Star, an extension of the BAMA/SAMA family of morphological analyzers. Given an Arabic word, the analyzer generates all possible morphological analyses (including root, lemma, part-of-speech tag, morphological features, and gloss), and the disambiguator selects the contextually appropriate one. This capability is essential for Arabic NLP because standard Arabic text omits short vowel diacritics, which creates massive ambiguity: an Arabic word has 12 possible morphological analyses on average.
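The analyze-then-disambiguate flow described above can be sketched with a toy lexicon. This is not the CAMeL Tools API: `TOY_LEXICON`, the `Analysis` fields, the counts, and both function names are hypothetical stand-ins. The disambiguator here is a context-free most-frequent-analysis baseline, which is simpler than the contextual selection the toolkit performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    diac: str    # fully diacritized form
    lemma: str
    pos: str
    gloss: str
    count: int   # made-up frequency, for illustration only


# Hypothetical toy lexicon: one undiacritized word, several readings.
TOY_LEXICON = {
    "كتب": [
        Analysis("كَتَبَ", "كَتَبَ", "verb", "he wrote", 60),
        Analysis("كُتُب", "كِتاب", "noun", "books", 35),
        Analysis("كُتِبَ", "كَتَبَ", "verb", "it was written", 5),
    ],
}


def analyze(word):
    """Return every candidate analysis for an undiacritized word."""
    return list(TOY_LEXICON.get(word, []))


def disambiguate(word):
    """Most-frequent-analysis baseline; returns None for unknown words."""
    candidates = analyze(word)
    return max(candidates, key=lambda a: a.count) if candidates else None


best = disambiguate("كتب")
print(len(analyze("كتب")), best.pos, best.gloss)
```

A contextual disambiguator would replace the `count` lookup with a score that depends on the neighboring words, but the two-stage shape (enumerate candidates, then rank them) stays the same.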

Diacritization: CAMeL Tools automatically adds short vowel diacritics (tashkeel) to Arabic text. Diacritization resolves much of the ambiguity of undiacritized Arabic, enabling accurate text-to-speech synthesis, improving machine translation quality, and distinguishing homographs that differ only in their vowel patterns.
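To make the homograph problem concrete, the sketch below runs the process in reverse: it strips diacritics from a few vocalized words and groups the ones that collapse to the same written form. The regex is a stand-in covering only the marks U+064B through U+0652, and the word list is a small hand-picked illustration; the toolkit ships its own dediacritization utilities, which this sketch does not call.

```python
import re
from collections import defaultdict

# Arabic marks U+064B..U+0652: tanween, short vowels, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove short-vowel and related marks, leaving the bare letters."""
    return DIACRITICS.sub("", text)


def group_homographs(diacritized_words):
    """Map each bare form to the distinct vocalized words that share it."""
    groups = defaultdict(set)
    for word in diacritized_words:
        groups[strip_diacritics(word)].add(word)
    # Keep only bare forms that are genuinely ambiguous.
    return {bare: forms for bare, forms in groups.items() if len(forms) > 1}


# "he knew", "knowledge", "flag", "lesson"
words = ["عَلِمَ", "عِلْم", "عَلَم", "دَرْس"]
for bare, forms in group_homographs(words).items():
    print(bare, len(forms))
```

The first three words all reduce to the same three letters, so a diacritizer has to recover which of them the writer meant from context alone.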

Dialect Identification: CAMeL Tools includes dialect identification models that classify Arabic text by regional variety. Dialect-aware pipelines can then route text through dialect-specific components, improving accuracy for applications that serve diverse Arabic-speaking populations.
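The routing idea can be sketched as a small dispatcher. Everything here is an assumption made for illustration: the classifier is a stub returning a label-to-score dictionary, the labels `GLF`, `EGY`, and `MSA` are placeholders, and the 0.6 threshold is arbitrary. A real pipeline would pass in a wrapper around the toolkit's dialect identifier instead of `fake_classifier`.

```python
def route_by_dialect(text, classify, handlers, fallback="MSA", min_score=0.6):
    """Pick a handler from classifier scores, falling back when unsure."""
    scores = classify(text)                       # {label: probability}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    if score < min_score or label not in handlers:
        label = fallback                          # low confidence or no branch
    return label, handlers[label](text)


def fake_classifier(text):
    """Hypothetical stand-in for a dialect identification model."""
    if "شلونك" in text:
        return {"GLF": 0.8, "MSA": 0.2}
    return {"GLF": 0.4, "MSA": 0.35, "EGY": 0.25}


handlers = {
    "GLF": lambda t: f"gulf-pipeline({t})",
    "MSA": lambda t: f"msa-pipeline({t})",
}

print(route_by_dialect("شلونك اليوم", fake_classifier, handlers))
print(route_by_dialect("مرحبا", fake_classifier, handlers))
```

Falling back to a general-purpose branch when the top score is low keeps short or mixed inputs from being forced through the wrong dialect-specific component.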

Named Entity Recognition: CAMeL Tools identifies and classifies person names, locations, organizations, and other named entities in Arabic text. Arabic NER is complicated by the language’s morphology: an entity name may appear with attached prepositions, conjunctions, and the definite article, all of which must be stripped for accurate entity matching.
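The prefix-stripping step can be illustrated with a deliberately naive matcher. This is a heuristic sketch, not how the toolkit's NER works: the clitic lists are partial, the gazetteer is a three-entry example, and the function names are hypothetical. It checks the full token first so that names which merely begin with a clitic-like letter are not damaged.

```python
CONJUNCTIONS = ("و", "ف")
PREPOSITIONS = ("ب", "ل", "ك")
ARTICLE = "ال"


def candidate_forms(token):
    """Yield the token, then forms with leading clitics peeled off in order."""
    yield token
    rest = token
    for group in (CONJUNCTIONS, PREPOSITIONS):
        if len(rest) > 2 and rest.startswith(group):
            rest = rest[1:]
            yield rest
    if len(rest) > 3 and rest.startswith(ARTICLE):
        yield rest[len(ARTICLE):]


def match_entity(token, gazetteer):
    """Return the first candidate form found in the gazetteer, else None."""
    return next((f for f in candidate_forms(token) if f in gazetteer), None)


# Toy gazetteer: Cairo, Dubai, Beirut.
GAZETTEER = {"القاهرة", "دبي", "بيروت"}

for token in ["وبالقاهرة", "بدبي", "بيروت", "كتاب"]:
    print(token, "->", match_entity(token, GAZETTEER))
```

The sketch ignores cases such as the contraction of the preposition ل with the definite article, which is one reason real systems rely on morphological tokenization rather than string prefixes.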

Research Contributions

The CAMeL Lab’s research contributions extend far beyond the toolkit itself. The MADAR Corpus provides parallel sentences covering 25 city dialects — the most geographically comprehensive Arabic dialect resource available. The GUMAR Corpus contains 100 million words of Gulf Arabic, providing the dialect-specific data that Arabic NLP systems require. CODA (Conventional Orthography for Dialectal Arabic) establishes orthographic standards for writing Arabic dialects — a foundational contribution that enables consistent dialectal text processing.

The lab has published extensively on Arabic grammatical error detection and correction, text readability assessment, treebank construction, and machine translation, contributing both datasets and methodologies that the broader Arabic NLP community builds upon.

Integration with Arabic AI Pipelines

In modern Arabic AI systems, CAMeL Tools serves as a critical preprocessing and validation layer. Before text reaches an LLM for reasoning, CAMeL Tools can perform morphological analysis to extract linguistic features, identify the dialect to enable dialect-aware prompt engineering, diacritize text to resolve ambiguities, and extract named entities for structured information retrieval.

After the LLM generates output, CAMeL Tools can validate grammatical correctness, verify that named entities are correctly formed, check dialect consistency, and assess text quality against Arabic linguistic standards.
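The preprocess, generate, validate loop described in these two paragraphs might be wired together as follows. All callables are hypothetical stubs: `preprocess` stands in for dialect identification, morphological analysis, and entity extraction, `generate` stands in for the LLM call, and the single validator is a trivial script check, not a grammatical one.

```python
def run_with_validation(text, preprocess, generate, validators, max_attempts=2):
    """Preprocess once, then call the model until every validator passes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    features = preprocess(text)
    problems = []
    for attempt in range(1, max_attempts + 1):
        # Earlier problems are passed back so the model can be re-prompted.
        output = generate(text, features, problems)
        problems = [msg for check in validators
                    if (msg := check(output, features))]
        if not problems:
            return {"output": output, "attempts": attempt, "problems": []}
    return {"output": output, "attempts": max_attempts, "problems": problems}


def preprocess(text):
    """Stub for linguistic feature extraction."""
    return {"dialect": "MSA"}


def make_generator():
    """Stub LLM that returns a bad draft first, then an acceptable one."""
    drafts = iter(["draft with latin text", "مسودة عربية"])
    return lambda text, features, problems: next(drafts)


def arabic_only(output, features):
    """Return a problem message if the output contains Latin letters."""
    has_latin = any("a" <= ch.lower() <= "z" for ch in output)
    return "output contains Latin script" if has_latin else None


result = run_with_validation("سؤال", preprocess, make_generator(), [arabic_only])
print(result["attempts"], result["problems"])
```

Keeping validators as plain functions that return a message or `None` makes it easy to add checks for entity forms or dialect consistency without touching the loop.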

MADAMIRA and YAMAMA Analyzers

Beyond the CAMeL Tools Python package, the CAMeL Lab maintains two additional morphological analysis systems that serve different deployment requirements. MADAMIRA, which combines the earlier MADA (Morphological Analysis and Disambiguation for Arabic) and AMIRA systems, is a state-of-the-art Arabic morphological tagger that performs diacritization, lemmatization, POS tagging, and NER in a single integrated pipeline. MADAMIRA’s accuracy on MSA text exceeds that of other available systems, making it the standard reference for Arabic morphological analysis in research settings.

YAMAMA was designed specifically as a multi-dialect Arabic morphological analyzer, addressing the limitation that MADAMIRA was optimized primarily for MSA. YAMAMA runs five times faster than MADAMIRA, a performance advantage that matters for production AI systems processing Arabic text in real time. Arabic chatbot platforms, voice AI systems, and interactive applications that require sub-second morphological analysis benefit from YAMAMA’s speed, while batch processing and offline analysis can use MADAMIRA’s higher accuracy.

CaMeL Parser provides dependency parsing trained on the Columbia Arabic Treebank (CATiB), extracting syntactic structure from Arabic sentences. Dependency parsing reveals grammatical relationships, such as which word modifies which and what serves as subject versus object. This information is essential for question answering, relation extraction, and semantic analysis tasks in Arabic AI pipelines.
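As an illustration of what a downstream task does with a dependency parse, the sketch below reads subject and object off a hand-built tree. The parse is written by hand, not produced by the parser, and the `Token` fields and the `SBJ`, `OBJ`, and `---` labels are assumed here in a CATiB-like style for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int      # 1-based position in the sentence
    form: str
    head: int     # index of the governing token, 0 for the root
    deprel: str   # relation label, e.g. SBJ, OBJ, MOD


# Hand-built parse of a three-word sentence: "the boy wrote the letter".
PARSE = [
    Token(1, "كتب", 0, "---"),
    Token(2, "الولد", 1, "SBJ"),
    Token(3, "الرسالة", 1, "OBJ"),
]


def dependents(parse, head_idx, deprel):
    """Forms of all tokens attached to head_idx with the given relation."""
    return [t.form for t in parse if t.head == head_idx and t.deprel == deprel]


def subject_verb_object(parse):
    """Extract a simple predicate-argument triple from the root verb."""
    root = next(t for t in parse if t.head == 0)
    return {
        "verb": root.form,
        "subjects": dependents(parse, root.idx, "SBJ"),
        "objects": dependents(parse, root.idx, "OBJ"),
    }


print(subject_verb_object(PARSE))
```

A relation extraction or question answering component would apply the same lookup to parser output, typically walking further down the tree to collect modifiers of each argument.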

Corpora and Linguistic Resources

The CAMeL Lab maintains several major Arabic corpora that serve as evaluation benchmarks and training resources for the broader Arabic NLP community.

MADAR Corpus — Parallel sentences across 25 city dialects plus English, French, and MSA. This is the most geographically comprehensive Arabic parallel dialect corpus available, covering cities from Rabat (Morocco) to Muscat (Oman). The parallel structure enables dialect identification training, cross-dialect translation evaluation, and systematic analysis of dialectal variation across the Arab world.

GUMAR Corpus — 100 million words of Gulf Arabic collected from online forums and social media. GUMAR provides the largest publicly available Gulf Arabic text collection, enabling training and evaluation of NLP systems targeting the Gulf dialect family (UAE, Saudi, Kuwaiti, Bahraini, Qatari, Omani varieties).

CaMeL Treebank — 188,000 words of syntactically annotated Arabic text spanning from pre-Islamic poetry to contemporary social media. The temporal breadth enables linguistic analysis across Arabic’s full historical range, while the dependency annotation provides gold-standard syntactic structure for parser training and evaluation.

QALB Corpus — 2 million manually corrected Arabic words, providing gold-standard error correction annotations for Arabic grammatical error detection and correction systems. QALB enables training and evaluation of Arabic writing assistance tools that identify and correct grammatical errors in Arabic text.

SAMER Lexicon — 26,000 lemmas annotated for readability in MSA, supporting text complexity assessment for educational applications. SAMER enables Arabic reading level classification, curriculum-appropriate content recommendation, and text simplification systems that adapt Arabic content for different reading levels.

Role in Arabic LLM Ecosystem

CAMeL Tools occupies a foundational position in the Arabic AI ecosystem. The three leading Arabic LLMs — Jais 2 (G42/MBZUAI/Cerebras), ALLaM 34B (HUMAIN), and Falcon-H1 Arabic (TII) — all benefit from Arabic NLP research that CAMeL Lab pioneered. The morphological analysis methodology, dialect identification approaches, and evaluation frameworks that CAMeL Lab established inform the training data curation, model evaluation, and deployment pipeline design of all Arabic LLM development programs.

Jais 2’s training on 17 identified regional dialects relies on dialect identification capabilities that CAMeL Lab research enabled. ALLaM’s engagement of 400 subject matter experts for model testing reflects evaluation methodologies that Arabic NLP research validated. Falcon Arabic’s emphasis on native (non-translated) training data responds to research demonstrating the quality difference between native and translated Arabic content — research that CAMeL Lab contributed to through systematic analysis of Arabic text quality.

The Open Arabic LLM Leaderboard’s version 2 benchmarks — ArabicMMLU, ALRAGE, AraTrust, MadinahQA — evaluate capabilities that CAMeL Lab’s tools help develop and validate. ArabicMMLU’s Arabic language understanding questions directly test the grammatical and morphological knowledge that CAMeL Tools helps models acquire through preprocessing and evaluation.

Agentic AI Integration

In the context of agentic AI frameworks — LangGraph, AutoGen, CrewAI — CAMeL Tools provides the Arabic-specific tool layer that standard frameworks do not include. A LangGraph-based Arabic agent can integrate CAMeL Tools as dedicated processing nodes: a dialect identification node routes input to dialect-specific processing branches, a morphological analysis node enriches text with linguistic features before the reasoning LLM processes it, and a diacritization node prepares generated text for TTS output.
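The node arrangement described above can be mimicked without any agent framework, using plain functions that read and extend a shared state dictionary. This sketch does not use the LangGraph API; `run_graph` and the four node functions are hypothetical, each node is a stub where a CAMeL Tools or LLM call would go, and the flow is linear for brevity where a real graph would branch on the detected dialect.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def run_graph(nodes: List[Node], state: State) -> State:
    """Run nodes in order; each returns a partial update merged into state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


def dialect_node(state: State) -> State:
    return {"dialect": "MSA"}                      # stub for dialect ID


def morphology_node(state: State) -> State:
    return {"tokens": str(state["text"]).split()}  # stub for analysis


def llm_node(state: State) -> State:
    tokens = state["tokens"]
    return {"reply": f"[{state['dialect']}] {len(tokens)} tokens"}  # stub LLM


def diacritize_node(state: State) -> State:
    return {"tts_text": state["reply"]}            # stub for diacritization


final = run_graph(
    [dialect_node, morphology_node, llm_node, diacritize_node],
    {"text": "مرحبا بالعالم"},
)
print(final["reply"])
```

Because each node only returns the keys it adds, swapping a stub for a real toolkit call does not affect the nodes around it.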

CrewAI’s role-based architecture enables CAMeL Tools integration through specialized agent roles. A morphological analysis agent wrapping CAMeL Tools can serve multiple other agents in a crew, providing shared linguistic preprocessing that each downstream agent benefits from. AutoGen’s conversation-based model can assign CAMeL Tools operations to specialized tool-executor agents that provide linguistic analysis services to other agents in the conversation.

Arabic’s 300,000+ possible POS tags and average of 12 morphological analyses per word make CAMeL Tools preprocessing essential for agents that must understand Arabic text accurately. Without morphological preprocessing, Arabic agents must rely solely on the LLM’s implicit morphological knowledge, which varies in quality across models and degrades significantly for dialectal Arabic.

Commercial and Strategic Impact

CAMeL Tools’ open-source availability has lowered the barrier to entry for Arabic AI development across the MENA region. Startups building Arabic NLP products — customer service chatbots, content moderation systems, social media analytics platforms, legal document analyzers — can access production-grade Arabic morphological analysis without building proprietary linguistic infrastructure. This democratization of Arabic NLP capability has contributed to the ecosystem growth that MENA AI investment statistics reflect: $858 million in AI-focused VC during 2025, with the UAE AI market projected to grow from $578 million in 2024 to $4.25 billion by 2033 at a 22.07 percent CAGR.
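The market projection above can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / periods) - 1. The endpoints below are the rounded figures from the text, in millions of dollars. The number of compounding periods is an assumption, since the text does not state the forecast's base year, so both nine and ten periods are shown.

```python
def cagr(start, end, periods):
    """Compound annual growth rate between two values over `periods` years."""
    return (end / start) ** (1 / periods) - 1


nine = cagr(578, 4250, 9)    # 2024 -> 2033 counted as nine annual steps
ten = cagr(578, 4250, 10)    # same endpoints over ten annual steps

print(f"{nine:.1%} over 9 periods, {ten:.1%} over 10 periods")
```

With these rounded endpoints, ten periods comes out near the quoted 22.07 percent, while nine periods gives a rate closer to 25 percent, so the quoted figure appears to assume ten compounding periods.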

CAMeL Lab’s institutional position at NYU Abu Dhabi, an American university operating in the UAE, creates a bridge between Western computational linguistics research and the realities of deployment in the Arabic-speaking world. The lab’s researchers publish at ACL, EMNLP, and other top conferences while maintaining daily immersion in Arabic-speaking environments, producing tools that reflect both computational sophistication and genuine linguistic understanding. This dual positioning has made CAMeL Lab one of the most cited Arabic NLP research groups globally, with citations from research teams across MBZUAI, KAUST, TII, HUMAIN, and dozens of universities across the Arabic-speaking world.

Saudi Arabia’s Year of AI 2026 designation and the kingdom’s 664 operating AI companies create growing demand for Arabic NLP tools that CAMeL Tools helps satisfy. The $10 billion HUMAIN venture fund and $1 billion GAIA Accelerator provide ecosystem funding for startups building on Arabic NLP infrastructure, many of which incorporate CAMeL Tools as a foundational dependency. The tools’ stability, documentation quality, and research backing make them the default choice for Arabic NLP preprocessing in both academic and commercial contexts — a position that influences the entire Arabic AI ecosystem’s technical direction.

Future Development Trajectory

CAMeL Lab’s ongoing research addresses several frontiers in Arabic NLP that will shape future CAMeL Tools releases. Large language model integration — using models like Jais 2, ALLaM, and Falcon Arabic to improve morphological disambiguation accuracy — represents a natural evolution of the toolkit’s disambiguation capabilities. Fine-grained dialect identification, moving beyond country-level classification to city-level distinction (following the NADI shared task trajectory), will improve the toolkit’s dialect-aware processing precision. Multimodal Arabic NLP — integrating text analysis with Arabic OCR, handwriting recognition, and visual document understanding — expands the toolkit’s applicability to document processing workflows that combine visual and linguistic analysis.

The relationship between CAMeL Tools and Arabic LLM development is bidirectional. CAMeL Tools improves LLM training data quality through morphological filtering and normalization. LLMs improve CAMeL Tools accuracy through neural disambiguation models. This virtuous cycle ensures that progress in either domain accelerates progress in the other, contributing to the rapid advancement of Arabic AI capability that the past three years have demonstrated.
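On the data-quality side of this cycle, a common first step is orthographic normalization followed by duplicate removal. The sketch below covers only that step and does no morphological filtering. The specific rules (unifying alef variants, alef maksura, and teh marbuta, and dropping diacritics and tatweel) are one common convention chosen for illustration; the toolkit provides its own normalization utilities, which this standard-library stand-in does not call.

```python
import re

# Diacritics U+064B..U+0652 plus tatweel U+0640.
DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text):
    """Reduce common spelling variation so near-identical lines compare equal."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")    # teh marbuta -> heh
    return " ".join(text.split())              # collapse whitespace


def dedupe(lines):
    """Keep the first line of each group that normalizes to the same key."""
    seen, kept = set(), []
    for line in lines:
        key = normalize(line)
        if key and key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


samples = ["إلى المدرسة", "الى  المدرسه", "كتاب"]
print(len(dedupe(samples)))
```

The original spelling of each kept line is preserved; normalization is used only as the comparison key, so the filter does not alter the training text itself.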

The QALB corpus and SAMER lexicon described above also feed into this work. QALB’s gold-standard error correction annotations apply directly to content quality assurance for Arabic AI outputs, and SAMER’s readability annotations enable Arabic AI systems that adapt content difficulty to reader proficiency. These resources, while smaller than LLM training corpora, provide the annotated linguistic data that evaluation frameworks require to assess model quality on specific Arabic linguistic tasks beyond aggregate benchmark scores.

CAMeL Tools spans morphological analysis, diacritization, dialect identification, named entity recognition, sentiment analysis, and transliteration within a single Python package, giving Arabic AI developers a unified toolkit that removes much of the integration complexity of assembling Arabic NLP capability from disparate sources. This integration advantage, combined with the lab’s research credibility and open-source commitment, positions CAMeL Tools to remain a foundational Arabic NLP toolkit as the Arabic AI ecosystem continues its rapid expansion.
