AtlasOCR: The First Open-Source Darija OCR Model

Model Description

AtlasOCR is the first open-source Optical Character Recognition (OCR) model specifically designed for Darija (Moroccan Arabic). It is built by fine-tuning the Qwen2.5-VL 3B Vision Language Model (VLM) using a comprehensive dataset of synthetic and real-world Darija text. AtlasOCR excels at extracting text from images, supporting a wide range of applications from digital preservation to social media analysis and accessibility for Moroccan content.

Key Features:

  • First Open-Source Darija OCR: Addresses a critical gap for developers and organizations working with Moroccan content.
  • Vision Language Model (VLM) Based: Leverages the power of VLMs to interpret both visual layout and linguistic context.
  • Efficient Fine-tuning: Utilizes QLoRA and Unsloth for parameter-efficient training, making it accessible on limited hardware.
  • State-of-the-Art Performance: Achieves high accuracy on Darija text and generalizes well to standard Arabic OCR tasks.
  • Comprehensive Data Curation: Trained on a unique dataset combining synthetic data from OCRSmith and curated real-world sources (scanned books, social media, educational documents, cookbooks).

Intended Use

AtlasOCR is intended for:

  • Text Extraction: Extracting Darija text from images, including social media posts, handwritten notes, scanned documents, and other visual content.
  • Digital Preservation: Converting historical Moroccan documents and manuscripts into digital, searchable formats.
  • Social Media Analysis: Understanding public discourse and sentiment in Darija-speaking communities.
  • Accessibility: Making visual content accessible to screen readers for individuals with visual impairments.
  • Research: Enabling large-scale text analysis of Moroccan content for linguistic and social studies.
  • As a Base Model: Further fine-tuning for specialized Darija OCR tasks or other VLM applications.

Limitations

  • Diacritics Handling: The model is trained and evaluated primarily on undiacritized text, so its accuracy when recognizing or reconstructing Arabic diacritics (harakat) may vary.
  • Complex Layouts: While robust to many layouts, performance may degrade on highly complex, non-standard, or extremely cluttered document structures.
  • Language Specificity: Optimized for Darija and standard Arabic script. Performance on other Arabic dialects or languages using different scripts may not be optimal.

Model Details

Model Architecture

AtlasOCR is based on the Qwen2.5-VL 3B architecture, a Vision Language Model (VLM). A VLM of this kind has three main components:

  1. Vision Encoder: Converts images into vector embeddings capturing visual properties.
  2. Modality Projection Module: Aligns visual features with the language model's representation space.
  3. Language Model: Processes aligned embeddings and text input to generate natural language outputs.
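
The data flow through these three components can be pictured with a toy sketch. Everything in it is an illustrative assumption: the sizes (`VISION_DIM`, `TEXT_DIM`, the patch shape), the random weights, and the function names are hypothetical and do not reflect Qwen2.5-VL's actual implementation. It only shows how image patches become embeddings, get projected into the language model's space, and end up in one sequence with the prompt.

```python
import random

VISION_DIM = 8  # hypothetical width of the vision encoder output
TEXT_DIM = 4    # hypothetical width of the language model embeddings


def linear(vectors, out_dim, seed):
    """Apply a fixed random linear map to each vector (stand-in for learned weights)."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights] for vec in vectors]


def vision_encoder(patches):
    # 1. Flattened image patches -> visual embeddings
    return linear(patches, VISION_DIM, seed=0)


def modality_projection(visual_embeddings):
    # 2. Visual embeddings -> the language model's embedding space
    return linear(visual_embeddings, TEXT_DIM, seed=1)


def build_language_model_input(projected, prompt_embeddings):
    # 3. The language model sees one sequence: image tokens, then prompt tokens
    return projected + prompt_embeddings


patches = [[0.1] * 16 for _ in range(6)]        # 6 flattened patches of 16 values
prompt = [[0.0] * TEXT_DIM for _ in range(3)]   # 3 prompt token embeddings
sequence = build_language_model_input(modality_projection(vision_encoder(patches)), prompt)
print(len(sequence), len(sequence[0]))
```

In a real VLM the last step feeds this sequence to a transformer decoder that generates the output text token by token.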

Training Data

The model was fine-tuned on a unique and extensive dataset of Darija text, totaling 30,092 samples and 10.7 million words. The dataset composition is:

  • ~86% Synthetic Data: Generated using OCRSmith, an open-source toolkit for simulating real-world conditions (fonts, layouts, backgrounds, distortions).
  • ~14% Real-World Data: Curated from diverse sources:
    • Scanned Darija books (e.g., العَرَبِيَّةُ الدَّارِجَةُ by Mohammed El-Madlaoui El-Mounabhi, علشان الصغيرة والصغير by Farouk ElMarrakchi).
    • Social media images (poster-style PDFs with educational material).
    • Educational documents (e.g., driving license exams).
    • Cookbooks (scanned recipes in Darija).

Real-world data was pseudo-labeled using Gemini 2.0 Flash and then human-annotated using Argilla for quality control.
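
As a quick sanity check on the dataset composition above, the stated figures can be turned into approximate counts. This is only arithmetic on the numbers given in this section; because the shares are quoted as "~86%" and "~14%", the resulting counts are estimates, not the exact split.

```python
TOTAL_SAMPLES = 30_092
TOTAL_WORDS = 10_700_000
SYNTHETIC_SHARE = 0.86  # quoted as "~86%", so the counts below are approximate


def approximate_split(total: int, synthetic_share: float) -> tuple:
    """Return (synthetic, real_world) sample counts implied by the quoted share."""
    synthetic = round(total * synthetic_share)
    return synthetic, total - synthetic


synthetic, real_world = approximate_split(TOTAL_SAMPLES, SYNTHETIC_SHARE)
words_per_sample = TOTAL_WORDS / TOTAL_SAMPLES
print(synthetic, real_world, round(words_per_sample))
```

This works out to roughly 25,900 synthetic and 4,200 real-world samples, with an average of about 356 words per sample.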

Training Strategy

  • Base Model: Qwen2.5-VL 3B
  • Parameter-Efficient Fine-tuning:
    • QLoRA (Quantized Low-Rank Adaptation): Enabled fine-tuning of the 4-bit quantized model, significantly reducing memory requirements.
    • Unsloth: Accelerated training by up to 5x and reduced memory usage by 60% through optimized GPU kernels.
  • Key Hyperparameters (from ablation studies):
    • LoRA Rank (r) and Alpha (α): both set to 128
    • LoRA Dropout: 0.05
    • Precision: 4-bit quantization
    • Learning Rate: 2e-4 (with batch size 16 and gradient accumulation)
    • Vision Layer Freezing: No (vision layers were fine-tuned for better performance).
    • RSLoRA: Not enabled (showed degradation in performance for this task).
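
The hyperparameters above can be collected into a small configuration object. This is a minimal sketch, not the authors' training script: the class name `LoraSettings` and the method `peft_kwargs` are hypothetical, and the mapping onto Unsloth's `FastVisionModel.get_peft_model` keyword arguments is an assumption about how such a setup is typically wired. The gradient accumulation step count is not stated in the source, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Values taken from the hyperparameter list; names are illustrative."""
    r: int = 128
    lora_alpha: int = 128
    lora_dropout: float = 0.05
    use_rslora: bool = False             # RSLoRA degraded performance on this task
    finetune_vision_layers: bool = True  # vision layers were not frozen
    load_in_4bit: bool = True            # QLoRA: 4-bit quantized base model
    learning_rate: float = 2e-4
    batch_size: int = 16

    def peft_kwargs(self) -> dict:
        """Keyword arguments for the LoRA adapter setup (assumed mapping)."""
        return {
            "r": self.r,
            "lora_alpha": self.lora_alpha,
            "lora_dropout": self.lora_dropout,
            "use_rslora": self.use_rslora,
            "finetune_vision_layers": self.finetune_vision_layers,
        }


settings = LoraSettings()
print(settings.peft_kwargs())

# Assumed usage with Unsloth (requires a GPU and the unsloth package):
# model = FastVisionModel.get_peft_model(model, **settings.peft_kwargs())
```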

Evaluation

AtlasOCR was evaluated using Character Error Rate (CER) and Word Error Rate (WER) on two benchmarks:

  1. AtlasOCRBench (in-house, available on Hugging Face):

    • Composition: 251 samples, including 55 from scanned Darija books and synthetic data from OCRSmith.
    • Curation: Two-step pseudo-labeling with Gemini 2.0 Flash and human annotation with Argilla.
    • Normalization: Removal of Arabic diacritics and whitespace normalization before metric calculation.
    • Primary Metric: CER, as it better reflects accuracy in Darija due to its non-standardized spelling.
  2. KITAB-Bench (Public):

    • A large-scale, multi-domain benchmark for Arabic OCR and document understanding (8,800+ samples).
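
To make the metrics concrete, here is a minimal sketch of CER and WER with the normalization described above (diacritic removal and whitespace normalization). It is not the benchmark's actual evaluation code. In particular, the set of characters treated as diacritics (U+064B to U+0652 plus U+0670) is an assumption, since the exact set used by AtlasOCRBench is not given here, and the function names are illustrative.

```python
import re

# Assumption: harakat U+064B-U+0652 plus superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize(text: str) -> str:
    """Strip Arabic diacritics, then collapse runs of whitespace."""
    text = _DIACRITICS.sub("", text)
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character edits divided by the number of reference characters."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word edits divided by the number of reference words."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

The sketch also shows why CER is the more forgiving metric for non-standardized spelling: a single-letter spelling variant costs one character edit under CER but counts as a whole wrong word under WER.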

Evaluation Results

AtlasOCRBench Performance

KITAB-Bench Performance

AtlasOCR demonstrates strong performance on both Darija-specific challenges and general Arabic OCR tasks, competing effectively with larger models.

How to Use

Installation

pip install unsloth

Inference

from PIL import Image
from unsloth import FastVisionModel
import torch

class AtlasOCR:
    def __init__(self, model_name: str = "atlasia/AtlasOCR-v0", max_tokens: int = 2000):
        self.model, self.processor = FastVisionModel.from_pretrained(
            model_name,
            device_map="auto",
            load_in_4bit=True,
            use_gradient_checkpointing="unsloth"
        )
        # Put the Unsloth model into inference mode before calling generate().
        FastVisionModel.for_inference(self.model)
        self.max_tokens = max_tokens
        self.prompt = "Extract the text in the image. Give me the final text, nothing else."

    def prepare_inputs(self, image: Image.Image):
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                    },
                    {"type": "text", "text": self.prompt},
                ],
            }
        ]

        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        inputs = self.processor(
            image,
            text,
            add_special_tokens=False,
            return_tensors="pt",
        )
        return inputs

    def predict(self, image: Image.Image) -> str:
        inputs = self.prepare_inputs(image)
        inputs = inputs.to("cuda")

        # The attention mask is cast to float32 before generation.
        inputs['attention_mask'] = inputs['attention_mask'].to(torch.float32)

        generated_ids = self.model.generate(**inputs, max_new_tokens=self.max_tokens, use_cache=True)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        return output_text[0]

    def __call__(self, image: Image.Image) -> str:
        return self.predict(image)


if __name__ == "__main__":
    atlasocr = AtlasOCR()
    img = Image.open("img.png")
    output = atlasocr(image=img)
    print(output)
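
For more than one image, a small helper can walk a folder and collect the recognized text per file. This is a sketch under stated assumptions: `ocr_folder`, `IMAGE_SUFFIXES`, and the list of accepted extensions are hypothetical, and the OCR step is passed in as a callable so the helper itself depends only on the standard library.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")  # assumption: extensions to treat as images


def ocr_folder(
    folder: str,
    ocr_fn: Callable[[Path], str],
    suffixes: Tuple[str, ...] = IMAGE_SUFFIXES,
) -> Dict[str, str]:
    """Run ocr_fn on every image file in folder and map file name -> text."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-directories and files that are not images.
        if path.is_file() and path.suffix.lower() in suffixes:
            results[path.name] = ocr_fn(path)
    return results
```

With the class above, a call could look like `ocr_folder("scans", lambda p: atlasocr.predict(Image.open(p)))`, where `"scans"` is a placeholder folder name.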

Ethical Considerations and Bias

While AtlasOCR aims to be a valuable tool, it may carry biases inherited from its training data:

  • Language Coverage: The model is specialized for Darija. Applying it to other languages or Arabic dialects without further fine-tuning might result in suboptimal performance or misinterpretations.
  • Content Bias: The real-world data sources (books, social media, educational materials) may reflect specific cultural or societal perspectives present in Moroccan content. Users should be mindful of this when interpreting results, especially in sensitive contexts.
  • Privacy: As with any OCR system, care should be taken when processing images containing personal or sensitive information. Users are responsible for ensuring compliance with privacy regulations.

Authors and Acknowledgments

AtlasOCR was developed by AtlasIA, a Moroccan AI community dedicated to building open-source AI models and datasets for Moroccan dialects.

  • Special Thanks:
    • The Hugging Face team for providing the platform and resources for open-source AI.
    • The developers of Qwen2.5-VL 3B, Unsloth, and QLoRA for their foundational work.
    • The Argilla team for their collaborative annotation tool.

Project Resources

Support AtlasIA

If you find AtlasOCR useful and wish to support our mission of building open-source AI for Moroccan dialects, please consider donating:

Citation

If you use AtlasOCR in your research, please cite:


@misc{atlasocr2025,
  title={AtlasOCR: Open-Source OCR for Moroccan Darija with Vision–Language Models},
  author={Imane Momayiz and Soufiane Ait Elaouad and Abdeljalil Elmajjodi and Haitame Bouanane},
  year={2025},
  howpublished={\url{https://huggingface.co/atlasia/AtlasOCR}},
  organization={AtlasIA}
}

Contributions

For more information about the AtlasOCR project, visit:
