coreOCR-7B-050325-preview

The coreOCR-7B-050325-preview model is a fine-tuned version of Qwen/Qwen2-VL-7B, optimized for Document-Level Optical Character Recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting. Designed with a focus on high-fidelity visual-textual comprehension, this model enhances document parsing, structured data extraction, and complex visual reasoning.

Key Enhancements

  • Advanced Document-Level OCR: Accurately processes and extracts structured text from complex, multi-page documents, including invoices, forms, and research papers (a prompt sketch follows this list).

  • Enhanced Long-Context Vision-Language Understanding: Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, diagrams, and math content.

  • SoTA Understanding Across Image Resolutions: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.

  • Video Comprehension Beyond 20 Minutes: Capable of high-quality video-based question answering, dialogue generation, and content summarization over long video sequences.

  • Device Control via Visual Commands: With complex reasoning and perception capabilities, it can be integrated with devices like mobile phones or robots for visually grounded automation.
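
As a concrete illustration of the document-level OCR use case above, the sketch below swaps an OCR-oriented instruction into the same message format used by the Quick Start pipeline in the next section. The file path and prompt wording are placeholders, not a prescribed interface:

messages = [
    {
        "role": "user",
        "content": [
            # Local files can be referenced with a file:// URI; this path is a placeholder.
            {"type": "image", "image": "file:///path/to/invoice_page.png"},
            {
                "type": "text",
                "text": "Transcribe all text in this document in reading order. "
                        "Render tables as plain text and use LaTeX for any "
                        "mathematical expressions.",
            },
        ],
    }
]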

Quick Start with Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)

# The processor handles both image preprocessing and chat templating.
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

# A single-turn conversation containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to a prompt string (without tokenizing yet).
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Collect the image/video inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the GPU (assumes a CUDA device is available).
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new completion is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
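
The same entry points accept video, which is how the long-video comprehension claim above is exercised. A minimal sketch follows, reusing the model and processor already loaded; the video path is a placeholder, and the fps and max_pixels values are illustrative trade-offs between coverage and memory:

# Reuse the model/processor loaded above; only the message content changes.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/lecture.mp4",  # placeholder path
                "max_pixels": 360 * 420,  # cap per-frame resolution
                "fps": 1.0,               # sample one frame per second
            },
            {"type": "text", "text": "Summarize the main points of this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=256)

Decoding then follows the same prompt-trimming pattern shown above.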

Training Details

Parameter            Value
-------------------  -----------------------------------------------------
Dataset Size         274,209 samples (modular combination of datasets)
Model Architecture   Qwen2VLForConditionalGeneration
Model Size           8.29B parameters (BF16 safetensors)
Hardware             2 × NVIDIA A100 SXM (32 vCPUs)
Total Disk           160,000 MB (~160 GB)
Training Time        10,390 seconds (~2.89 hours)
Learning Rate        1e-5
Scheduler            Linear decay
Warmup Steps         700
Precision            bfloat16

Details of the open image-text dataset will be published soon.
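
The full training script has not been published. Purely as orientation, the sketch below shows how the hyperparameters in the table could map onto Hugging Face TrainingArguments; the batch size, epoch count, and output directory are invented placeholders, not reported values:

from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters; values not in the
# table above are placeholders.
training_args = TrainingArguments(
    output_dir="./coreOCR-7B-finetune",  # placeholder
    learning_rate=1e-5,                  # from the table above
    lr_scheduler_type="linear",          # linear decay schedule
    warmup_steps=700,                    # from the table above
    bf16=True,                           # bfloat16 precision
    per_device_train_batch_size=4,       # placeholder, not reported
    num_train_epochs=1,                  # placeholder, not reported
)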

Intended Use

This model is intended for:

  • Document analysis and OCR from scanned images, PDFs, and camera input.
  • Image-based question answering (e.g., educational content, diagrams, receipts).
  • Math problem solving and LaTeX generation from handwritten or printed mathematical content (see the sketch after this list).
  • Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
  • Multilingual OCR workflows for cross-lingual business documents and global data digitization.
  • AI agents for mobile/robotic interaction through visual context.
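
For the math-to-LaTeX use case, only the instruction needs to change; a minimal sketch with a placeholder image path:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/handwritten_equation.png"},
            {
                "type": "text",
                "text": "Convert the mathematical content in this image to LaTeX. "
                        "Return only the LaTeX source.",
            },
        ],
    }
]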

Limitations

  • Performance may degrade on extremely noisy or low-resolution images.
  • Not suitable for real-time inference on edge devices due to model size and memory demands.
  • While multilingual, performance on low-resource or rare scripts may vary.
  • Not optimized for high-speed processing of video streams in constrained environments.
  • Contextual understanding depends on visual tokenization parameters; improper configuration may affect output quality (a processor configuration sketch follows this list).
  • Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
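
On the visual-tokenization point above: the Qwen2-VL processor accepts min_pixels and max_pixels arguments that bound how many visual tokens each image consumes. A minimal sketch with commonly used values (the numbers are a fidelity/memory trade-off, not fixed requirements):

from transformers import AutoProcessor

# Each visual token in Qwen2-VL covers a 28x28 pixel area, so these limits
# correspond to a budget of roughly 256-1280 visual tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

Smaller budgets cut memory and latency at the cost of fine-grained text detail, which matters most for dense OCR.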
