coreOCR-7B-050325-preview

The coreOCR-7B-050325-preview model is a fine-tuned version of Qwen/Qwen2-VL-7B, optimized for Document-Level Optical Character Recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting. Designed with a focus on high-fidelity visual-textual comprehension, this model enhances document parsing, structured data extraction, and complex visual reasoning.

Key Enhancements

  • Advanced Document-Level OCR: Accurately processes and extracts structured text from complex, multi-page documents, including invoices, forms, and research papers (a prompt sketch follows this list).

  • Enhanced Long-Context Vision-Language Understanding: Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, diagrams, and math content.

  • SoTA Understanding Across Image Resolutions: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.

  • Video Comprehension Beyond 20 Minutes: Capable of high-quality video-based question answering, dialogue generation, and content summarization over long video sequences.

  • Device Control via Visual Commands: With complex reasoning and perception capabilities, it can be integrated with devices like mobile phones or robots for visually grounded automation.
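
As a concrete illustration of the document-level OCR use case above, the sketch below swaps an OCR-oriented instruction into the same message format used by the Quick Start pipeline in the next section. The file path and prompt wording are placeholders, not a prescribed interface:

messages = [
    {
        "role": "user",
        "content": [
            # Local files can be referenced with a file:// URI; this path is a placeholder.
            {"type": "image", "image": "file:///path/to/invoice_page.png"},
            {
                "type": "text",
                "text": "Transcribe all text in this document in reading order. "
                        "Render tables as plain text and use LaTeX for any "
                        "mathematical expressions.",
            },
        ],
    }
]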

Quick Start with Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)

# The processor handles both image preprocessing and chat templating.
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

# A single-turn conversation containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to a prompt string (without tokenizing yet).
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Collect the image/video inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the GPU (assumes a CUDA device is available).
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new completion is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
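
The same entry points accept video, which is how the long-video comprehension claim above is exercised. A minimal sketch follows, reusing the model and processor already loaded; the video path is a placeholder, and the fps and max_pixels values are illustrative trade-offs between coverage and memory:

# Reuse the model/processor loaded above; only the message content changes.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/lecture.mp4",  # placeholder path
                "max_pixels": 360 * 420,  # cap per-frame resolution
                "fps": 1.0,               # sample one frame per second
            },
            {"type": "text", "text": "Summarize the main points of this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=256)

Decoding then follows the same prompt-trimming pattern shown above.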

Training Details

Parameter            Value
-------------------  -----------------------------------------------------
Dataset Size         274,209 samples (modular combination of datasets)
Model Architecture   Qwen2VLForConditionalGeneration
Model Size           8.29B parameters (BF16 safetensors)
Hardware             2 × NVIDIA A100 SXM (32 vCPUs)
Total Disk           160,000 MB (~160 GB)
Training Time        10,390 seconds (~2.89 hours)
Learning Rate        1e-5
Scheduler            Linear decay
Warmup Steps         700
Precision            bfloat16

Details of the open image-text dataset will be published soon.
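
The full training script has not been published. Purely as orientation, the sketch below shows how the hyperparameters in the table could map onto Hugging Face TrainingArguments; the batch size, epoch count, and output directory are invented placeholders, not reported values:

from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters; values not in the
# table above are placeholders.
training_args = TrainingArguments(
    output_dir="./coreOCR-7B-finetune",  # placeholder
    learning_rate=1e-5,                  # from the table above
    lr_scheduler_type="linear",          # linear decay schedule
    warmup_steps=700,                    # from the table above
    bf16=True,                           # bfloat16 precision
    per_device_train_batch_size=4,       # placeholder, not reported
    num_train_epochs=1,                  # placeholder, not reported
)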

Intended Use

This model is intended for:

  • Document analysis and OCR from scanned images, PDFs, and camera input.
  • Image-based question answering (e.g., educational content, diagrams, receipts).
  • Math problem solving and LaTeX generation from handwritten or printed mathematical content (see the sketch after this list).
  • Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
  • Multilingual OCR workflows for cross-lingual business documents and global data digitization.
  • AI agents for mobile/robotic interaction through visual context.
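
For the math-to-LaTeX use case, only the instruction needs to change; a minimal sketch with a placeholder image path:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/handwritten_equation.png"},
            {
                "type": "text",
                "text": "Convert the mathematical content in this image to LaTeX. "
                        "Return only the LaTeX source.",
            },
        ],
    }
]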

Limitations

  • Performance may degrade on extremely noisy or low-resolution images.
  • Not suitable for real-time inference on edge devices due to model size and memory demands.
  • While multilingual, performance on low-resource or rare scripts may vary.
  • Not optimized for high-speed processing of video streams in constrained environments.
  • Contextual understanding depends on visual tokenization parameters; improper configuration may affect output quality (a processor configuration sketch follows this list).
  • Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
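
On the visual-tokenization point above: the Qwen2-VL processor accepts min_pixels and max_pixels arguments that bound how many visual tokens each image consumes. A minimal sketch with commonly used values (the numbers are a fidelity/memory trade-off, not fixed requirements):

from transformers import AutoProcessor

# Each visual token in Qwen2-VL covers a 28x28 pixel area, so these limits
# correspond to a budget of roughly 256-1280 visual tokens per image.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

Smaller budgets cut memory and latency at the cost of fine-grained text detail, which matters most for dense OCR.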
