---
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Document
- KIE
- OCR
- VL
- Openpdf
- Camel
- text-generation-inference
- Extraction
- Linking
- Markdown
- .Md
- OpenPDF
- OCRmix
- trl
datasets:
- prithivMLmods/OpenDoc-Pdf-Preview
- prithivMLmods/Opendoc1-Analysis-Recognition
- allenai/olmOCR-mix-0225
- prithivMLmods/Openpdf-Analysis-Recognition
license: apache-2.0
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/CZM7u91ww9SJPFQiY7YlI.png)

# **Camel-Doc-OCR-080125 (v2-preview)**

> The **Camel-Doc-OCR-080125** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Document Retrieval**, **Content Extraction**, and **Analysis Recognition**. Built on top of the Qwen2.5-VL architecture, this model enhances document comprehension capabilities with focused training on the Opendoc2-Analysis-Recognition dataset for superior document analysis and information extraction tasks.

## Key Enhancements

* **Context-Aware Multimodal Extraction and Linking for Documents**: Advanced capability for understanding document context and establishing connections between multimodal elements within documents.

* **Enhanced Document Retrieval**: Designed to efficiently locate and extract relevant information from complex document structures and layouts.

* **Superior Content Extraction**: Optimized for precise extraction of structured and unstructured content from diverse document formats.

* **Analysis Recognition**: Specialized in recognizing and interpreting analytical content, charts, tables, and visual data representations.

* **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

* **Video Understanding (20+ Minutes)**: Supports detailed comprehension of videos longer than 20 minutes for content summarization, question answering, and multi-modal reasoning.

* **Visually-Grounded Device Interaction**: Enables mobile or robotic device operation via visual inputs and text-based instructions using contextual understanding and decision-making logic.

## Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-080125", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Camel-Doc-OCR-080125")

# A chat message containing one image and one text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Apply the chat template and prepare the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
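
Since the model is tuned for document OCR and Markdown-style reconstruction, a common follow-on pattern is to point it at a page image and ask for Markdown output. The sketch below reuses the `model`, `processor`, and `process_vision_info` loaded in the Quick Start; the local file path and prompt wording are illustrative assumptions, not a fixed interface.

```python
# Minimal document-to-Markdown sketch (reuses model/processor from the Quick Start).
# The file path and prompt below are illustrative assumptions.
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/page.png"},  # hypothetical page scan
            {
                "type": "text",
                "text": "Convert this document page to clean Markdown, preserving "
                        "headings, lists, and tables.",
            },
        ],
    }
]

doc_text = processor.apply_chat_template(
    doc_messages, tokenize=False, add_generation_prompt=True
)
doc_images, doc_videos = process_vision_info(doc_messages)
doc_inputs = processor(
    text=[doc_text], images=doc_images, videos=doc_videos,
    padding=True, return_tensors="pt",
).to("cuda")

# Use a larger generation budget than the demo so full pages are not truncated
doc_ids = model.generate(**doc_inputs, max_new_tokens=2048)
doc_ids_trimmed = [out[len(inp):] for inp, out in zip(doc_inputs.input_ids, doc_ids)]
markdown_page = processor.batch_decode(
    doc_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(markdown_page)
```

A larger `max_new_tokens` budget matters for dense pages; truncated Markdown usually means the generation budget was too small.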

## Intended Use

This model is intended for:

* Context-aware multimodal extraction and linking for complex document structures.
* High-fidelity document retrieval and content extraction from various document formats (a structured-extraction sketch follows this list).
* Analysis recognition of charts, graphs, tables, and visual data representations.
* Document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multi-modal inputs.
* Multilingual document analysis and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.
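
For the structured-extraction uses above, one workable pattern is to request a fixed JSON schema and parse the reply. The field names, prompt, and image path below are assumptions chosen for illustration, and the snippet again reuses the `model` and `processor` from the Quick Start; because the output format depends entirely on the prompt, parsing is kept defensive.

```python
import json

# Minimal key-information-extraction (KIE) sketch with assumed field names.
kie_prompt = (
    "Extract the following fields from this invoice and return only JSON: "
    '{"invoice_number": "", "date": "", "total_amount": "", "currency": ""}'
)
kie_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/invoice.png"},  # hypothetical scan
            {"type": "text", "text": kie_prompt},
        ],
    }
]

kie_text = processor.apply_chat_template(
    kie_messages, tokenize=False, add_generation_prompt=True
)
kie_images, kie_videos = process_vision_info(kie_messages)
kie_inputs = processor(
    text=[kie_text], images=kie_images, videos=kie_videos,
    padding=True, return_tensors="pt",
).to("cuda")

kie_ids = model.generate(**kie_inputs, max_new_tokens=256)
kie_ids_trimmed = [out[len(inp):] for inp, out in zip(kie_inputs.input_ids, kie_ids)]
reply = processor.batch_decode(kie_ids_trimmed, skip_special_tokens=True)[0]

# The model may wrap the JSON in prose, so fall back to the raw text on parse errors
try:
    fields = json.loads(reply)
except json.JSONDecodeError:
    fields = {"raw_output": reply}
print(fields)
```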

## Limitations

* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
* Variable accuracy on uncommon or low-resource languages or scripts.
* Long video processing may require substantial memory and is not optimized for streaming applications.
* Visual token (image resolution) settings affect performance; suboptimal pixel-budget configurations can degrade results (see the resolution-control sketch after this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.
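
One practical lever for the visual-token limitation noted above is the processor's pixel budget: Qwen2.5-VL processors accept `min_pixels` and `max_pixels` arguments that bound how many visual tokens each image is resized into. The specific values below are illustrative assumptions to be tuned against document density and available GPU memory, not recommended settings.

```python
from transformers import AutoProcessor

# Constrain the per-image visual token count via the pixel budget.
# Each visual token corresponds to a 28x28 pixel patch, so these bounds
# allow roughly 256 to 1280 visual tokens per image (illustrative values only).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-080125",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```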

---

## Training Details

| Parameter              | Value                                         |
| ---------------------- | --------------------------------------------- |
| **Dataset Size**       | 230K samples (modular combination of datasets) |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration`           |
| **Total Disk Volume**  | 400,000 MB (~400 GB)                           |
| **Training Time**      | approx. 9,360 ± 120 seconds (~2.6 hours)       |
| **Warmup Steps**       | 750                                           |
| **Precision**          | bfloat16                                      |

---

## References

* **DocVLM: Make Your VLM an Efficient Reader**
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

* **YaRN: Efficient Context Window Extension of Large Language Models**
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

* **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

* **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

* **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)