---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Document
- KIE
- OCR
- VL
- Camel
- Openpdf
- text-generation-inference
- Extraction
- Linking
- Markdown
- .Md
datasets:
- prithivMLmods/OpenDoc-Pdf-Preview
- prithivMLmods/Opendoc1-Analysis-Recognition
- allenai/olmOCR-mix-0225
- prithivMLmods/Openpdf-Analysis-Recognition
license: apache-2.0
---

# **Camel-Doc-OCR-062825**
> **Camel-Doc-OCR-062825** is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Document Retrieval**, **Content Extraction**, and **Analysis Recognition**. Built on the Qwen2.5-VL architecture, it strengthens document comprehension through focused training on the document analysis and recognition datasets listed above, targeting superior document analysis and information extraction.
# Key Enhancements
* **Context-Aware Multimodal Extraction and Linking for Documents**: Advanced capability for understanding document context and establishing connections between multimodal elements within documents.
* **Enhanced Document Retrieval**: Designed to efficiently locate and extract relevant information from complex document structures and layouts.
* **Superior Content Extraction**: Optimized for precise extraction of structured and unstructured content from diverse document formats.
* **Analysis Recognition**: Specialized in recognizing and interpreting analytical content, charts, tables, and visual data representations (see the example payloads after this list).
* **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.
* **Video Understanding up to 20+ Minutes**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multimodal reasoning.
* **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation via visual inputs and text-based instructions using contextual understanding and decision-making logic.
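The message payloads below sketch how the table/chart and video capabilities map onto the chat format used in the Quick Start that follows. The image URL, video path, and prompt wordings are illustrative placeholders, not canonical inputs:

```python
# Hypothetical message payloads; substitute either one for the `messages`
# variable in the Quick Start pipeline below.

# Table/chart extraction from a document image (placeholder URL):
table_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/report_page.png"},
            {"type": "text", "text": "Extract every table on this page as GitHub-flavored Markdown."},
        ],
    }
]

# Long-video summarization (placeholder local path):
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/lecture.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the main points of this video."},
        ],
    }
]
```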
# Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Camel-Doc-OCR-062825")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
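The 128-token cap above is sized for a short caption; full-page transcription or table extraction typically needs a much larger budget. A minimal adjustment, with 2048 as an illustrative rather than tuned value:

```python
# Full-page OCR output is far longer than a demo caption, so raise the
# generation budget; 2048 is an illustrative value, not a tuned one.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
```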
# Intended Use
This model is intended for:
* Context-aware multimodal extraction and linking for complex document structures (see the key-information-extraction sketch after this list).
* High-fidelity document retrieval and content extraction from various document formats.
* Analysis recognition of charts, graphs, tables, and visual data representations.
* Document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multimodal inputs.
* Multilingual document analysis and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.
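Key information extraction (KIE) is driven purely through prompting; there is no dedicated structured-output API. The invoice image and field names below are illustrative assumptions, not part of any training contract:

```python
# Hypothetical KIE prompt: request fields as JSON, then parse the reply.
# Reuses `model` and `processor` from the Quick Start above.
kie_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/invoice.png"},  # placeholder
            {
                "type": "text",
                "text": (
                    "Extract the following fields as a JSON object: "
                    "invoice_number, issue_date, vendor_name, total_amount."
                ),
            },
        ],
    }
]
# Run the same template/encode/generate steps as in the Quick Start, then
# parse with json.loads(output_text[0]); the reply may need cleanup if the
# model wraps the JSON in prose.
```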
# Limitations
* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
* Variable accuracy on uncommon or low-resource languages/scripts.
* Long video processing may require substantial memory and is not optimized for streaming applications.
* Visual token settings affect performance; suboptimal configurations can impact results (see the processor configuration sketch after this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.
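The visual-token budget is controlled through the processor, not the model. A minimal sketch using the standard Qwen2.5-VL knobs, with bounds taken from the Qwen documentation's examples rather than values tuned for this fine-tune:

```python
from transformers import AutoProcessor

# Each visual token covers a 28x28-pixel patch; min/max_pixels bound how
# many tokens an image may consume. These bounds are illustrative defaults.
min_pixels = 256 * 28 * 28   # floor: ~256 visual tokens per image
max_pixels = 1280 * 28 * 28  # ceiling: ~1280 visual tokens per image
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```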
## Training Details
| Parameter | Value |
|-------------------------|-----------------------------------------------------|
| **Dataset Size**         | 108K samples (modular combination of the listed datasets) |
| **Model Architecture**   | `Qwen2_5_VLForConditionalGeneration`                 |
| **Total Disk Volume**    | 300,000 MB (~300 GB)                                 |
| **Training Time**        | ~12,897 seconds (~3.58 hours)                        |
| **Warmup Steps** | 750 |
| **Precision** | bfloat16 |
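To mirror the bfloat16 training precision at inference, the checkpoint can be loaded with an explicit dtype instead of `torch_dtype="auto"`; this is a standard Transformers pattern, not a requirement stated for this model:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load in bfloat16 to match the training precision (hardware permitting).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```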
## References
- **DocVLM: Make Your VLM an Efficient Reader**
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
- **YaRN: Efficient Context Window Extension of Large Language Models**
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
- **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |