prithivMLmods committed
Commit 6e87927 · verified · 1 Parent(s): 0edfe9f

Update README.md

Files changed (1):
  1. README.md +121 -1
README.md CHANGED
license: apache-2.0
---

![Camel.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/IkSg6Ubp8AKrPsXXgD8cQ.png)

# **Camel-Doc-OCR-062825**

> The **Camel-Doc-OCR-062825** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Document Retrieval**, **Content Extraction**, and **Analysis Recognition**. Built on the Qwen2.5-VL architecture, it strengthens document comprehension through focused training on the Opendoc2-Analysis-Recognition dataset, targeting document analysis and information extraction tasks.

# Key Enhancements

* **Context-Aware Multimodal Extraction and Linking for Documents**: Advanced capability for understanding document context and establishing connections between multimodal elements within documents.

* **Enhanced Document Retrieval**: Designed to efficiently locate and extract relevant information from complex document structures and layouts.

* **Superior Content Extraction**: Optimized for precise extraction of structured and unstructured content from diverse document formats.

* **Analysis Recognition**: Specialized in recognizing and interpreting analytical content, charts, tables, and visual data representations.

* **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

* **Video Understanding up to 20+ minutes**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multimodal reasoning.

* **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation via visual inputs and text-based instructions using contextual understanding and decision-making logic.

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Camel-Doc-OCR-062825")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and prepare the image/video inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
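
For the long-video capability noted under Key Enhancements, the same model and processor can be reused with a video message. The snippet below is a minimal sketch, not an official recipe: it assumes `qwen_vl_utils` handles the video decoding, and the file path, `max_pixels`, and `fps` values are placeholders.

```python
# Minimal video-QA sketch reusing `model` and `processor` from the Quick Start above.
# The video path, max_pixels, and fps values are illustrative placeholders.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Summarize the key points of this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```
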

## Training Details

| Parameter              | Value                                            |
|------------------------|--------------------------------------------------|
| **Dataset Size**       | 108K samples (modular combination of datasets)   |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration`             |
| **Hardware**           | 3 × A40 GPUs (144 GB VRAM total), 27 vCPUs, 150 GB RAM |
| **Total Disk Volume**  | 300,000 MB (~300 GB)                             |
| **Training Time**      | ~12,897 seconds (~3.58 hours)                    |
| **Warmup Steps**       | 750                                              |
| **Precision**          | bfloat16                                         |

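To mirror the bfloat16 precision listed above at inference time, the model can also be loaded with an explicit dtype. This is a small sketch; `attn_implementation="flash_attention_2"` is optional and assumes the `flash-attn` package is installed.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load in bfloat16 to match the training precision in the table above.
# flash_attention_2 is optional and requires the flash-attn package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
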
# Intended Use

This model is intended for:

* Context-aware multimodal extraction and linking for complex document structures.
* High-fidelity document retrieval and content extraction from various document formats (an example prompt follows this list).
* Analysis recognition of charts, graphs, tables, and visual data representations.
* Document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multimodal inputs.
* Multilingual document analysis and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.

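The document-centric use cases above can be exercised by swapping the Quick Start prompt for a task-specific one. The sketch below only changes the message; the file path and prompt wording are illustrative, not a required format.

```python
# Illustrative document-extraction request; the path and prompt wording are examples only.
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scanned_page.png"},
            {
                "type": "text",
                "text": "Extract every table on this page as Markdown and "
                        "return any mathematical expressions in LaTeX.",
            },
        ],
    }
]
# Run `doc_messages` through the same apply_chat_template / process_vision_info /
# generate steps shown in the Quick Start section.
```
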
# Limitations

* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
* Variable accuracy on uncommon or low-resource languages/scripts.
* Long video processing may require substantial memory and is not optimized for streaming applications.
* Visual token settings affect performance; suboptimal configurations can degrade results (a configuration sketch follows this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.

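For the visual-token sensitivity noted in the list above, the processor exposes an explicit pixel budget through the standard Qwen2.5-VL processor options. A minimal sketch; the bounds shown are example values, not tuned recommendations.

```python
from transformers import AutoProcessor

# Constrain the per-image visual-token budget via min/max pixel counts.
# 256*28*28 and 1280*28*28 are example bounds, not tuned recommendations.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Camel-Doc-OCR-062825",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```
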
## References

- **DocVLM: Make Your VLM an Efficient Reader**
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

- **YaRN: Efficient Context Window Extension of Large Language Models**
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

- **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)