![11.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/06COvqws8RSPQVm51EQgh.png)

# **coreOCR-7B-050325-preview**

> The **coreOCR-7B-050325-preview** model is a fine-tuned version of **Qwen/Qwen2-VL-7B**, optimized for **Document-Level Optical Character Recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Designed with a focus on high-fidelity visual-textual comprehension, this model improves document parsing, structured data extraction, and complex visual reasoning.

# Key Enhancements

* **Advanced Document-Level OCR**: Accurately processes and extracts structured text from complex, multi-page documents such as invoices, forms, and research papers.

* **Enhanced Long-Context Vision-Language Understanding**: Supports long-text retrieval and reasoning over documents and multimedia inputs, including dense text blocks, diagrams, and mathematical content.

* **State-of-the-Art Understanding Across Image Resolutions**: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Video Comprehension of 20+ Minutes**: Capable of high-quality video-based question answering, dialogue generation, and content summarization over long video sequences (see the sketch after this list).

* **Device Control via Visual Commands**: With its reasoning and perception capabilities, the model can be integrated with devices such as mobile phones or robots for visually grounded automation.

* **Multilingual OCR Support**: Recognizes and extracts text from images in multiple languages, including English, Chinese, Arabic, Japanese, Korean, Vietnamese, and most European languages.

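As a minimal sketch of the video capability above: `qwen_vl_utils` also accepts video entries in chat messages. The snippet assumes `model` and `processor` are already loaded as in the Quick Start section below; the video path and prompt are placeholder examples, not fixed values.

```python
# Video QA sketch (assumes `model` and `processor` are loaded as in the
# Quick Start section below; the file path and prompt are placeholders).
from qwen_vl_utils import process_vision_info

video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/meeting_recording.mp4"},
            {"type": "text", "text": "Summarize the key points discussed in this video."},
        ],
    }
]

prompt = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[prompt], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

# Generate and decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
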
# Quick Start with Transformers

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model weights and the paired processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

# A single-turn chat message with one image and a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat-formatted prompt and extract the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
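
For document OCR specifically, only the user message needs to change. The following is a minimal sketch that reuses the `model` and `processor` loaded above; the local image path and the prompt wording (including the request for LaTeX-formatted equations) are illustrative assumptions.

```python
# OCR sketch reusing `model` and `processor` from the quick-start snippet above.
# The image path and prompt text are placeholder examples.
ocr_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scanned_page.png"},
            {
                "type": "text",
                "text": (
                    "Transcribe all text in this document in reading order. "
                    "Format any mathematical expressions as LaTeX."
                ),
            },
        ],
    }
]

prompt = processor.apply_chat_template(
    ocr_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(ocr_messages)
inputs = processor(
    text=[prompt], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

# Long documents usually need a larger generation budget than the default 128 tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```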

# Training Details

| Parameter | Value |
|-------------------------|----------------------------------------------------|

> [!note]
> The open dataset image-text responses will be updated soon.

# Intended Use

This model is intended for:

* Document analysis and OCR from scanned images, PDFs, and camera input.
* Image-based question answering (e.g., educational content, diagrams, receipts).
* Math problem solving and LaTeX text generation from handwritten or printed math content.
* Long-context vision-text applications such as multi-slide document retrieval and dense information extraction (a batched sketch follows this list).
* Multilingual OCR workflows for cross-lingual business documents and global data digitization.
* AI agents for mobile/robotic interaction through visual context.

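For multi-page or multi-document workloads, several conversations can be batched through the processor in a single pass. The sketch below assumes the `model` and `processor` from the Quick Start section; the file paths, prompts, and the left-padding setting are illustrative assumptions.

```python
# Batched OCR sketch over several page images (paths and prompts are placeholders).
# Decoder-only generation over a batch generally expects left padding.
processor.tokenizer.padding_side = "left"

pages = ["file:///path/to/page_01.png", "file:///path/to/page_02.png"]
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": page},
                {"type": "text", "text": "Extract all text from this page."},
            ],
        }
    ]
    for page in pages
]

# One chat-formatted prompt per conversation, vision inputs extracted in one call
texts = [
    processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]
image_inputs, video_inputs = process_vision_info(conversations)
inputs = processor(
    text=texts, images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
for page, page_text in zip(pages, processor.batch_decode(trimmed, skip_special_tokens=True)):
    print(page, "->", page_text[:80])
```
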
# Limitations

* Performance may degrade on extremely noisy or low-resolution images.
* Not suitable for real-time inference on edge devices due to model size and memory demands.
* While multilingual, performance on low-resource or rare scripts may vary.
* Not optimized for high-speed processing of video streams in constrained environments.
* Contextual understanding depends on the visual tokenization parameters; an improper configuration may affect output quality (see the sketch after this list).
* Outputs may occasionally include hallucinations or incomplete answers in long-context queries.

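One practical mitigation for the visual-tokenization point above: the Qwen2-VL processor accepts `min_pixels` and `max_pixels` bounds that control how many visual tokens each image is resized into. The specific budgets below are illustrative assumptions; raise `max_pixels` for dense documents, lower it to save memory.

```python
from transformers import AutoProcessor

# Sketch: bound the per-image visual token budget via the processor.
# The 28x28 factor reflects the patch size of Qwen2-VL's visual tokenizer;
# the min/max values here are example settings, not recommended defaults.
min_pixels = 256 * 28 * 28    # lower bound on image area after resizing
max_pixels = 1280 * 28 * 28   # upper bound; higher preserves more detail for dense pages

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```
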
# References

- **DocVLM: Make Your VLM an Efficient Reader**
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)