---
base_model: unsloth/Qwen2-VL-2B-Instruct
tags:
- text-generation-inference
- text-extraction
- transformers
- unsloth/Qwen2-VL-2B-Instruct-16Bit
license: apache-2.0
language:
- en
---

# Uploaded fine-tuned model

- **Developed by:** JackChew
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/Qwen2-VL-2B-Instruct-16Bit

This qwen2_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)


## Model Description
**Qwen2-VL-2B-OCR** (通义千问 / Qwen OCR) is a fine-tuned variant of unsloth/Qwen2-VL-2B-Instruct, specialized in extracting text from images of documents, tables, and payslips. Its primary goal is to extract the complete text from an image while ensuring that no information is missed.

The model performs text-to-text generation from images and handles a range of OCR tasks, including extraction from complex documents with structured layouts, aiming for accurate results with minimal loss of information.

## Intended Use
The model is intended for extracting data from images and documents, especially payslips and tables, without missing any critical details. It can be applied in domains such as payroll systems, finance, and legal document analysis, and in any other field that requires document extraction.

Prompt example:
- **text**: The model works best with the prompt `"Extract all text from image/payslip without miss anything"`.


## Model Benchmark

The results below are for the base Qwen2-VL-2B model, compared against other vision-language models of similar size:

| Benchmark            | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
|----------------------|--------------|---------------|-------------|
| MMMU (val)           | 36.3         | 38.2          | 41.1        |
| DocVQA (test)        | 86.9         | -             | 90.1        |
| InfoVQA (test)       | 58.9         | -             | 65.5        |
| ChartQA (test)       | 76.2         | -             | 73.5        |
| TextVQA (val)        | 73.4         | -             | 79.7        |
| OCRBench             | 781          | 605           | 794         |
| MTVQA                | -            | -             | 20.0        |
| VCR (en, easy)       | -            | -             | 81.45       |
| VCR (zh, easy)       | -            | -             | 46.16       |
| RealWorldQA          | 57.3         | 55.8          | 62.9        |
| MME (sum)            | 1876.8       | 1808.6        | 1872.0      |
| MMBench-EN (test)    | 73.2         | 69.1          | 74.9        |
| MMBench-CN (test)    | 70.9         | 66.5          | 73.5        |
| MMBench-V1.1 (test)  | 69.6         | 65.8          | 72.2        |
| MMT-Bench (test)     | -            | -             | 54.5        |
| MMStar               | 49.8         | 39.1          | 48.0        |
| MMVet (GPT-4-Turbo)  | 39.7         | 41.0          | 49.5        |
| HallBench (avg)      | 38.0         | 36.1          | 41.7        |
| MathVista (testmini) | 46.0         | 39.8          | 43.0        |
| MathVision           | -            | -             | 12.4        |

After fine-tuning, the model has improved significantly at extracting all relevant sections from the payslip, including the **Deductions** section that the base model previously missed.

### Example Output Comparison

![image/png](https://cdn-uploads.huggingface.co/production/uploads/676ed40d25c39d8bd5d6f759/KOAZouqb1qH7toZO6YZsO.png)

#### Fine-tuned Model:
Here is the extracted data from the payslip:

**Employee Information:**
- Date of Joining: 2018-06-23
- Pay Period: August 2021
- Employee Name: Sally Harley
- Designation: Marketing Executive
- Department: Marketing

**Earnings and Deductions:**
| Earnings         | Amount | Deductions        | Amount |
|------------------|--------|-------------------|--------|
| Basic            | 10000  | Provident Fund    | 1200   |
| Incentive        | 1000   | Professional Tax  | 500    |
| House Rent       | 400    | Loan              | 400    |
| Meal Allowance   | 200    |                   | 9500   |

**Total Earnings:** $11,600  
**Total Deductions:** $2,100  
**Net Pay:** $9,500

**Employer Signature**  
**Employee Signature**

---
#### Original Model:
The original model extracted the following data but missed the **Deductions** section:

- **Date of Joining**: 2018-06-23
- **Pay Period**: August 2021
- **Employee Name**: Sally Harley
- **Designation**: Marketing Executive
- **Department**: Marketing
- **Earnings**:
  - Basic: $10,000
  - Incentive Pay: $1,000
  - House Rent Allowance: $400
  - Meal Allowance: $200
- **Total Earnings**: $11,600
- **Total Deductions**: $2,100
- **Net Pay**: $9,500
- **Employer Signature**: [Signature]
- **Employee Signature**: [Signature]
- **This is system-generated payslip**


## Quick Start
Here’s an example code snippet to get started with this model:

### Loading the Model and Processor
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("JackChew/Qwen2-VL-2B-OCR")
model = AutoModelForImageTextToText.from_pretrained("JackChew/Qwen2-VL-2B-OCR")

```
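
If GPU memory is limited, you can instead load the weights in half precision and let `transformers` place them on available devices automatically. `torch_dtype` and `device_map` are standard `from_pretrained` arguments; `device_map="auto"` additionally requires the `accelerate` package. A sketch:

```python
import torch

# Optional: load in float16 and map layers to available devices
# to roughly halve the memory needed for the weights
model = AutoModelForImageTextToText.from_pretrained(
    "JackChew/Qwen2-VL-2B-OCR",
    torch_dtype=torch.float16,
    device_map="auto",
)
```

If you load the model this way, you can skip the `model.to("cuda")` call shown later.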
### Loading an Image
```python
# Load your image
from PIL import Image
image_path = "xxxxx"  # Replace with your image path
image = Image.open(image_path)
```
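
If the image lives at a URL rather than on disk, the equivalent with `requests` looks like this (the URL below is a placeholder):

```python
# Alternative: fetch the image over HTTP (placeholder URL)
import requests
from io import BytesIO

url = "https://example.com/payslip.png"  # replace with a real image URL
image = Image.open(BytesIO(requests.get(url).content))
```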
### Preparing the Model, Preprocessing Inputs, and Performing Inference
```python
# Move the model to the GPU for inference
model = model.to("cuda")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "extract all data from this payslip without miss anything",
            },
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>extract all data from this payslip without miss anything<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')

# Inference: generate, then strip the prompt tokens from each output sequence
output_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids = [
    output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
```

### Handling CUDA Memory Issues During Inference

If you run out of CUDA memory during inference, a common workaround is to resize the input image before preprocessing. Qwen2-VL converts images into a variable number of vision tokens based on resolution, so a smaller image means fewer tokens and a smaller memory footprint.

```python
# Resize the image to reduce its size (e.g., scale to half its original size)
image = image.resize((image.width // 2, image.height // 2))
```
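
Alternatively, the Qwen2-VL processor accepts `min_pixels` and `max_pixels` arguments that cap how many vision tokens an image is converted into, so you can bound memory without resizing the image yourself. A sketch with illustrative budgets (each 28x28-pixel patch becomes one vision token):

```python
from transformers import AutoProcessor

# Bound the per-image token budget at the processor level;
# the exact pixel limits here are illustrative, not prescribed
min_pixels = 256 * 28 * 28
max_pixels = 1024 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "JackChew/Qwen2-VL-2B-OCR", min_pixels=min_pixels, max_pixels=max_pixels
)
```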

## Model Fine-Tuning Details
The model was fine-tuned with the Unsloth framework, which accelerated training by 2x, together with Hugging Face's TRL (Transformer Reinforcement Learning) library. LoRA (Low-Rank Adaptation) was applied so that only a small subset of the parameters was trained, significantly reducing training time and compute. Fine-tuning covered both the vision and language layers, so the model can handle complex OCR tasks efficiently.

Total Trainable Parameters: 57,901,056
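
For reference, a minimal sketch of what such a LoRA setup looks like with Unsloth's vision API; the hyperparameters below are illustrative assumptions, not the configuration actually used for this model:

```python
from unsloth import FastVisionModel

# Illustrative Unsloth LoRA setup for Qwen2-VL (values are assumptions)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct",
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # the card notes both vision and
    finetune_language_layers=True,  # language layers were fine-tuned
    r=16,                           # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0,
)
```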

## Hardware Requirements
To run this model, a GPU with at least 16 GB of VRAM is recommended. Training requires significantly more memory than inference, so smaller batch sizes or gradient accumulation may be necessary on GPUs with less memory, as sketched below.
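
Gradient accumulation splits one effective batch across several forward/backward passes, trading wall-clock time for memory; the values below are illustrative:

```python
from transformers import TrainingArguments

# Illustrative memory-saving settings: per-device batch of 1 accumulated
# over 8 steps gives an effective batch size of 8
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,  # half-precision training further reduces memory
)
```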

## Model Architecture

For more detail on the model's architecture and full specifications, see the base model's page on Hugging Face:
[Qwen2-VL-2B-Instruct Model Page](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)