luquiT4 committed on
Commit c3b9e2b · verified · 1 Parent(s): f21c632

Update README.md

Files changed (1)
  1. README.md +185 -44
README.md CHANGED
@@ -19,72 +19,213 @@ library_name: transformers
  ---

- # Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

- <a href="https://github.com/bytedance/Dolphin"><img src="https://img.shields.io/badge/Code-Github-blue"></a>

- <!--
- <div align="center">
- <img src="https://cdn.wandeer.world/null/dolphin_demo.gif" width="800">
- </div>
- -->

- ## Model Description

- Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.

- ## 📑 Overview

- Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

- 1. **🔍 Stage 1**: Comprehensive page-level layout analysis by generating element sequence in natural reading order
- 2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

- <!-- <div align="center">
- <img src="https://cdn.wandeer.world/null/dolphin_framework.png" width="680">
- </div> -->

- Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.

- ## Model Architecture

- Dolphin is built on a vision-encoder-decoder architecture using transformers:

- - **Vision Encoder**: Based on Swin Transformer for extracting visual features from document images
- - **Text Decoder**: Based on MBart for decoding text from visual features
- - **Prompt-based interface**: Uses natural language prompts to control parsing tasks

- The model is implemented as a Hugging Face `VisionEncoderDecoderModel` for easy integration with the Transformers ecosystem.

- ## Usage

- Our demo will be released in these days. Please keep tuned! 🔥

- Please refer to our [GitHub repository](https://github.com/bytedance/Dolphin) for detailed usage.

- - [Page-wise parsing](https://github.com/bytedance/Dolphin/demo_page_hf.py): for an entire document image
- - [Element-wise parsing](https://github.com/bytedance/Dolphin/demo_element_hf.py): for an element (paragraph, table, formula) image

- ## License

- This model is released under the MIT License.

- ## Citation

- ```bibtex
- @inproceedings{dolphin2025,
-   title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
-   author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
-   year={2025},
-   booktitle={Proceedings of the 65rd Annual Meeting of the Association for Computational Linguistics (ACL)}
- }
  ```

- ## Acknowledgements

- This model builds on several open-source projects including:
- - [Hugging Face Transformers](https://github.com/huggingface/transformers)
- - [Donut](https://github.com/clovaai/donut/)
- - [Nougat](https://github.com/facebookresearch/nougat)
- - [Swin Transformer](https://github.com/microsoft/Swin-Transformer)

  ---

+ # Dolphin OCR Deployment on Hugging Face Inference Toolkit
+
+ This guide provides step-by-step instructions for deploying the **Bytedance Dolphin OCR model** with the **Hugging Face Inference Toolkit** and GPU support.
+
+ ---
+
+ ## 🔹 Prerequisites
+
+ - Docker installed
+ - A GPU on your local machine
+ - A [Hugging Face account](https://huggingface.co/)
+ - Basic familiarity with command-line tools
+
+ ---
+
+ ## 🔢 Step 1: Duplicate the Dolphin Model Repository
+
+ 1. Visit: [https://huggingface.co/spaces/huggingface-projects/repo_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)
+ 2. Enter the source repo, in this case `Bytedance/Dolphin`.
+ 3. Name your new repo: `luquiT4/DolphinInference` (or any name you prefer).
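+
+ Alternatively, the duplication can be done programmatically with `huggingface_hub`. The sketch below is only an option (it downloads the weights locally first, so the web Repo Duplicator is usually faster); the repo names are the ones used in this guide:
+
+ ```python
+ # Sketch: duplicate Bytedance/Dolphin into your own repo with huggingface_hub.
+ # Requires `pip install huggingface_hub` and an authenticated token
+ # (`huggingface-cli login` or the HF_TOKEN environment variable).
+ from huggingface_hub import create_repo, snapshot_download, upload_folder
+
+ src_repo = "Bytedance/Dolphin"
+ dst_repo = "luquiT4/DolphinInference"   # replace with your own namespace/name
+
+ local_dir = snapshot_download(repo_id=src_repo)          # download the full model snapshot
+ create_repo(repo_id=dst_repo, exist_ok=True)             # create the target repo if missing
+ upload_folder(folder_path=local_dir, repo_id=dst_repo)   # push every file to the new repo
+ ```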
+
+ ---
+
+ ## 🔢 Step 2: Add the Handler to the Model Repository
+
+ The documentation mentions that these files provide custom handler and dependency support (https://github.com/huggingface/huggingface-inference-toolkit/#custom-handler-and-dependency-support):
+ - `handler.py` (custom inference handler)
+ - `requirements.txt` (dependencies)
+
+ To add them:
+
+ 1. Add a new file to the new repo:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/67116e3a75abfd0db8e1b154/wlXCsuIQJlMOf-kKG4c0U.png)
+
+ 2. Paste the following handler code:
+
+ ```python
+ import base64
+ import io
+ from typing import Dict, Any
+
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, VisionEncoderDecoderModel
+
+
+ class EndpointHandler:
+     def __init__(self, path=""):
+         # Load processor and model from the provided path or model ID
+         self.processor = AutoProcessor.from_pretrained(path or "bytedance/Dolphin")
+         self.model = VisionEncoderDecoderModel.from_pretrained(path or "bytedance/Dolphin")
+
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         self.model.to(self.device)
+         self.model.eval()
+         self.model = self.model.half()  # Half precision for speed
+
+         self.tokenizer = self.processor.tokenizer
+
+     def decode_base64_image(self, image_base64: str) -> Image.Image:
+         image_bytes = base64.b64decode(image_base64)
+         return Image.open(io.BytesIO(image_bytes)).convert("RGB")
+
+     def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
+         # Check for image input
+         if "inputs" not in data:
+             return {"error": "No inputs provided"}
+
+         image_input = data["inputs"]
+
+         # Support both base64 image strings and raw images (Hugging Face supports both)
+         if isinstance(image_input, str):
+             try:
+                 image = self.decode_base64_image(image_input)
+             except Exception as e:
+                 return {"error": f"Invalid base64 image: {str(e)}"}
+         else:
+             image = image_input  # Assume PIL-compatible image
+
+         # Optional: custom prompt (default: text reading)
+         prompt = data.get("prompt", "Read text in the image.")
+         full_prompt = f"<s>{prompt} <Answer/>"
+
+         # Preprocess inputs
+         inputs = self.processor(image, return_tensors="pt")
+         pixel_values = inputs.pixel_values.half().to(self.device)
+
+         prompt_ids = self.tokenizer(full_prompt, add_special_tokens=False, return_tensors="pt").input_ids.to(self.device)
+         decoder_attention_mask = torch.ones_like(prompt_ids).to(self.device)
+
+         # Inference
+         outputs = self.model.generate(
+             pixel_values=pixel_values,
+             decoder_input_ids=prompt_ids,
+             decoder_attention_mask=decoder_attention_mask,
+             min_length=1,
+             max_length=4096,
+             pad_token_id=self.tokenizer.pad_token_id,
+             eos_token_id=self.tokenizer.eos_token_id,
+             use_cache=True,
+             bad_words_ids=[[self.tokenizer.unk_token_id]],
+             return_dict_in_generate=True,
+             do_sample=False,
+             num_beams=1,
+         )
+
+         sequence = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
+         # Strip the prompt echo and special tokens
+         generated_text = sequence.replace(full_prompt, "").replace("<pad>", "").replace("</s>", "").strip()
+
+         return {"text": generated_text}
+ ```
+
+ This handler was generated with the help of ChatGPT and the following sources:
+ - https://huggingface.co/docs/inference-endpoints/guides/custom_handler (main documentation)
+ - https://github.com/bytedance/Dolphin/blob/master/demo_page_hf.py (Dolphin page-level demo script)
+ - https://github.com/bytedance/Dolphin/blob/master/demo_element_hf.py (Dolphin element-level demo script)
+ - https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/api_server.py (vLLM implementation of Dolphin)
+ - https://huggingface.co/philschmid/donut-base-finetuned-cord-v2/blob/main/handler.py (`handler.py` of a similar model)
+
+ In this case, the deployment works with only `handler.py`; no `requirements.txt` is required.
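+
+ Before building the Docker image, you can optionally smoke-test the handler on your own machine. The snippet below is a minimal sketch: `sample_page.png` is a placeholder file name, and the model weights are downloaded from the Hub on first run.
+
+ ```python
+ # Minimal local smoke test for handler.py (run it from the folder that contains handler.py).
+ # "sample_page.png" is a placeholder; use any document image you have at hand.
+ import base64
+
+ from handler import EndpointHandler
+
+ handler = EndpointHandler(path="luquiT4/DolphinInference")
+
+ with open("sample_page.png", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ result = handler({"inputs": image_b64, "prompt": "Read text in the image."})
+ print(result.get("text", result))
+ ```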
+
+ ---
+
+ ## 🔢 Step 3: Build the Hugging Face Inference Toolkit Docker Image
+
+ 1. Clone the toolkit:
+
+ ```bash
+ git clone https://github.com/huggingface/huggingface-inference-toolkit.git
+ cd huggingface-inference-toolkit
+ ```
+
+ 2. **Important:** If you are on Windows, use **WSL or Linux** to avoid line-ending issues (`^M: bad interpreter`).
+
+ 3. Build the GPU Docker image:
+
+ ```bash
+ make inference-pytorch-gpu
+ # under the hood this runs:
+ # docker build -t integration-test-pytorch:gpu -f docker/Dockerfile.pytorch .
+ ```
+
+ ---
+
+ ## 🔢 Step 4: Run the Inference Server with the Dolphin Model
+
+ ```bash
+ docker run -ti -p 5001:5000 --gpus all \
+   -e HF_MODEL_ID=luquiT4/DolphinInference \
+   -e HF_TASK=image-to-text \
+   integration-test-pytorch:gpu
+ ```
+
+ - `HF_MODEL_ID` = your Hugging Face model repo (the one that contains `handler.py`)
+ - `HF_TASK` = the task type (`image-to-text`)
+
+ ---
+
+ ## 🔢 Step 5: Test the Endpoint
+
+ 1. Send an inference request (note `--data-binary` with a leading `@`, so curl uploads the image bytes rather than the literal path):
+
+ ```bash
+ curl --request POST \
+   --url http://localhost:5001/ \
+   --header 'accept: application/json' \
+   --header 'content-type: application/octet-stream' \
+   --data-binary '@C:\path\to\imagewithtext.png'
  ```
+
+ 2. Enjoy a successful request: the JSON response contains the recognized text under the `text` key.
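+
+ Alternatively, you can POST a JSON payload with a base64-encoded image, which exercises the base64 branch of the handler above. A sketch, assuming the server from Step 4 is listening on `localhost:5001` and the `requests` package is installed:
+
+ ```python
+ # Sketch: send a base64-encoded image as JSON to the local endpoint.
+ # "imagewithtext.png" is a placeholder file name.
+ import base64
+
+ import requests
+
+ with open("imagewithtext.png", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ payload = {"inputs": image_b64, "prompt": "Read text in the image."}
+ response = requests.post("http://localhost:5001/", json=payload)
+ print(response.json())
+ ```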
+
+ ---
+
+ ## 🔢 Step 6 (Coming Soon): Deploy to Azure Serverless Function as an API
+
+ - Use **serverless GPU (NC T4 v3)** for low-cost inference.
+ - Configure **scale-to-zero** in Azure Container Apps to avoid idle GPU charges.
+ - Monitor with Azure budgets and alerts.
+
+ More info:
+ - https://learn.microsoft.com/en-us/azure/container-apps/gpu-image-generation?pivots=azure-portal
+ - https://azure.microsoft.com/en-us/pricing/details/container-apps/?cdn=disable
+ - https://learn.microsoft.com/en-us/azure/container-apps/gpu-serverless-overview
+
+ ---
+
+ ## 🔹 Troubleshooting
+
+ | Issue                       | Solution                                                               |
+ | --------------------------- | ---------------------------------------------------------------------- |
+ | `404 requirements.txt`      | (Optional) Create a `requirements.txt` in your HF model repo           |
+ | `Safetensor HeaderTooLarge` | Duplicate the repo in the cloud with the Hugging Face Repo Duplicator  |
+ | `^M bad interpreter`        | Build the Docker image on WSL or Linux                                 |
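+
+ If you do decide to add a `requirements.txt` (optional here, since the toolkit image already ships with PyTorch and Transformers), a minimal file matching the handler's imports could look like this:
+
+ ```text
+ torch
+ transformers
+ pillow
+ ```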
+
+ ---
+
+ ## 👍 Useful Links
+
+ - Dolphin GitHub: [https://github.com/bytedance/Dolphin](https://github.com/bytedance/Dolphin)
+ - Hugging Face Inference Toolkit: [https://github.com/huggingface/huggingface-inference-toolkit](https://github.com/huggingface/huggingface-inference-toolkit)
+ - Hugging Face Repo Duplicator: [https://huggingface.co/spaces/huggingface-projects/repo_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)
+
+ ---
+
+ You are now ready to deploy and run Dolphin OCR as a custom Hugging Face Inference Endpoint!