---
license: mit
language:
  - zh
  - en
tags:
  - document-parsing
  - document-understanding
  - document-intelligence
  - ocr
  - layout-analysis
  - table-extraction
  - multimodal
  - vision-language-model
datasets:
  - custom
pipeline_tag: image-text-to-text
library_name: transformers
---


# Dolphin OCR Deployment on Hugging Face Inference Toolkit

This guide provides step-by-step instructions to deploy the **Bytedance Dolphin OCR model** using the **Hugging Face Inference Toolkit** with GPU support.

---

## 🔹 Prerequisites

- Docker installed
- A GPU on your local machine
- A [Hugging Face account](https://huggingface.co/)
- Basic familiarity with command-line tools

---

## 🔒 Step 1: Duplicate the Dolphin Model Repository

1. Visit: [https://huggingface.co/spaces/huggingface-projects/repo\_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)
2. Enter the source repo, in this case `Bytedance/Dolphin`.
3. Name your new repo: `luquiT4/DolphinInference` (or any name you prefer).
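If you prefer to do this from Python instead of the web UI, a rough equivalent with `huggingface_hub` is sketched below. The target repo name is only an example, it assumes you are already authenticated (e.g. via `huggingface-cli login`), and note that it downloads the weights locally first:

```python
# Sketch: duplicate Bytedance/Dolphin into your own namespace with huggingface_hub.
# Assumes you are logged in (e.g. `huggingface-cli login`); the target repo name
# below is only an example.
from huggingface_hub import create_repo, snapshot_download, upload_folder

target_repo = "luquiT4/DolphinInference"  # replace with your own namespace/name

# Download the full model snapshot to a local folder, then push it to the new repo
local_dir = snapshot_download("Bytedance/Dolphin", local_dir="dolphin_snapshot")
create_repo(target_repo, repo_type="model", exist_ok=True)
upload_folder(repo_id=target_repo, folder_path=local_dir, repo_type="model")
```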

---

## 🔒 Step 2: Add the Handler to the Model Repository


The documentation (https://github.com/huggingface/huggingface-inference-toolkit/#custom-handler-and-dependency-support) mentions that these files provide custom handler and dependency support:
- `handler.py` (custom inference handler)
- `requirements.txt` (dependencies)

To add them:

1. Create a new file named `handler.py` in the new repo:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/67116e3a75abfd0db8e1b154/wlXCsuIQJlMOf-kKG4c0U.png)

2. Paste the following into it:

```python
import base64
import io
from typing import Dict, Any

import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel


class EndpointHandler:
    def __init__(self, path=""):
        # Load processor and model from the provided path or model ID
        self.processor = AutoProcessor.from_pretrained(path or "bytedance/Dolphin")
        self.model = VisionEncoderDecoderModel.from_pretrained(path or "bytedance/Dolphin")

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        # Half precision speeds up GPU inference; keep full precision on CPU
        if self.device.type == "cuda":
            self.model = self.model.half()

        self.tokenizer = self.processor.tokenizer

    def decode_base64_image(self, image_base64: str) -> Image.Image:
        image_bytes = base64.b64decode(image_base64)
        return Image.open(io.BytesIO(image_bytes)).convert("RGB")

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Check for image input
        if "inputs" not in data:
            return {"error": "No inputs provided"}

        image_input = data["inputs"]

        # Support both base64 image strings and raw images (Hugging Face supports both)
        if isinstance(image_input, str):
            try:
                image = self.decode_base64_image(image_input)
            except Exception as e:
                return {"error": f"Invalid base64 image: {str(e)}"}
        else:
            image = image_input  # Assume PIL-compatible image

        # Optional: Custom prompt (default: text reading)
        prompt = data.get("prompt", "Read text in the image.")
        full_prompt = f"<s>{prompt} <Answer/>"

        # Preprocess inputs
        inputs = self.processor(image, return_tensors="pt")
        pixel_values = inputs.pixel_values.to(self.device)
        if self.device.type == "cuda":
            pixel_values = pixel_values.half()

        prompt_ids = self.tokenizer(full_prompt, add_special_tokens=False, return_tensors="pt").input_ids.to(self.device)
        decoder_attention_mask = torch.ones_like(prompt_ids).to(self.device)

        # Inference
        outputs = self.model.generate(
            pixel_values=pixel_values,
            decoder_input_ids=prompt_ids,
            decoder_attention_mask=decoder_attention_mask,
            min_length=1,
            max_length=4096,
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[self.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
            do_sample=False,
            num_beams=1,
        )

        sequence = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
        # Clean up
        generated_text = sequence.replace(full_prompt, "").replace("<pad>", "").replace("</s>", "").strip()

        return {"text": generated_text}
```
This handler was drafted with the help of ChatGPT, based on these sources:
- https://huggingface.co/docs/inference-endpoints/guides/custom_handler (main documentation)
- https://github.com/bytedance/Dolphin/blob/master/demo_page_hf.py (Dolphin page-level demo script)
- https://github.com/bytedance/Dolphin/blob/master/demo_element_hf.py (Dolphin element-level demo script)
- https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/api_server.py (vLLM implementation of Dolphin)
- https://huggingface.co/philschmid/donut-base-finetuned-cord-v2/blob/main/handler.py (`handler.py` of a similar model)


In this case, the endpoint works with only `handler.py`; no `requirements.txt` is needed.
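Before pushing the handler, you can optionally smoke-test it locally. The sketch below is illustrative (the local test image `sample_page.png` is an assumption); it instantiates `EndpointHandler` directly and feeds it a base64-encoded image, mirroring the payload the endpoint will receive:

```python
# Local smoke test for handler.py (hypothetical script, run from the repo folder).
# Assumes the Dolphin weights can be downloaded from the Hub and that a test
# image named sample_page.png exists next to this script.
import base64

from handler import EndpointHandler

handler = EndpointHandler(path="bytedance/Dolphin")

with open("sample_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

result = handler({"inputs": image_b64, "prompt": "Read text in the image."})
print(result["text"])
```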

---

## 🔒 Step 3: Build the Hugging Face Inference Toolkit Docker Image

1. Clone the toolkit:

```bash
git clone https://github.com/huggingface/huggingface-inference-toolkit.git
cd huggingface-inference-toolkit
```

2. **Important:** If you are on Windows, use **WSL or Linux** to avoid line-ending issues (`^M: bad interpreter`).

3. Build the GPU Docker image:

```bash
make inference-pytorch-gpu
# under the hood this runs:
# docker build -t integration-test-pytorch:gpu -f docker/Dockerfile.pytorch .
```

---

## 🔒 Step 4: Run the Inference Server with Dolphin Model

```bash
docker run -ti -p 5001:5000 --gpus all \
  -e HF_MODEL_ID=luquiT4/DolphinInference \
  -e HF_TASK=image-to-text \
  integration-test-pytorch:gpu
```

- `HF_MODEL_ID` = the Hugging Face model repo you created in Step 1
- `HF_TASK` = the task type (`image-to-text`)
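Once the container reports that the model is loaded, you can check that the server is reachable. A minimal sketch follows; the `/health` route is provided by recent versions of the toolkit, but treat the exact path as an assumption and fall back to a plain GET on `/` if needed:

```python
# Liveness check for the local inference server (sketch).
# The /health route is an assumption based on recent toolkit versions.
import requests

resp = requests.get("http://localhost:5001/health", timeout=10)
print(resp.status_code)  # expect 200 once the model has loaded
```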

---

## 🔒 Step 5: Test the Endpoint

1. Send an inference request:

```bash
curl --request POST \
  --url http://localhost:5001/ \
  --header 'accept: application/json' \
  --header 'content-type: application/octet-stream' \
  --data-binary '@C:\path\to\imagewithtext.png'
```

2. Enjoy a successful response.
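Alternatively, you can send a JSON payload with a base64-encoded image and a custom prompt, which matches the input format the handler above accepts (this sketch assumes the toolkit forwards the JSON body to the custom handler unchanged):

```python
# Sketch: call the endpoint with a JSON payload instead of raw image bytes.
# The "inputs"/"prompt" fields mirror what handler.py expects.
import base64

import requests

with open("imagewithtext.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {"inputs": image_b64, "prompt": "Read text in the image."}
resp = requests.post("http://localhost:5001/", json=payload, timeout=300)
print(resp.json())  # expected: {"text": "..."}
```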

---

## 🔒 Step 6 (Coming Soon): Deploy to Azure Serverless Function as an API

- Use **serverless GPU (NC T4 v3)** for low-cost inference.
- Configure **scale-to-zero** in Azure Container Apps to avoid idle GPU charges.
- Monitor with Azure budgets and alerts.


More info:
- https://learn.microsoft.com/en-us/azure/container-apps/gpu-image-generation?pivots=azure-portal
- https://azure.microsoft.com/en-us/pricing/details/container-apps/?cdn=disable
- https://learn.microsoft.com/en-us/azure/container-apps/gpu-serverless-overview

---

## 🔹 Troubleshooting

| Issue                       | Solution                                                       |
| --------------------------- | -------------------------------------------------------------- |
| `404 requirements.txt`      | (Optional) Create a `requirements.txt` in your HF model repo  |
| `Safetensor HeaderTooLarge` | Clone the repo on the cloud using Hugging Face Repo Duplicator |
| `^M bad interpreter`        | Build Docker image on WSL or Linux                             |

---

## 👍 Useful Links

- Dolphin GitHub: [https://github.com/bytedance/Dolphin](https://github.com/bytedance/Dolphin)
- Hugging Face Inference Toolkit: [https://github.com/huggingface/huggingface-inference-toolkit](https://github.com/huggingface/huggingface-inference-toolkit)
- Hugging Face Repo Duplicator: [https://huggingface.co/spaces/huggingface-projects/repo\_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)

---

You are now ready to deploy and run Dolphin OCR as a custom Hugging Face Inference Endpoint!