luquiT4 committed on
Commit c3b9e2b · verified · 1 Parent(s): f21c632

Update README.md

Files changed (1)
  1. README.md +185 -44
README.md CHANGED
@@ -19,72 +19,213 @@ library_name: transformers
  ---

- # Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

- <a href="https://github.com/bytedance/Dolphin"><img src="https://img.shields.io/badge/Code-Github-blue"></a>

- <!--
- <div align="center">
- <img src="https://cdn.wandeer.world/null/dolphin_demo.gif" width="800">
- </div>
- -->

- ## Model Description

- Dolphin (**Do**cument Image **P**arsing via **H**eterogeneous Anchor Prompt**in**g) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.

- ## 📑 Overview

- Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

- 1. **🔍 Stage 1**: Comprehensive page-level layout analysis by generating element sequence in natural reading order
- 2. **🧩 Stage 2**: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

- <!-- <div align="center">
- <img src="https://cdn.wandeer.world/null/dolphin_framework.png" width="680">
- </div> -->

- Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.

- ## Model Architecture

- Dolphin is built on a vision-encoder-decoder architecture using transformers:

- - **Vision Encoder**: Based on Swin Transformer for extracting visual features from document images
- - **Text Decoder**: Based on MBart for decoding text from visual features
- - **Prompt-based interface**: Uses natural language prompts to control parsing tasks

- The model is implemented as a Hugging Face `VisionEncoderDecoderModel` for easy integration with the Transformers ecosystem.

- ## Usage

- Our demo will be released in these days. Please keep tuned! 🔥

- Please refer to our [GitHub repository](https://github.com/bytedance/Dolphin) for detailed usage.

- - [Page-wise parsing](https://github.com/bytedance/Dolphin/demo_page_hf.py): for an entire document image
- - [Element-wise parsing](https://github.com/bytedance/Dolphin/demo_element_hf.py): for an element (paragraph, table, formula) image

- ## License

- This model is released under the MIT License.

- ## Citation

- ```bibtex
- @inproceedings{dolphin2025,
-   title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
-   author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
-   year={2025},
-   booktitle={Proceedings of the 65rd Annual Meeting of the Association for Computational Linguistics (ACL)}
- }
  ```

- ## Acknowledgements

- This model builds on several open-source projects including:
- - [Hugging Face Transformers](https://github.com/huggingface/transformers)
- - [Donut](https://github.com/clovaai/donut/)
- - [Nougat](https://github.com/facebookresearch/nougat)
- - [Swin Transformer](https://github.com/microsoft/Swin-Transformer)

  ---

+ # Dolphin OCR Deployment on Hugging Face Inference Toolkit
+
+ This guide provides step-by-step instructions for deploying the **Bytedance Dolphin OCR model** with the **Hugging Face Inference Toolkit** and GPU support.
+
+ ---
+
+ ## 🔹 Prerequisites
+
+ - Docker installed
+ - A GPU on your local machine
+ - A [Hugging Face account](https://huggingface.co/)
+ - Basic familiarity with command-line tools
+
+ ---
+
+ ## 🔢 Step 1: Duplicate the Dolphin Model Repository
+
+ 1. Visit: [https://huggingface.co/spaces/huggingface-projects/repo_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)
+ 2. Enter the source repo, in this case `Bytedance/Dolphin`.
+ 3. Name your new repo: `luquiT4/DolphinInference` (or any name you prefer).
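+
+ Alternatively, the duplication can be done programmatically with `huggingface_hub`. The sketch below is only an option (it downloads the weights locally first, so the web Repo Duplicator is usually faster); the repo names are the ones used in this guide:
+
+ ```python
+ # Sketch: duplicate Bytedance/Dolphin into your own repo with huggingface_hub.
+ # Requires `pip install huggingface_hub` and an authenticated token
+ # (`huggingface-cli login` or the HF_TOKEN environment variable).
+ from huggingface_hub import create_repo, snapshot_download, upload_folder
+
+ src_repo = "Bytedance/Dolphin"
+ dst_repo = "luquiT4/DolphinInference"   # replace with your own namespace/name
+
+ local_dir = snapshot_download(repo_id=src_repo)          # download the full model snapshot
+ create_repo(repo_id=dst_repo, exist_ok=True)             # create the target repo if missing
+ upload_folder(folder_path=local_dir, repo_id=dst_repo)   # push every file to the new repo
+ ```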
+
+ ---
+
+ ## 🔢 Step 2: Add the Handler to the Model Repository
+
+ The documentation mentions that these files provide custom handler and dependency support (https://github.com/huggingface/huggingface-inference-toolkit/#custom-handler-and-dependency-support):
+ - `handler.py` (custom inference handler)
+ - `requirements.txt` (dependencies)
+
+ To add them:
+
+ 1. Add a new file to the new repo:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/67116e3a75abfd0db8e1b154/wlXCsuIQJlMOf-kKG4c0U.png)
+
+ 2. Paste the following handler code:
+
+ ```python
+ import base64
+ import io
+ from typing import Dict, Any
+
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, VisionEncoderDecoderModel
+
+
+ class EndpointHandler:
+     def __init__(self, path=""):
+         # Load processor and model from the provided path or model ID
+         self.processor = AutoProcessor.from_pretrained(path or "bytedance/Dolphin")
+         self.model = VisionEncoderDecoderModel.from_pretrained(path or "bytedance/Dolphin")
+
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         self.model.to(self.device)
+         self.model.eval()
+         self.model = self.model.half()  # Half precision for speed
+
+         self.tokenizer = self.processor.tokenizer
+
+     def decode_base64_image(self, image_base64: str) -> Image.Image:
+         image_bytes = base64.b64decode(image_base64)
+         return Image.open(io.BytesIO(image_bytes)).convert("RGB")
+
+     def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
+         # Check for image input
+         if "inputs" not in data:
+             return {"error": "No inputs provided"}
+
+         image_input = data["inputs"]
+
+         # Support both base64 image strings and raw images (Hugging Face supports both)
+         if isinstance(image_input, str):
+             try:
+                 image = self.decode_base64_image(image_input)
+             except Exception as e:
+                 return {"error": f"Invalid base64 image: {str(e)}"}
+         else:
+             image = image_input  # Assume PIL-compatible image
+
+         # Optional: custom prompt (default: text reading)
+         prompt = data.get("prompt", "Read text in the image.")
+         full_prompt = f"<s>{prompt} <Answer/>"
+
+         # Preprocess inputs
+         inputs = self.processor(image, return_tensors="pt")
+         pixel_values = inputs.pixel_values.half().to(self.device)
+
+         prompt_ids = self.tokenizer(full_prompt, add_special_tokens=False, return_tensors="pt").input_ids.to(self.device)
+         decoder_attention_mask = torch.ones_like(prompt_ids).to(self.device)
+
+         # Inference
+         outputs = self.model.generate(
+             pixel_values=pixel_values,
+             decoder_input_ids=prompt_ids,
+             decoder_attention_mask=decoder_attention_mask,
+             min_length=1,
+             max_length=4096,
+             pad_token_id=self.tokenizer.pad_token_id,
+             eos_token_id=self.tokenizer.eos_token_id,
+             use_cache=True,
+             bad_words_ids=[[self.tokenizer.unk_token_id]],
+             return_dict_in_generate=True,
+             do_sample=False,
+             num_beams=1,
+         )
+
+         sequence = self.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
+         # Strip the prompt echo and special tokens
+         generated_text = sequence.replace(full_prompt, "").replace("<pad>", "").replace("</s>", "").strip()
+
+         return {"text": generated_text}
+ ```
+
+ This handler was generated with the help of ChatGPT and the following sources:
+ - https://huggingface.co/docs/inference-endpoints/guides/custom_handler (main documentation)
+ - https://github.com/bytedance/Dolphin/blob/master/demo_page_hf.py (Dolphin page-level demo script)
+ - https://github.com/bytedance/Dolphin/blob/master/demo_element_hf.py (Dolphin element-level demo script)
+ - https://github.com/bytedance/Dolphin/blob/master/deployment/vllm/api_server.py (vLLM implementation of Dolphin)
+ - https://huggingface.co/philschmid/donut-base-finetuned-cord-v2/blob/main/handler.py (`handler.py` of a similar model)
+
+ In this case, the deployment works with only `handler.py`; no `requirements.txt` is required.
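+
+ Before building the Docker image, you can optionally smoke-test the handler on your own machine. The snippet below is a minimal sketch: `sample_page.png` is a placeholder file name, and the model weights are downloaded from the Hub on first run.
+
+ ```python
+ # Minimal local smoke test for handler.py (run it from the folder that contains handler.py).
+ # "sample_page.png" is a placeholder; use any document image you have at hand.
+ import base64
+
+ from handler import EndpointHandler
+
+ handler = EndpointHandler(path="luquiT4/DolphinInference")
+
+ with open("sample_page.png", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ result = handler({"inputs": image_b64, "prompt": "Read text in the image."})
+ print(result.get("text", result))
+ ```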
+
+ ---
+
+ ## 🔢 Step 3: Build the Hugging Face Inference Toolkit Docker Image
+
+ 1. Clone the toolkit:
+
+ ```bash
+ git clone https://github.com/huggingface/huggingface-inference-toolkit.git
+ cd huggingface-inference-toolkit
+ ```
+
+ 2. **Important:** If you are on Windows, use **WSL or Linux** to avoid line-ending issues (`^M: bad interpreter`).
+
+ 3. Build the GPU Docker image:
+
+ ```bash
+ make inference-pytorch-gpu
+ # under the hood this runs:
+ # docker build -t integration-test-pytorch:gpu -f docker/Dockerfile.pytorch .
+ ```
+
+ ---
+
+ ## 🔢 Step 4: Run the Inference Server with the Dolphin Model
+
+ ```bash
+ docker run -ti -p 5001:5000 --gpus all \
+   -e HF_MODEL_ID=luquiT4/DolphinInference \
+   -e HF_TASK=image-to-text \
+   integration-test-pytorch:gpu
+ ```
+
+ - `HF_MODEL_ID` = your Hugging Face model repo (the one that contains `handler.py`)
+ - `HF_TASK` = the task type (`image-to-text`)
+
+ ---
+
+ ## 🔢 Step 5: Test the Endpoint
+
+ 1. Send an inference request (note `--data-binary` with a leading `@`, so curl uploads the image bytes rather than the literal path):
+
+ ```bash
+ curl --request POST \
+   --url http://localhost:5001/ \
+   --header 'accept: application/json' \
+   --header 'content-type: application/octet-stream' \
+   --data-binary '@C:\path\to\imagewithtext.png'
  ```
+
+ 2. Enjoy a successful request: the JSON response contains the recognized text under the `text` key.
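+
+ Alternatively, you can POST a JSON payload with a base64-encoded image, which exercises the base64 branch of the handler above. A sketch, assuming the server from Step 4 is listening on `localhost:5001` and the `requests` package is installed:
+
+ ```python
+ # Sketch: send a base64-encoded image as JSON to the local endpoint.
+ # "imagewithtext.png" is a placeholder file name.
+ import base64
+
+ import requests
+
+ with open("imagewithtext.png", "rb") as f:
+     image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+ payload = {"inputs": image_b64, "prompt": "Read text in the image."}
+ response = requests.post("http://localhost:5001/", json=payload)
+ print(response.json())
+ ```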
+
+ ---
+
+ ## 🔢 Step 6 (Coming Soon): Deploy to Azure Serverless Function as an API
+
+ - Use **serverless GPU (NC T4 v3)** for low-cost inference.
+ - Configure **scale-to-zero** in Azure Container Apps to avoid idle GPU charges.
+ - Monitor with Azure budgets and alerts.
+
+ More info:
+ - https://learn.microsoft.com/en-us/azure/container-apps/gpu-image-generation?pivots=azure-portal
+ - https://azure.microsoft.com/en-us/pricing/details/container-apps/?cdn=disable
+ - https://learn.microsoft.com/en-us/azure/container-apps/gpu-serverless-overview
+
+ ---
+
+ ## 🔹 Troubleshooting
+
+ | Issue                       | Solution                                                               |
+ | --------------------------- | ---------------------------------------------------------------------- |
+ | `404 requirements.txt`      | (Optional) Create a `requirements.txt` in your HF model repo           |
+ | `Safetensor HeaderTooLarge` | Duplicate the repo in the cloud with the Hugging Face Repo Duplicator  |
+ | `^M bad interpreter`        | Build the Docker image on WSL or Linux                                 |
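+
+ If you do decide to add a `requirements.txt` (optional here, since the toolkit image already ships with PyTorch and Transformers), a minimal file matching the handler's imports could look like this:
+
+ ```text
+ torch
+ transformers
+ pillow
+ ```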
+
+ ---
+
+ ## 👍 Useful Links
+
+ - Dolphin GitHub: [https://github.com/bytedance/Dolphin](https://github.com/bytedance/Dolphin)
+ - Hugging Face Inference Toolkit: [https://github.com/huggingface/huggingface-inference-toolkit](https://github.com/huggingface/huggingface-inference-toolkit)
+ - Hugging Face Repo Duplicator: [https://huggingface.co/spaces/huggingface-projects/repo_duplicator](https://huggingface.co/spaces/huggingface-projects/repo_duplicator)
+
+ ---
+
+ You are now ready to deploy and run Dolphin OCR as a custom Hugging Face Inference Endpoint!