
🤗 GitHub | 🤖 Demo | 📑 Technical Report
## Introduction

*Example pages: report | chemistry | paper | handwritten*
Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.
## Key Features
### Effortless End-to-End Processing
- Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward, going directly from a document image to structured output.
- It demonstrates exceptional performance on documents with challenging layouts.
### Advanced Content Recognition
- It accurately recognizes and structures difficult content, including intricate scientific formulas.
- Chemical structures are intelligently identified and can be represented in the standard SMILES format, as sketched below.
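
For instance, a recognized benzene ring would be emitted as the SMILES string `c1ccccc1`. A minimal sketch of validating such output downstream, assuming RDKit (not a dependency of this project) is installed:

```python
# Validate SMILES strings extracted from Logics-Parsing output.
# Requires RDKit: pip install rdkit
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES string."""
    return Chem.MolFromSmiles(smiles) is not None

# Example: a benzene ring recognized from a chemical structure drawing.
print(is_valid_smiles("c1ccccc1"))       # True
print(is_valid_smiles("not-a-molecule")) # False
```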
### Rich, Structured HTML Output
- The model generates a clean HTML representation of the document, preserving its logical structure.
- Each content block (e.g., paragraph, table, figure, formula) is tagged with its category, bounding box coordinates, and OCR text (see the sketch after this list).
- It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.
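
As an illustration, here is a hypothetical post-processing step that walks such an output. The tag and attribute names (`div`, `data-category`, `data-bbox`) are assumptions for the sketch, not the model's documented schema:

```python
# Hypothetical walk over Logics-Parsing HTML output.
# The tag/attribute names below are illustrative assumptions,
# not the model's documented schema.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

sample_html = """
<div data-category="paragraph" data-bbox="34,50,580,120">Logics-Parsing is ...</div>
<div data-category="formula" data-bbox="60,140,540,190">E = mc^2</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for block in soup.find_all("div"):
    category = block.get("data-category")
    bbox = [int(v) for v in block["data-bbox"].split(",")]
    print(category, bbox, block.get_text(strip=True))
```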
### State-of-the-Art Performance
- Logics-Parsing achieves the best performance on our in-house benchmark, which is specifically designed to comprehensively evaluate a model’s parsing capability on complex-layout documents and STEM content.
## Benchmark
Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories. Our model achieves the best performance on this benchmark.

| Model Type | Method | Overall Edit ↓ (EN/ZH) | Text Edit ↓ (EN/ZH) | Formula Edit ↓ (EN/ZH) | Table TEDS ↑ (EN/ZH) | Table Edit ↓ (EN/ZH) | Read Order Edit ↓ (EN/ZH) | Chemistry Edit ↓ | Handwriting Edit ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | doc2x | 0.209 / 0.188 | 0.128 / 0.194 | 0.377 / 0.321 | 81.1 / 85.3 | 0.148 / 0.115 | 0.146 / 0.122 | 1.0 | 0.307 |
| | Textin | 0.153 / 0.158 | 0.132 / 0.190 | 0.185 / 0.223 | 76.7 / 86.3 | 0.176 / 0.113 | 0.118 / 0.104 | 1.0 | 0.344 |
| | mathpix* | 0.128 / 0.146 | 0.128 / 0.152 | 0.06 / 0.142 | 86.2 / 86.6 | 0.120 / 0.127 | 0.204 / 0.164 | 0.552 | 0.263 |
| | PP_StructureV3 | 0.220 / 0.226 | 0.172 / 0.29 | 0.272 / 0.276 | 66 / 71.5 | 0.237 / 0.193 | 0.201 / 0.143 | 1.0 | 0.382 |
| | Mineru2 | 0.212 / 0.245 | 0.134 / 0.195 | 0.280 / 0.407 | 67.5 / 71.8 | 0.228 / 0.203 | 0.205 / 0.177 | 1.0 | 0.387 |
| | Marker | 0.324 / 0.409 | 0.188 / 0.289 | 0.285 / 0.383 | 65.5 / 50.4 | 0.593 / 0.702 | 0.23 / 0.262 | 1.0 | 0.50 |
| | Pix2text | 0.447 / 0.547 | 0.485 / 0.577 | 0.312 / 0.465 | 64.7 / 63.0 | 0.566 / 0.613 | 0.424 / 0.534 | 1.0 | 0.95 |
| Expert VLMs | Dolphin | 0.208 / 0.256 | 0.149 / 0.189 | 0.334 / 0.346 | 72.9 / 60.1 | 0.192 / 0.35 | 0.160 / 0.139 | 0.984 | 0.433 |
| | dots.ocr | 0.186 / 0.198 | 0.115 / 0.169 | 0.291 / 0.358 | 79.5 / 82.5 | 0.172 / 0.141 | 0.165 / 0.123 | 1.0 | 0.255 |
| | MonkeyOcr | 0.193 / 0.259 | 0.127 / 0.236 | 0.262 / 0.325 | 78.4 / 74.7 | 0.186 / 0.294 | 0.197 / 0.180 | 1.0 | 0.623 |
| | OCRFlux | 0.252 / 0.254 | 0.134 / 0.195 | 0.326 / 0.405 | 58.3 / 70.2 | 0.358 / 0.260 | 0.191 / 0.156 | 1.0 | 0.284 |
| | Gotocr | 0.247 / 0.249 | 0.181 / 0.213 | 0.231 / 0.318 | 59.5 / 74.7 | 0.38 / 0.299 | 0.195 / 0.164 | 0.969 | 0.446 |
| | Olmocr | 0.341 / 0.382 | 0.125 / 0.205 | 0.719 / 0.766 | 57.1 / 56.6 | 0.327 / 0.389 | 0.191 / 0.169 | 1.0 | 0.294 |
| | SmolDocling | 0.657 / 0.895 | 0.486 / 0.932 | 0.859 / 0.972 | 18.5 / 1.5 | 0.86 / 0.98 | 0.413 / 0.695 | 1.0 | 0.927 |
| | **Logics-Parsing** | 0.124 / 0.145 | 0.089 / 0.139 | 0.106 / 0.165 | 76.6 / 79.5 | 0.165 / 0.166 | 0.136 / 0.113 | 0.519 | 0.252 |
| General VLMs | Qwen2VL-72B | 0.298 / 0.342 | 0.142 / 0.244 | 0.431 / 0.363 | 64.2 / 55.5 | 0.425 / 0.581 | 0.193 / 0.182 | 0.792 | 0.359 |
| | Qwen2.5VL-72B | 0.233 / 0.263 | 0.162 / 0.24 | 0.251 / 0.257 | 69.6 / 67 | 0.313 / 0.353 | 0.205 / 0.204 | 0.597 | 0.349 |
| | Doubao-1.6 | 0.188 / 0.248 | 0.129 / 0.219 | 0.273 / 0.336 | 74.9 / 69.7 | 0.180 / 0.288 | 0.171 / 0.148 | 0.601 | 0.317 |
| | GPT-5 | 0.242 / 0.373 | 0.119 / 0.36 | 0.398 / 0.456 | 67.9 / 55.8 | 0.26 / 0.397 | 0.191 / 0.28 | 0.88 | 0.46 |
| | Gemini2.5 pro | 0.185 / 0.20 | 0.115 / 0.155 | 0.288 / 0.326 | 82.6 / 80.3 | 0.154 / 0.182 | 0.181 / 0.136 | 0.535 | 0.26 |
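
For reference, the Edit columns are normalized edit distances (lower is better), and TEDS is a tree-edit-distance-based similarity for tables (higher is better). Below is a minimal sketch of a normalized edit distance, assuming the standard Levenshtein formulation; the benchmark's exact normalization may differ:

```python
# Normalized Levenshtein edit distance between predicted and
# ground-truth text, as commonly used for "Edit" metrics
# (lower is better). The benchmark's exact normalization may differ.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_edit(pred: str, gt: str) -> float:
    denom = max(len(pred), len(gt)) or 1
    return edit_distance(pred, gt) / denom

print(normalized_edit("E = mc^2", "E = mc2"))  # 0.125 -> close match
```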
## Quick Start
1. Installation

```bash
conda create -n logics-parsing python=3.10
conda activate logics-parsing

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```
2. Download Model Weights

```bash
# Download the model from ModelScope.
pip install modelscope
python download_model.py -t modelscope

# Or download the model from Hugging Face.
pip install huggingface_hub
python download_model.py -t huggingface
```
3. Inference

```bash
python3 inference.py --image_path PATH_TO_INPUT_IMG --output_path PATH_TO_OUTPUT --model_path PATH_TO_MODEL
```
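
For programmatic use, here is a hypothetical sketch of loading the checkpoint directly with Hugging Face `transformers`. It assumes the released weights follow the Qwen2.5-VL interface and an illustrative prompt string; `inference.py` remains the supported entry point:

```python
# Hypothetical direct-loading sketch. Assumes the checkpoint is
# compatible with the Qwen2.5-VL classes in transformers; the
# supported path is the provided inference.py script.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_PATH = "PATH_TO_MODEL"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

image = Image.open("PATH_TO_INPUT_IMG")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        # Illustrative instruction; the model's trained prompt may differ.
        {"type": "text", "text": "Parse this document page into structured HTML."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
html = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(html)
```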
## Acknowledgments
We would like to acknowledge the following open-source projects that provided inspiration and reference for this work: