dots.vlm1
🤗 Hugging Face | 📄 Blog | 🔗 GitHub
🖥️ Demo | 💬 WeChat (微信) | 📕 rednote
Visit our Hugging Face page (links above) or try dots.vlm1 in the live demo. Enjoy!
1. Introduction
We are excited to introduce dots.vlm1, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), dots.vlm1 demonstrates strong multimodal understanding and reasoning capabilities.
Model Highlights:
- NaViT Vision Encoder: Trained entirely from scratch rather than fine-tuned from an existing vision backbone. It natively supports dynamic resolution and adds pure visual supervision on top of traditional text supervision, raising the upper bound of its perceptual capacity. Beyond image-captioning datasets, a large amount of structured image data was introduced during pretraining to strengthen perception, particularly for tasks such as OCR.
- Multimodal Training Data: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.
Through large-scale pretraining and carefully tuned post-training, dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning, setting a new ceiling for open-source vision-language models while remaining competitive on pure-text tasks.
Special thanks to the DeepSeek team for the excellent DeepSeek V3 model.
2. Performance
| Category | Benchmark | Qwen2.5VL-72B | Gemini 2.5 Pro | Seed-VL1.5 thinking | dots.vlm1 |
|---|---|---|---|---|---|
| STEM/Reasoning | MMMU | 69.3 | 84.22 | 79.89 | 80.11 |
| | MMMU_pro | 51.91 | 76.5 | 68.9 | 70.11 |
| | MathVision | 39.4 | 72.34 | 68.77 | 69.64 |
| | MathVista | 74.6 | 83.5 | 86.1 | 85.0 |
| | ZeroBench | 2 | 5 | 2 | 4 |
| | ZeroBench-sub | 20 | 30.24 | 25.75 | 26.65 |
| | VisuLogic | 25.6 | 29.8 | 35.9 | 32.2 |
| General Visual | MMBench-CN | 88.2 | 89 | 89.78 | 88.24 |
| | MMBench-EN | 89.2 | 89.55 | 89.47 | 89.32 |
| | MMStar | 71.13 | 78.73 | 78.33 | 76.67 |
| | RealWorldQA | 75.9 | 78.43 | 78.69 | 79.08 |
| | Vibe (GPT-4o) | 60.13 | 76.39 | 68.59 | 69.24 |
| | M3GIA (CN) | 88.24 | 89.54 | 91.2 | 90.85 |
| | SimpleVQA_ds | 52.19 | 57.09 | 61.34 | 55.8 |
| | MMVP | 66 | 67.33 | 73.33 | 72 |
| | HallusionBench | 56.5 | 63.07 | 63.49 | 64.83 |
| | CVBench | 84.15 | 85.36 | 89.68 | 85.65 |
| | BLINK | 61.7 | 71.86 | 72.38 | 66.33 |
| OCR/Doc/Chart | CharXiv (DQ) | 88.2 | 90.3 | 89.6 | 92.1 |
| | CharXiv (RQ) | 48.5 | 68.3 | 63.4 | 64.4 |
| | OCRReasoning | 38.02 | 70.81 | 63.42 | 66.23 |
| | DocVQA | 96.23 | 95.42 | 93.65 | 96.52 |
| | ChartQA | 86.1 | 86.16 | 86.88 | 87.68 |
| | OCRBench V1 | 87.1 | 86.6 | 86.7 | 82.3 |
| | AI2D | 88.3 | 91.03 | 89.05 | 88.37 |
| Grounding/Counting | RefCOCO | 90.3 | 74.6 | 91.3 | 90.45 |
| | CountBench | 92.4 | 91.79 | 89 | 91.99 |
| Multi-Image | Muir | 69.38 | 70.5 | 79.77 | 78.58 |
| | Mantis | 79.26 | 84.33 | 82.3 | 86.18 |
| Category | Benchmark | DeepSeek-R1-0528 | Qwen3-235B-A22B | Qwen3-235B-A22B-think-2507 | dots.vlm1 |
|---|---|---|---|---|---|
| Text | LiveCodeBench | 73.3 | 70.7 | 78.4 | 72.94 |
| | AIME 2025 | 87.5 | 82.6 | 92.3 | 85.83 |
| | GPQA | 81 | 70.7 | 81.1 | 72.78 |
3. Usage
Environment Setup
You have two options to set up the environment:
Option 1: Using Base Image + Manual Installation
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126
# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# NOTE: This installation must be done on EVERY node in your cluster
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
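If you go with Option 1, a quick way to confirm that the editable install is the one Python resolves on each node is to import the package and check its version and path. A minimal sketch (the exact version string depends on the dots.vlm1.v1 branch):

```python
# Sanity check, run inside the container on every node after `pip install -e`.
# Confirms Python imports the custom SGLang build rather than a preinstalled one.
import sglang

print(sglang.__version__)  # version of the installed branch (based on the v0.4.9.post1 line)
print(sglang.__file__)     # should point into the cloned ./sglang/python directory
```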
Option 2: Using Pre-built Image (Recommended)
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
Multi-Node Deployment
Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:
Prerequisites:
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)
Node 1 (Master - rank 0):
# Recommend downloading model locally to avoid timeout during startup
# Use: huggingface-cli download rednote-hilab/dots.vlm1.inst --local-dir ./dots.vlm1.inst
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst" # or local path like ./dots.vlm1.inst
# Get actual IP address: hostname -I | awk '{print $1}' or ip route get 1 | awk '{print $7}'
export MASTER_IP="10.0.0.1" # Replace with actual master node IP
export API_PORT=15553
python3 -m sglang.launch_server \
--model-path $HF_MODEL_PATH \
--tp 16 \
--dist-init-addr $MASTER_IP:23456 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--host 0.0.0.0 \
--port $API_PORT \
--context-length 65536 \
--max-running-requests 64 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--chunked-prefill-size -1 \
--chat-template dots-vlm \
--cuda-graph-max-bs 64 \
--quantization fp8
Node 2 (Worker - rank 1):
# Use the same variables as defined in Node 1
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"
export MASTER_IP="10.0.0.1" # Must match Node 1
export API_PORT=15553
python3 -m sglang.launch_server \
--model-path $HF_MODEL_PATH \
--tp 16 \
--dist-init-addr $MASTER_IP:23456 \
--nnodes 2 \
--node-rank 1 \
--trust-remote-code \
--host 0.0.0.0 \
--port $API_PORT \
--context-length 65536 \
--max-running-requests 64 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--chunked-prefill-size -1 \
--chat-template dots-vlm \
--cuda-graph-max-bs 64 \
--quantization fp8
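Once both ranks are up, the master node serves the API on `$API_PORT`. A small readiness probe against the model-listing endpoint can confirm the cluster has finished loading before you send real traffic (a sketch assuming the example `MASTER_IP`/`API_PORT` values above; `/v1/models` is the usual OpenAI-compatible path, not anything specific to dots.vlm1):

```python
# Readiness probe: list the served models from the master node's
# OpenAI-compatible endpoint. The server only starts answering once
# startup (including weight loading on both ranks) has completed.
import json
import urllib.request

MASTER_IP = "10.0.0.1"  # must match the MASTER_IP exported on the nodes
API_PORT = 15553        # must match API_PORT

url = f"http://{MASTER_IP}:{API_PORT}/v1/models"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))
```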
Configuration Parameters
Key parameters:
- `--tp 16`: Tensor parallelism across 16 GPUs in total, split across the two nodes
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for dots.vlm1
API Usage
Once the servers are launched, you can access the model through an OpenAI-compatible API:
# Use the same MASTER_IP and API_PORT as defined above
curl -X POST http://$MASTER_IP:$API_PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Please briefly describe this image"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 55000
}'
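The same request can be sent from Python with the official `openai` client by pointing its `base_url` at the master node. A sketch under the same `MASTER_IP`/`API_PORT` assumptions as above; the placeholder `api_key` is accepted as long as no `--api-key` was set at launch:

```python
# Python equivalent of the curl example, using the OpenAI-compatible API
# exposed by the SGLang servers launched above.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.0.1:15553/v1",  # http://$MASTER_IP:$API_PORT/v1
    api_key="EMPTY",                      # placeholder; no --api-key configured at launch
)

response = client.chat.completions.create(
    model="model",  # same placeholder model name as in the curl example
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    temperature=0.1,
    top_p=0.9,
    max_tokens=55000,
)

print(response.choices[0].message.content)
```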