dots.vlm1
🤗 Hugging Face | 📄 Blog | 🔗 GitHub
🖥️ Demo | 💬 WeChat (微信) | 📕 rednote
Visit our Hugging Face page (links above) or try dots.vlm1 in the live demo. Enjoy!
1. Introduction
We are excited to introduce dots.vlm1, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), dots.vlm1 demonstrates strong multimodal understanding and reasoning capabilities.
Model Highlights:
- NaViT Vision Encoder: Trained entirely from scratch rather than fine-tuned from an existing vision backbone. It natively supports dynamic resolution and adds pure visual supervision on top of traditional text supervision, raising the upper bound of its perceptual capacity. Beyond image-captioning datasets, a large amount of structured image data was introduced during pretraining to strengthen perception, particularly for tasks such as OCR.
- Multimodal Training Data: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.
Through large-scale pretraining and carefully tuned post-training, dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning, setting a new ceiling for open-source vision-language models while remaining competitive on pure-text tasks.
Special thanks to the DeepSeek team for the excellent DeepSeek V3 model.
2. Performance
| Category | Benchmark | Qwen2.5VL-72B | Gemini 2.5 Pro | Seed-VL1.5 thinking | dots.vlm1 |
|---|---|---|---|---|---|
| STEM/Reasoning | MMMU | 69.3 | 84.22 | 79.89 | 80.11 |
| | MMMU_pro | 51.91 | 76.5 | 68.9 | 70.11 |
| | MathVision | 39.4 | 72.34 | 68.77 | 69.64 |
| | MathVista | 74.6 | 83.5 | 86.1 | 85.0 |
| | ZeroBench | 2 | 5 | 2 | 4 |
| | ZeroBench-sub | 20 | 30.24 | 25.75 | 26.65 |
| | VisuLogic | 25.6 | 29.8 | 35.9 | 32.2 |
| General Visual | MMBench-CN | 88.2 | 89 | 89.78 | 88.24 |
| | MMBench-EN | 89.2 | 89.55 | 89.47 | 89.32 |
| | MMStar | 71.13 | 78.73 | 78.33 | 76.67 |
| | RealWorldQA | 75.9 | 78.43 | 78.69 | 79.08 |
| | Vibe (GPT-4o) | 60.13 | 76.39 | 68.59 | 69.24 |
| | M3GIA (CN) | 88.24 | 89.54 | 91.2 | 90.85 |
| | SimpleVQA_ds | 52.19 | 57.09 | 61.34 | 55.8 |
| | MMVP | 66 | 67.33 | 73.33 | 72 |
| | HallusionBench | 56.5 | 63.07 | 63.49 | 64.83 |
| | CVBench | 84.15 | 85.36 | 89.68 | 85.65 |
| | BLINK | 61.7 | 71.86 | 72.38 | 66.33 |
| OCR/Doc/Chart | CharXiv (DQ) | 88.2 | 90.3 | 89.6 | 92.1 |
| | CharXiv (RQ) | 48.5 | 68.3 | 63.4 | 64.4 |
| | OCRReasoning | 38.02 | 70.81 | 63.42 | 66.23 |
| | DocVQA | 96.23 | 95.42 | 93.65 | 96.52 |
| | ChartQA | 86.1 | 86.16 | 86.88 | 87.68 |
| | OCRBench V1 | 87.1 | 86.6 | 86.7 | 82.3 |
| | AI2D | 88.3 | 91.03 | 89.05 | 88.37 |
| Grounding/Counting | RefCOCO | 90.3 | 74.6 | 91.3 | 90.45 |
| | CountBench | 92.4 | 91.79 | 89 | 91.99 |
| Multi-Image | Muir | 69.38 | 70.5 | 79.77 | 78.58 |
| | Mantis | 79.26 | 84.33 | 82.3 | 86.18 |
| Category | Benchmark | DeepSeek-R1-0528 | Qwen3-235B-A22B | Qwen3-235B-A22B-think-2507 | dots.vlm1 |
|---|---|---|---|---|---|
| Text | LiveCodeBench | 73.3 | 70.7 | 78.4 | 72.94 |
| | AIME 2025 | 87.5 | 82.6 | 92.3 | 85.83 |
| | GPQA | 81 | 70.7 | 81.1 | 72.78 |
3. Usage
Environment Setup
You have two options to set up the environment:
Option 1: Using Base Image + Manual Installation
# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126
# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# NOTE: This installation must be done on EVERY node in your cluster
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
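If you go with Option 1, a quick way to confirm that the editable install is the one Python resolves on each node is to import the package and check its version and path. A minimal sketch (the exact version string depends on the dots.vlm1.v1 branch):

```python
# Sanity check, run inside the container on every node after `pip install -e`.
# Confirms Python imports the custom SGLang build rather than a preinstalled one.
import sglang

print(sglang.__version__)  # version of the installed branch (based on the v0.4.9.post1 line)
print(sglang.__file__)     # should point into the cloned ./sglang/python directory
```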
Option 2: Using Pre-built Image (Recommended)
# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
Multi-Node Deployment
Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:
Prerequisites:
- Model: `rednote-hilab/dots.vlm1.inst`
- Node 1 IP: `10.0.0.1` (master node)
- Node 2 IP: `10.0.0.2` (worker node)
Node 1 (Master - rank 0):
# Recommend downloading model locally to avoid timeout during startup
# Use: huggingface-cli download rednote-hilab/dots.vlm1.inst --local-dir ./dots.vlm1.inst
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst" # or local path like ./dots.vlm1.inst
# Get actual IP address: hostname -I | awk '{print $1}' or ip route get 1 | awk '{print $7}'
export MASTER_IP="10.0.0.1" # Replace with actual master node IP
export API_PORT=15553
python3 -m sglang.launch_server \
--model-path $HF_MODEL_PATH \
--tp 16 \
--dist-init-addr $MASTER_IP:23456 \
--nnodes 2 \
--node-rank 0 \
--trust-remote-code \
--host 0.0.0.0 \
--port $API_PORT \
--context-length 65536 \
--max-running-requests 64 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--chunked-prefill-size -1 \
--chat-template dots-vlm \
--cuda-graph-max-bs 64 \
--quantization fp8
Node 2 (Worker - rank 1):
# Use the same variables as defined in Node 1
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"
export MASTER_IP="10.0.0.1" # Must match Node 1
export API_PORT=15553
python3 -m sglang.launch_server \
--model-path $HF_MODEL_PATH \
--tp 16 \
--dist-init-addr $MASTER_IP:23456 \
--nnodes 2 \
--node-rank 1 \
--trust-remote-code \
--host 0.0.0.0 \
--port $API_PORT \
--context-length 65536 \
--max-running-requests 64 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--chunked-prefill-size -1 \
--chat-template dots-vlm \
--cuda-graph-max-bs 64 \
--quantization fp8
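Once both ranks are up, the master node serves the API on `$API_PORT`. A small readiness probe against the model-listing endpoint can confirm the cluster has finished loading before you send real traffic (a sketch assuming the example `MASTER_IP`/`API_PORT` values above; `/v1/models` is the usual OpenAI-compatible path, not anything specific to dots.vlm1):

```python
# Readiness probe: list the served models from the master node's
# OpenAI-compatible endpoint. The server only starts answering once
# startup (including weight loading on both ranks) has completed.
import json
import urllib.request

MASTER_IP = "10.0.0.1"  # must match the MASTER_IP exported on the nodes
API_PORT = 15553        # must match API_PORT

url = f"http://{MASTER_IP}:{API_PORT}/v1/models"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))
```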
Configuration Parameters
Key parameters:
- `--tp 16`: Tensor parallelism across 16 GPUs in total, split across the two nodes
- `--nnodes 2`: Total number of nodes in the cluster
- `--node-rank`: Node identifier (0 for master, 1+ for workers)
- `--context-length 65536`: Maximum context length
- `--quantization fp8`: Use FP8 quantization for efficiency
- `--chat-template dots-vlm`: Use the custom chat template for dots.vlm1
API Usage
Once the servers are launched, you can access the model through an OpenAI-compatible API:
# Use the same MASTER_IP and API_PORT as defined above
curl -X POST http://$MASTER_IP:$API_PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "model",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Please briefly describe this image"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 55000
}'
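The same request can be sent from Python with the official `openai` client by pointing its `base_url` at the master node. A sketch under the same `MASTER_IP`/`API_PORT` assumptions as above; the placeholder `api_key` is accepted as long as no `--api-key` was set at launch:

```python
# Python equivalent of the curl example, using the OpenAI-compatible API
# exposed by the SGLang servers launched above.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.0.1:15553/v1",  # http://$MASTER_IP:$API_PORT/v1
    api_key="EMPTY",                      # placeholder; no --api-key configured at launch
)

response = client.chat.completions.create(
    model="model",  # same placeholder model name as in the curl example
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
    temperature=0.1,
    top_p=0.9,
    max_tokens=55000,
)

print(response.choices[0].message.content)
```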