dots.vlm1

  🤗 Hugging Face   |    📄 Blog    |    🔗 GitHub  
🖥️ Demo   |   💬 WeChat (微信)   |   📕 rednote  

Visit our Hugging Face (click links above) or check out our live demo to try dots.vlm1! Enjoy!

1. Introduction

We are excited to introduce dots.vlm1, the first vision-language model in the dots model family. Built upon a 1.2 billion-parameter vision encoder and the DeepSeek V3 large language model (LLM), dots.vlm1 demonstrates strong multimodal understanding and reasoning capabilities.

Model Highlights:

  • NaViT Vision Encoder: Trained entirely from scratch rather than fine-tuned from an existing vision backbone. It natively supports dynamic resolution and combines pure visual supervision with traditional text supervision, raising the upper bound of its perceptual capacity. Beyond image-caption datasets, a large amount of structured image data was introduced during pretraining to further strengthen perception, particularly for tasks such as OCR.
  • Multimodal Training Data: In addition to conventional approaches, dots.vlm1 leverages a wide range of synthetic data strategies to cover diverse image types (e.g., tables, charts, documents, graphics) and descriptions (e.g., alt text, dense captions, grounding annotations). Furthermore, a strong multimodal model was used to rewrite web page data with interleaved text and images, significantly improving the quality of the training corpus.

Through large-scale pretraining and carefully tuned post-training, dots.vlm1 achieves near state-of-the-art performance in both visual perception and reasoning, setting a new performance ceiling for open-source vision-language models—while still maintaining competitive capabilities in pure-text tasks.

Special thanks to the DeepSeek team for the excellent DeepSeek V3 model.

2. Performance

| Category | Benchmark | Qwen2.5VL-72B | Gemini2.5 Pro | Seed-VL1.5 thinking | dots.vlm1 |
|---|---|---|---|---|---|
| STEM/Reasoning | MMMU | 69.3 | 84.22 | 79.89 | 80.11 |
| | MMMU_pro | 51.91 | 76.5 | 68.9 | 70.11 |
| | MathVision | 39.4 | 72.34 | 68.77 | 69.64 |
| | MathVista | 74.6 | 83.5 | 86.1 | 85.0 |
| | ZeroBench | 2 | 5 | 2 | 4 |
| | ZeroBench-sub | 20 | 30.24 | 25.75 | 26.65 |
| | VisuLogic | 25.6 | 29.8 | 35.9 | 32.2 |
| General Visual | MMbench-CN | 88.2 | 89 | 89.78 | 88.24 |
| | MMbench-EN | 89.2 | 89.55 | 89.47 | 89.32 |
| | MMStar | 71.13 | 78.73 | 78.33 | 76.67 |
| | RealWorldQA | 75.9 | 78.43 | 78.69 | 79.08 |
| | Vibe(GPT4o) | 60.13 | 76.39 | 68.59 | 69.24 |
| | m3gia(cn) | 88.24 | 89.54 | 91.2 | 90.85 |
| | SimpleVQA_ds | 52.19 | 57.09 | 61.34 | 55.8 |
| | MMVP | 66 | 67.33 | 73.33 | 72 |
| | HallusionBench | 56.5 | 63.07 | 63.49 | 64.83 |
| | CVBench | 84.15 | 85.36 | 89.68 | 85.65 |
| | Blink | 61.7 | 71.86 | 72.38 | 66.33 |
| OCR/Doc/Chart | charxiv(dq) | 88.2 | 90.3 | 89.6 | 92.1 |
| | charxiv(rq) | 48.5 | 68.3 | 63.4 | 64.4 |
| | OCRReasoning | 38.02 | 70.81 | 63.42 | 66.23 |
| | DOCVQA | 96.23 | 95.42 | 93.65 | 96.52 |
| | ChartQA | 86.1 | 86.16 | 86.88 | 87.68 |
| | OCRBenchV1 | 87.1 | 86.6 | 86.7 | 82.3 |
| | AI2D | 88.3 | 91.03 | 89.05 | 88.37 |
| Grounding/Counting | RefCOCO | 90.3 | 74.6 | 91.3 | 90.45 |
| | CountBench | 92.4 | 91.79 | 89 | 91.99 |
| Multi Image | muir | 69.38 | 70.5 | 79.77 | 78.58 |
| | mantis | 79.26 | 84.33 | 82.3 | 86.18 |

| Category | Benchmark | Deepseek-R1-0528 | Qwen3-235B-A22B | Qwen3-235B-A22B-think-2507 | dots.vlm1 |
|---|---|---|---|---|---|
| Text | LiveCodeBench | 73.3 | 70.7 | 78.4 | 72.94 |
| | AIME 2025 | 87.5 | 82.6 | 92.3 | 85.83 |
| | GPQA | 81 | 70.7 | 81.1 | 72.78 |

3. Usage

Environment Setup

You have two options to set up the environment:

Option 1: Using Base Image + Manual Installation

# Use the base SGLang image
docker run -it --gpus all lmsysorg/sglang:v0.4.9.post1-cu126

# Clone and install our custom SGLang branch
# IMPORTANT: Only our specific SGLang version supports dots.vlm1 models
# NOTE: This installation must be done on EVERY node in your cluster
# We have submitted a PR to the main SGLang repository (currently under review):
# https://github.com/sgl-project/sglang/pull/8778
git clone --branch dots.vlm1.v1 https://github.com/rednote-hilab/sglang sglang
pip install -e sglang/python
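
After installing, you may want to confirm that the custom branch is actually in use. The following is a minimal sanity check, assuming a standard Python environment inside the container; it only verifies that the package imports and that the checkout is on the expected branch.

# Optional sanity check: the custom SGLang build should import and report a version
python3 -c "import sglang; print('sglang version:', sglang.__version__)"
# The clone should be on the dots.vlm1.v1 branch
git -C sglang rev-parse --abbrev-ref HEAD   # should print dots.vlm1.v1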

Option 2: Using Pre-built Image (Recommended)

# Use our pre-built image with dots.vlm1 support
docker run -it --gpus all rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126
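
For multi-node serving you will typically also want host networking, a larger shared-memory size, and a volume for the downloaded weights. The command below is only a sketch; the host path /data/models is an assumed location, not part of the official instructions.

# Example run with host networking, shared memory, and a weights volume (adjust paths to your setup)
docker run -it --gpus all \
    --network host \
    --shm-size 32g \
    -v /data/models:/data/models \
    rednotehilab/dots.vlm1_sglang:v0.4.9.post1-cu126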

Multi-Node Deployment

Our model supports distributed deployment across multiple machines. Here's how to set up a 2-node cluster:

Prerequisites:

  • Model: rednote-hilab/dots.vlm1.inst
  • Node 1 IP: 10.0.0.1 (master node)
  • Node 2 IP: 10.0.0.2 (worker node)

Node 1 (Master - rank 0):

# Recommend downloading model locally to avoid timeout during startup
# Use: huggingface-cli download rednote-hilab/dots.vlm1.inst --local-dir ./dots.vlm1.inst
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"  # or local path like ./dots.vlm1.inst
# Get actual IP address: hostname -I | awk '{print $1}' or ip route get 1 | awk '{print $7}'
export MASTER_IP="10.0.0.1"  # Replace with actual master node IP
export API_PORT=15553

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr $MASTER_IP:23456 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port $API_PORT \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8

Node 2 (Worker - rank 1):

# Use the same variables as defined in Node 1
export HF_MODEL_PATH="rednote-hilab/dots.vlm1.inst"
export MASTER_IP="10.0.0.1"  # Must match Node 1
export API_PORT=15553

python3 -m sglang.launch_server \
    --model-path $HF_MODEL_PATH \
    --tp 16 \
    --dist-init-addr $MASTER_IP:23456 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port $API_PORT \
    --context-length 65536 \
    --max-running-requests 64 \
    --disable-radix-cache \
    --mem-fraction-static 0.8 \
    --chunked-prefill-size -1 \
    --chat-template dots-vlm \
    --cuda-graph-max-bs 64 \
    --quantization fp8
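
Once both ranks are up, it is worth confirming that the server responds before sending real traffic. The checks below assume SGLang's standard health and model-info endpoints; adjust if your build exposes different routes.

# Run from any machine that can reach the master node
curl http://$MASTER_IP:$API_PORT/health            # expect an HTTP 200 once the model is loaded
curl http://$MASTER_IP:$API_PORT/get_model_info    # reports the loaded model path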

Configuration Parameters

Explanation of the key parameters:

  • --tp 16: Total tensor-parallel size of 16 GPUs across the cluster (8 GPUs on each of the two nodes in this example)
  • --nnodes 2: Total number of nodes in the cluster
  • --node-rank: Node identifier (0 for master, 1+ for workers)
  • --context-length 65536: Maximum context length
  • --quantization fp8: Use FP8 quantization for efficiency
  • --chat-template dots-vlm: Use custom chat template for dots.vlm model

API Usage

Once the servers are launched, you can access the model through an OpenAI-compatible API:

# Use the same MASTER_IP and API_PORT as defined above
curl -X POST http://$MASTER_IP:$API_PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Please briefly describe this image"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ],
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": 55000
    }'
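
To send a local image instead of a URL, you can embed it as a base64 data URI, which OpenAI-compatible vision endpoints (including SGLang's) generally accept in the image_url field. The snippet below is a sketch under that assumption; ./example.jpg is a placeholder path.

# Send a local file as a base64 data URI (GNU base64 shown; use `base64 -i` on macOS)
IMAGE_B64=$(base64 -w 0 ./example.jpg)
curl -X POST http://$MASTER_IP:$API_PORT/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data-binary @- <<EOF
{
    "model": "model",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please briefly describe this image"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMAGE_B64}"}}
            ]
        }
    ],
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 2048
}
EOF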