Introduction
Introducing SAIL-VL-1.7-Thinking-2507, our latest reasoning model, which achieves SOTA on the OpenCompass reasoning benchmark among comparably sized models. Its architecture combines a SAIL-ViT vision encoder with a Qwen3-2B/7B language model, trained with the DAPO algorithm on a curated dataset of over 70,000 multimodal STEM examples. We release the model as open source to benefit the community.
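For readers unfamiliar with DAPO, the sketch below illustrates its core idea: a token-level PPO-style policy loss with a decoupled, asymmetric clipping range (eps_high > eps_low, the "clip-higher" trick). The function name, tensor shapes, and epsilon defaults are our illustrative assumptions, not the actual SAIL-VL training code.

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level policy loss with DAPO's asymmetric ("clip-higher") clipping.

    logp_new / logp_old: (num_tokens,) log-probs of the sampled tokens under
    the current and behavior policies. advantages: (num_tokens,) group-
    normalized advantages broadcast to every token of a response. The epsilon
    defaults here are illustrative, not the values used to train SAIL-VL.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO-style objective, averaged over tokens (not per sequence),
    # so longer responses contribute proportionally more gradient signal.
    return -torch.mean(torch.minimum(ratio * advantages, clipped * advantages))
```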
Performance
| Model | Size | Average | DynaMath | LogicVista | MathVerse | MathVision | WeMath | MathVista_MINI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLAA-Thinker-3B (Previous SOTA) | 3B | 35.4 | 18.2 | 38.5 | 36.4 | 24.4 | 33.8 | 61.0 |
| InternVL3-2B | 2B | 29.1 | 14.8 | 34.7 | 24.5 | 20.2 | 22.9 | 57.6 |
| Qwen2.5-VL-3B | 3B | 31.8 | 13.2 | 40.3 | 31.2 | 21.9 | 22.9 | 61.2 |
| SAIL-VL-1.7-Thinking-2B-2507 | 2B | 36.2 | 19.4 | 35.8 | 42.3 | 24.5 | 27.4 | 67.7 |
| WeThink-7B (Previous SOTA) | 8B | 44.3 | 24.8 | 51.2 | 44.2 | 26.0 | 48.0 | 71.7 |
| InternVL3-8B | 8B | 41.4 | 25.7 | 44.5 | 38.5 | 30.0 | 39.5 | 70.5 |
| Qwen2.5-VL-7B | 7B | 40.1 | 21.8 | 47.9 | 41.1 | 25.4 | 36.2 | 68.1 |
| SAIL-VL-1.7-Thinking-8B-2507 | 8B | 45.8 | 29.6 | 43.6 | 57.1 | 31.7 | 39.6 | 73.4 |
Inference
Below we show how to run inference with our model using the transformers library. We recommend python=3.10, torch>=2.6.0, and transformers==4.52.3 as the development environment.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from PIL import Image

model_path = "your model path"

# Load the tokenizer, processor, and model (bfloat16, on the current GPU).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
device = torch.cuda.current_device()
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

print("##### with images")
image_path = "your image path"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path},
                                 {"type": "text", "text": "describe the image"}]}
]
# Render the chat template into a prompt string, then pack text and image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split("<|im_end|>")[0].strip()
print(response)

print("##### without images")
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of China?"}],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Pass images=None so the processor builds a text-only batch.
inputs = processor(images=None, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split("<|im_end|>")[0].strip()
print(response)
```
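Since this is a reasoning model, the decoded response may begin with an explicit thinking trace. Assuming Qwen3-style `<think>...</think>` delimiters (an assumption on our part; check your checkpoint's actual output format), you can separate the trace from the final answer with a small helper:

```python
def split_thinking(response: str):
    # Hypothetical helper: assumes the model wraps its reasoning in
    # Qwen3-style <think> ... </think> tags before the final answer.
    if "</think>" in response:
        thinking, answer = response.split("</think>", 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking(response)
print("reasoning trace:", thinking)
print("final answer:", answer)
```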
License
This project is licensed under the Apache License 2.0.
Contact
If you have any questions, please feel free to contact us: [email protected]