Introduction
Introducing SAIL-VL-1.7-Thinking-2507, our latest reasoning model, which achieves SOTA on the OpenCompass reasoning benchmark among comparably sized models. Its architecture combines a SAIL-ViT vision encoder with a Qwen3-2B/7B language model, trained with the DAPO algorithm on a curated dataset of over 70,000 multimodal STEM examples. We release the model as open source to benefit the community.
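For readers unfamiliar with DAPO, the sketch below illustrates its core idea: a token-level PPO-style policy loss with a decoupled, asymmetric clipping range (eps_high > eps_low, the "clip-higher" trick). The function name, tensor shapes, and epsilon defaults are our illustrative assumptions, not the actual SAIL-VL training code.

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level policy loss with DAPO's asymmetric ("clip-higher") clipping.

    logp_new / logp_old: (num_tokens,) log-probs of the sampled tokens under
    the current and behavior policies. advantages: (num_tokens,) group-
    normalized advantages broadcast to every token of a response. The epsilon
    defaults here are illustrative, not the values used to train SAIL-VL.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO-style objective, averaged over tokens (not per sequence),
    # so longer responses contribute proportionally more gradient signal.
    return -torch.mean(torch.minimum(ratio * advantages, clipped * advantages))
```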
Performance
| Model | Size | Average | DynaMath | LogicVista | MathVerse | MathVision | WeMath | MathVista_MINI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VLAA-Thinker-3B (Previous SOTA) | 3B | 35.4 | 18.2 | 38.5 | 36.4 | 24.4 | 33.8 | 61.0 |
| InternVL3-2B | 2B | 29.1 | 14.8 | 34.7 | 24.5 | 20.2 | 22.9 | 57.6 |
| Qwen2.5-VL-3B | 3B | 31.8 | 13.2 | 40.3 | 31.2 | 21.9 | 22.9 | 61.2 |
| SAIL-VL-1.7-Thinking-2B-2507 | 2B | 36.2 | 19.4 | 35.8 | 42.3 | 24.5 | 27.4 | 67.7 |
| WeThink-7B (Previous SOTA) | 8B | 44.3 | 24.8 | 51.2 | 44.2 | 26.0 | 48.0 | 71.7 |
| InternVL3-8B | 8B | 41.4 | 25.7 | 44.5 | 38.5 | 30.0 | 39.5 | 70.5 |
| Qwen2.5-VL-7B | 7B | 40.1 | 21.8 | 47.9 | 41.1 | 25.4 | 36.2 | 68.1 |
| SAIL-VL-1.7-Thinking-8B-2507 | 8B | 45.8 | 29.6 | 43.6 | 57.1 | 31.7 | 39.6 | 73.4 |
Inference
Below we show how to run inference with our model using the transformers library. We recommend python=3.10, torch>=2.6.0, and transformers==4.52.3 as the development environment.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from PIL import Image

model_path = "your model path"

# Load the tokenizer, processor, and model (bfloat16, on the current GPU).
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
device = torch.cuda.current_device()
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

print("##### with images")
image_path = "your image path"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path},
                                 {"type": "text", "text": "describe the image"}]}
]
# Render the chat template into a prompt string, then pack text and image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split("<|im_end|>")[0].strip()
print(response)

print("##### without images")
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of China?"}],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Pass images=None so the processor builds a text-only batch.
inputs = processor(images=None, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split("<|im_end|>")[0].strip()
print(response)
```
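Since this is a reasoning model, the decoded response may begin with an explicit thinking trace. Assuming Qwen3-style `<think>...</think>` delimiters (an assumption on our part; check your checkpoint's actual output format), you can separate the trace from the final answer with a small helper:

```python
def split_thinking(response: str):
    # Hypothetical helper: assumes the model wraps its reasoning in
    # Qwen3-style <think> ... </think> tags before the final answer.
    if "</think>" in response:
        thinking, answer = response.split("</think>", 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking(response)
print("reasoning trace:", thinking)
print("final answer:", answer)
```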
License
This project is licensed under the Apache License 2.0.
Contact
If you have any questions, please feel free to contact us: [email protected]