
Introduction

Introducing SAIL-VL-1.7-Thinking-2507, our latest reasoning model, which achieves state-of-the-art (SOTA) results on the OpenCompass reasoning benchmarks among comparably sized models. Its architecture combines a SAILVIT vision encoder with a Qwen3-2B/7B language model, trained with the DAPO algorithm on a curated dataset of over 70,000 multimodal STEM examples. We release the model as open source to support community research.

Performance

| Model | Size | Average | DynaMath | LogicVista | MathVerse | MathVision | WeMath | MathVista_MINI |
|---|---|---|---|---|---|---|---|---|
| VLAA-Thinker-3B (Previous SOTA) | 3B | 35.4 | 18.2 | 38.5 | 36.4 | 24.4 | 33.8 | 61.0 |
| InternVL3-2B | 2B | 29.1 | 14.8 | 34.7 | 24.5 | 20.2 | 22.9 | 57.6 |
| Qwen2.5-VL-3B | 3B | 31.8 | 13.2 | 40.3 | 31.2 | 21.9 | 22.9 | 61.2 |
| SAIL-VL-1.7-Thinking-2B-2507 | 2B | 36.2 | 19.4 | 35.8 | 42.3 | 24.5 | 27.4 | 67.7 |
| WeThink-7B (Previous SOTA) | 8B | 44.3 | 24.8 | 51.2 | 44.2 | 26.0 | 48.0 | 71.7 |
| InternVL3-8B | 8B | 41.4 | 25.7 | 44.5 | 38.5 | 30.0 | 39.5 | 70.5 |
| Qwen2.5-VL-7B | 7B | 40.1 | 21.8 | 47.9 | 41.1 | 25.4 | 36.2 | 68.1 |
| SAIL-VL-1.7-Thinking-8B-2507 | 8B | 45.8 | 29.6 | 43.6 | 57.1 | 31.7 | 39.6 | 73.4 |

Inference

This section shows how to run inference with our model using the transformers library. We recommend python==3.10, torch>=2.6.0, and transformers==4.52.3 as the development environment.
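A minimal setup with pip might look like the following (the version pins follow the recommendation above; pillow is an assumption, needed here only for image loading):

pip install "torch>=2.6.0" "transformers==4.52.3" pillow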

import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from PIL import Image

model_path = "your model path"

# Load the tokenizer, processor, and model. trust_remote_code is required
# because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
device = torch.cuda.current_device()
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device)
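# Optional alternative (an assumption, not part of the official instructions):
# with the accelerate package installed, device_map="auto" lets transformers
# place the weights across the available devices automatically.
# model = AutoModel.from_pretrained(
#     model_path, trust_remote_code=True,
#     torch_dtype=torch.bfloat16, device_map="auto",
# )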

print("##### with images")
messages = [
    {"role": "user", "content": [{"type": "image", "image": 'image_path'}, 
    {"type": "text", "text": "describe the image"}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image_path = 'your image path'
image = Image.open(image_path)
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split('<|im_end|>')[0].strip()
print(response)

print("##### without images")
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "中国的首都是哪里?"}]
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=None, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device).to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response = response.split('<|im_end|>')[0].strip()
print(response)
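Since this is a reasoning model, a budget of 512 new tokens can truncate long chains of thought. A minimal sketch of an alternative generate call follows; the token budget and sampling values are illustrative assumptions, not official recommendations for this model.

# Allow a longer reasoning trace and sample instead of decoding greedily.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,   # assumed budget for long chain-of-thought outputs
    do_sample=True,
    temperature=0.6,       # illustrative values, not tuned for this model
    top_p=0.95,
)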

License

This project is licensed under the Apache License 2.0.

Contact

If you have any questions, please feel free to contact us: [email protected]
