
Add image recognition support, just like Qwen2.5-VL-32B-Instruct

#26
by devopsML - opened

Hi there,

Would it be possible for the devs to add image visual reasoning, as in predecessors such as QVQ-Max or Qwen2.5-VL-32B-Instruct? Many of this model's top competitors, like GPT-4.1 or Gemini 2.5 Pro, already have image visual reasoning plus CoT reasoning.

There was a time gap between Qwen2.5 and Qwen2.5-VL. I think they could make one for the Qwen3 family.

Then we hope that, instead of just adding a separate model for that, the devs will be able to merge the image visual feature into the Qwen3 family of models.

You may need to read the Qwen-VL technical report.

Please post a link to the Qwen-VL technical report here.

Qwen3 might already be natively multimodal: it accepts images on their website, and the tokenizer also has image tokens.

Perhaps the vision encoder is just not ready yet.
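
For anyone who wants to check the tokenizer claim themselves, here is a rough sketch. It assumes vision-related special tokens are wrapped in `<|...|>` and contain a word like "vision", "image", or "video"; that naming pattern is an assumption, not an official list. The helper names `find_vision_tokens` and `load_vocab` are made up for illustration, and the repo id is whatever you pass on the command line.

```python
from typing import Dict, List, Sequence

# Substrings that hint at vision-related special tokens.
# This is an assumption for illustration, not an official list.
VISION_HINTS = ("vision", "image", "video")


def find_vision_tokens(vocab: Dict[str, int],
                       hints: Sequence[str] = VISION_HINTS) -> List[str]:
    """Return tokens of the form <|...|> whose name contains one of the hints."""
    matches = [
        tok for tok in vocab
        if tok.startswith("<|") and tok.endswith("|>")
        and any(h in tok.lower() for h in hints)
    ]
    return sorted(matches)


def load_vocab(model_id: str) -> Dict[str, int]:
    """Download the tokenizer for `model_id` and return its vocabulary."""
    # Imported lazily so find_vision_tokens works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return tokenizer.get_vocab()  # includes added special tokens


if __name__ == "__main__":
    import sys

    # Usage: python check_tokens.py <hub-repo-id>
    if len(sys.argv) > 1:
        print(find_vision_tokens(load_vocab(sys.argv[1])))
```

Note that finding such tokens only shows the vocabulary reserves slots for them. It does not show that the released weights include a vision encoder.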
