Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Abstract
TON, a two-stage training strategy combining supervised fine-tuning with thought dropout and Group Relative Policy Optimization, reduces unnecessary reasoning steps in vision-language models without sacrificing performance.
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, in which reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and (ii) a GRPO stage that lets the model freely explore when to think and when not to, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO while maintaining or even improving performance. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
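The 'thought dropout' operation in stage (i) can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag format, the function names, and the dropout probability are assumptions made for illustration, since the abstract only states that reasoning traces are randomly replaced with empty thoughts.

```python
import random
import re

# Assumed delimiters: the abstract does not say how thoughts are marked up.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
EMPTY_THOUGHT = "<think>\n\n</think>"


def thought_dropout(response: str, drop_prob: float, rng: random.Random) -> str:
    """With probability drop_prob, swap the reasoning trace for an empty thought."""
    if rng.random() < drop_prob:
        # A callable replacement avoids any escape processing of the template.
        return THINK_RE.sub(lambda _match: EMPTY_THOUGHT, response, count=1)
    return response


def apply_thought_dropout(responses, drop_prob=0.5, seed=0):
    """Apply thought dropout to a list of SFT target strings (hypothetical helper)."""
    rng = random.Random(seed)
    return [thought_dropout(r, drop_prob, rng) for r in responses]


if __name__ == "__main__":
    sample = "<think>The image shows 3 apples and 2 pears.</think><answer>5</answer>"
    for target in apply_thought_dropout([sample] * 4, drop_prob=0.5, seed=0):
        print(repr(target))
```

Under these assumptions, the fine-tuning targets contain a mix of full and empty thoughts, so the model sees both formats before the GRPO stage. The answer span is left untouched in every case.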
Community
TL;DR: Teach multimodal models To think, Or Not to think (TON).
ArXiv: https://arxiv.org/abs/2505.16854
GitHub: https://github.com/kokolerk/TON
HF dataset and models: https://huggingface.co/collections/kolerk/ton-682ad9038395c21e228a645b
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL (2025)
- Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning (2025)
- SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning (2025)
- Thinkless: LLM Learns When to Think (2025)
- UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning (2025)
- Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement (2025)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)