CyberV: Cybernetics for Test-time Scaling in Video Understanding
Abstract
CyberV enhances Multimodal Large Language Models for video understanding through an adaptive, self-monitoring, and feedback-driven framework, improving performance on video benchmarks such as VideoMMMU, VideoMME, and WorldSense.
Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations can be even more severe for models with fewer parameters. To address them, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the Sensor monitors the MLLM's forward process and collects intermediate interpretations, such as attention drift; the Controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capability in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
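To make the loop concrete, below is a minimal Python sketch of the monitor–decide–feedback cycle the abstract describes. The `generate_with_stats` and `resample_frames` interfaces, the confidence and attention-drift thresholds, and the feedback format are illustrative assumptions for this sketch, not the paper's actual implementation (see the linked repository for the released code).

```python
# Minimal sketch of a CyberV-style cybernetic loop at test time.
# All class names, method names, and thresholds here are illustrative
# assumptions, not the released implementation.

from dataclasses import dataclass

@dataclass
class Observation:
    """Intermediate signals collected by the Sensor during one forward pass."""
    answer: str
    confidence: float        # e.g. mean token probability of the predicted answer
    attention_drift: float   # e.g. divergence of visual attention from the query

class Sensor:
    def observe(self, mllm, frames, question) -> Observation:
        """Run one forward pass of the frozen MLLM and record intermediate signals.
        `generate_with_stats` is a hypothetical interface exposing these statistics."""
        answer, confidence, attention_drift = mllm.generate_with_stats(frames, question)
        return Observation(answer, confidence, attention_drift)

class Controller:
    def __init__(self, conf_thresh=0.8, drift_thresh=0.5):
        self.conf_thresh = conf_thresh
        self.drift_thresh = drift_thresh

    def needs_correction(self, obs: Observation) -> bool:
        """Decide whether to trigger another inference round."""
        return obs.confidence < self.conf_thresh or obs.attention_drift > self.drift_thresh

    def make_feedback(self, obs: Observation) -> dict:
        """Produce feedback that guides the next round (assumed format)."""
        return {"focus_hint": "re-examine low-attention frames", "extra_frames": 8}

def cybernetic_inference(mllm, video_frames, question, max_rounds=3):
    """Iterate sensing and control until the Controller is satisfied or rounds run out."""
    sensor, controller = Sensor(), Controller()
    feedback = None
    for _ in range(max_rounds):
        frames = mllm.resample_frames(video_frames, feedback)  # dynamic resource allocation
        obs = sensor.observe(mllm, frames, question)
        if not controller.needs_correction(obs):
            break
        feedback = controller.make_feedback(obs)               # guide the next round
    return obs.answer
```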
Community
We propose a novel framework called CyberV, inspired by cybernetic principles, that redesigns video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. It empowers small models to outperform proprietary systems like GPT-4o, and enables large open-source models to achieve state-of-the-art results on VideoMMMU.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet (2025)
- ProxyThinker: Test-Time Guidance through Small Visual Reasoners (2025)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security (2025)
- MINT: Memory-Infused Prompt Tuning at Test-time for CLIP (2025)
- TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling (2025)
- Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models (2025)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency (2025)