An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Abstract
Large Multimodal Models (LMMs) perceive video frames uniformly, which is computationally inefficient for videos whose temporal information density inherently varies. This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into cubes of varying size using Gumbel Softmax, followed by a unified resampling of each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (an overall 45× compression rate) while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by up to 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using at most 5% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law in the model's capabilities. We also verify empirically that the segments generated by the cubing network can help in analyzing continuous events in videos.
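To make the partitioning idea concrete, here is a minimal sketch of how Gumbel-Softmax sampling could turn per-frame scores into contiguous cubes of varying size. It is an illustration under stated assumptions, not the paper's cubing network: the function names `gumbel_softmax` and `partition_into_cubes`, the per-frame `boundary_logits`, and the two-way "start a new cube / continue" choice are all hypothetical. The sketch also omits the per-cube resampling step and any gradient handling used in training.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `logits` (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def partition_into_cubes(boundary_logits, tau=1.0, seed=0):
    """Split frame indices into contiguous cubes.

    boundary_logits[i] is a hypothetical score that a new cube starts at frame i.
    """
    rng = random.Random(seed)
    cubes = []
    for i, logit in enumerate(boundary_logits):
        # Two-way choice per frame: [start a new cube, continue the current cube].
        probs = gumbel_softmax([logit, 0.0], tau=tau, rng=rng)
        starts_new = probs[0] > probs[1]
        if i == 0 or starts_new:
            cubes.append([i])
        else:
            cubes[-1].append(i)
    return cubes


# Hypothetical scores: high values mark frames where the content likely changes.
logits = [0.0, -4.0, -4.0, 5.0, -4.0, 5.0, -4.0, -4.0]
cubes = partition_into_cubes(logits, tau=0.5, seed=0)
print(cubes)
```

Because the boundaries are sampled, the exact partition depends on the seed and temperature. What the sketch guarantees is that every frame lands in exactly one cube and that each cube is a contiguous run of frames, so low-density stretches of video can be grouped into long cubes and dense stretches into short ones.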
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- Improving LLM Video Understanding with 16 Frames Per Second (2025)
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding (2025)
- Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation (2025)
- Token-Efficient Long Video Understanding for Multimodal LLMs (2025)
- Efficient Motion-Aware Video MLLM (2025)
- Slow-Fast Architecture for Video Multi-Modal Large Language Models (2025)
Good work! Very insightful. I saw your work at ICLR 2025. Hope to see it published at the next venue.