InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Abstract
InfiniPot-V is a training-free, query-agnostic framework that compresses the key-value cache during video encoding to enforce a fixed memory cap for streaming video understanding, sustaining real-time generation at or above full-cache accuracy.
Modern multimodal large language models (MLLMs) can reason over hour-long videos, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs, four long-video benchmarks, and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV-cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap to on-device streaming video assistants.
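To make the mechanism concrete, here is a minimal single-head PyTorch sketch of the threshold-triggered compression pass described above. The function names (`compress_kv`, `stream`, `encode_chunk`), the frame layout, the head dimension, and the even TaR/VaN budget split are illustrative assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def compress_kv(keys, values, budget, frame_len):
    """One compression pass for a single attention head.

    keys, values: [num_tokens, head_dim], laid out frame by frame
    (frame_len tokens per frame). budget: number of tokens to keep.
    The even TaR/VaN split below is an illustrative assumption.
    """
    num_frames = keys.shape[0] // frame_len
    k = keys[: num_frames * frame_len].view(num_frames, frame_len, -1)

    # TaR: cosine similarity between each patch token's key and the key at
    # the same spatial position in the previous frame. High similarity
    # means temporally redundant, so low-TaR tokens are retained.
    tar = F.cosine_similarity(k[1:], k[:-1], dim=-1)           # [F-1, frame_len]
    tar = torch.cat([tar.new_zeros(1, frame_len), tar]).flatten()

    # VaN: L2 norm of each value vector as a query-agnostic proxy for
    # semantic significance; high-VaN tokens are retained.
    van = values[: num_frames * frame_len].norm(dim=-1)

    n_tar = budget // 2
    keep_tar = torch.topk(tar, n_tar, largest=False).indices   # least redundant
    keep_van = torch.topk(van, budget - n_tar).indices         # most significant
    # torch.unique returns sorted indices, preserving temporal order;
    # overlap between the two sets may keep slightly fewer than budget tokens.
    keep = torch.unique(torch.cat([keep_tar, keep_van]))

    return keys[keep], values[keep]

# Streaming driver: encode chunk by chunk; whenever the cache crosses the
# user-set threshold, compress it back down so peak memory stays fixed
# regardless of stream length. encode_chunk is a hypothetical stand-in for
# the MLLM's vision-token prefill; head_dim=128 is likewise assumed.
def stream(chunks, threshold, budget, frame_len, encode_chunk, head_dim=128):
    k_cache = torch.empty(0, head_dim)
    v_cache = torch.empty(0, head_dim)
    for chunk in chunks:
        k_new, v_new = encode_chunk(chunk, k_cache, v_cache)
        k_cache = torch.cat([k_cache, k_new])
        v_cache = torch.cat([v_cache, v_new])
        if k_cache.shape[0] >= threshold:
            k_cache, v_cache = compress_kv(k_cache, v_cache, budget, frame_len)
    return k_cache, v_cache
```

Because both scores are computed from the keys and values alone, the pass needs no user query and no retraining, which is what lets the cap hold for arbitrarily long streams.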
Community
InfiniPot-V enables memory-constrained streaming video processing through spatiotemporal/query-agnostic KV cache compression. Code will be released soon.
https://github.com/aiha-lab/InfiniPot-V
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval (2025)
- METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding (2025)
- HoliTom: Holistic Token Merging for Fast Video Large Language Models (2025)
- Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models (2025)
- KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (2025)
- MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models (2025)
- R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration (2025)