Abstract
LongLive is a frame-level autoregressive framework for real-time and interactive long video generation, addressing efficiency and quality challenges through causal attention, KV-recache, streaming long tuning, and short window attention.
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence across prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design with three key components: a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and short window attention paired with a frame-level attention sink (abbreviated as frame sink), which preserves long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench for both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU, and further supports INT8-quantized inference with only marginal quality loss.
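To make the cache-management ideas in the abstract concrete, here is a minimal Python sketch of a frame-level KV cache that keeps a small "frame sink" plus a short rolling window, and performs a KV-recache when the user switches prompts. This is an illustrative assumption of how such a policy could look, not the paper's implementation: the names FrameKVCache, encode_frame_kv, SINK_FRAMES, and WINDOW_FRAMES, and the specific sizes, are all hypothetical placeholders.

```python
from collections import deque

SINK_FRAMES = 3      # assumed: number of "frame sink" frames that are always kept
WINDOW_FRAMES = 12   # assumed: length of the short attention window (recent frames)


def encode_frame_kv(frame_latent, prompt):
    """Placeholder for the model's per-frame KV computation under a given prompt."""
    return {"frame": frame_latent, "prompt": prompt}  # stands in for real K/V tensors


class FrameKVCache:
    """Frame-level KV cache: a fixed frame sink plus a rolling short window."""

    def __init__(self):
        self.sink = []                               # KV of the first few frames
        self.window = deque(maxlen=WINDOW_FRAMES)    # KV of the most recent frames

    def append(self, frame_kv):
        if len(self.sink) < SINK_FRAMES:
            self.sink.append(frame_kv)               # fill the frame sink first
        else:
            self.window.append(frame_kv)             # then roll the short window

    def context(self):
        # Attention context for the next frame = sink + short window,
        # keeping generation fast while long-range anchors are preserved.
        return self.sink + list(self.window)

    def recache(self, new_prompt):
        # KV-recache on a prompt switch: re-encode the retained frames' KV under
        # the new prompt, so subsequent frames attend to states that are
        # consistent with the new instruction (smooth, adherent transitions).
        self.sink = [encode_frame_kv(kv["frame"], new_prompt) for kv in self.sink]
        recached = [encode_frame_kv(kv["frame"], new_prompt) for kv in self.window]
        self.window = deque(recached, maxlen=WINDOW_FRAMES)
```

In this sketch, the sink caps long-range memory cost at a constant, the short window bounds per-frame attention, and recache touches only the retained entries rather than regenerating the video, which is what makes streaming prompt switches cheap.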
Community
TLDR: Turn interactive prompts into long videos—instantly, as you type!
Paper: https://arxiv.org/abs/2509.22622
Code: https://github.com/NVlabs/LongLive
Model: https://huggingface.co/Efficient-Large-Model/LongLive-1.3B
Demo Page: https://nvlabs.github.io/LongLive
Introduction Video: https://www.youtube.com/watch?v=CO1QC7BNvig
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation (2025)
- X-Streamer: Unified Human World Modeling with Audiovisual Interaction (2025)
- RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer (2025)
- Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation (2025)
- LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE (2025)
- Yan: Foundational Interactive Video Generation (2025)
- Mixture of Contexts for Long Video Generation (2025)