Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Abstract
Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner, which proves effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or at 1280x720 on 8xH100, for videos up to one minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2
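The abstract's generation loop can be sketched as follows. This is an illustrative toy, not the paper's implementation: `OneStepGenerator`, `stream_video`, the placeholder latent shape, and the dummy computation inside the generator are all assumptions standing in for the 8B one-step model, the real KV cache, and the interactive control interface.

```python
import numpy as np

LATENT_SHAPE = (4, 52, 92)  # hypothetical latent dimensions, not from the paper


class OneStepGenerator:
    """Stand-in for the one-step (1NFE) generator. The real model is a causal
    diffusion transformer post-trained adversarially; here a placeholder
    computation mimics producing one latent frame per call."""

    def __call__(self, noise, control, kv_cache):
        # A real model would attend over kv_cache (keys/values of all past
        # frames) and condition on the interactive control in a single pass.
        frame = noise * 0.1 + control  # placeholder for the network forward
        kv_cache.append(frame)         # cache this frame for future steps
        return frame, kv_cache


def stream_video(generator, controls):
    """Autoregressively emit one latent frame per interactive control,
    using exactly one generator call (1NFE) per frame."""
    kv_cache = []
    for step, control in enumerate(controls):
        noise = np.random.default_rng(step).standard_normal(LATENT_SHAPE)
        frame, kv_cache = generator(noise, control, kv_cache)
        yield frame  # streamed to the user before the next frame is made


gen = OneStepGenerator()
frames = list(stream_video(gen, controls=[0.0, 1.0, -1.0]))
print(len(frames))  # one latent frame per control input
```

The key property the sketch illustrates is that each frame costs one forward pass and past frames are reused only through the cache, which is what makes real-time streaming feasible.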
Community
The following related papers were recommended by the Semantic Scholar API:
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion (2025)
- Playing with Transformer at 30+ FPS via Next-Frame Diffusion (2025)
- TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models (2025)
- Learning World Models for Interactive Video Generation (2025)
- MAGI-1: Autoregressive Video Generation at Scale (2025)
- SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training (2025)
- LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer (2025)