AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
Abstract
A two-stage paradigm adapts pre-trained Text-to-Video models for viewpoint prediction in 4D scenes by integrating an adaptive learning branch and a camera extrinsic diffusion branch.
Recent Text-to-Video (T2V) models have demonstrated a powerful capability for visually simulating real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm that adapts pre-trained T2V models for viewpoint prediction in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene representation is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is introduced on top of the pre-trained T2V model, taking the generated video and the 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models for 4D interaction in the real world.
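To make the two-stage design concrete, here is a minimal, hypothetical PyTorch sketch: a stage-1 branch that injects viewpoint-agnostic 4D scene tokens into frozen T2V features, and a stage-2 branch that denoises a camera-extrinsic sequence under the hybrid condition of generated-video and scene tokens. The module names, feature shapes, and attention-based conditioning are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSceneBranch(nn.Module):
    """Stage 1 (assumed): fuse a viewpoint-agnostic 4D scene token sequence
    into the features of a frozen T2V backbone via learned cross-attention."""
    def __init__(self, dim=1024):
        super().__init__()
        self.scene_proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, video_tokens, scene_tokens):
        scene = self.scene_proj(scene_tokens)
        fused, _ = self.cross_attn(video_tokens, scene, scene)
        return video_tokens + fused  # residual injection into the T2V features

class CameraExtrinsicDenoiser(nn.Module):
    """Stage 2 (assumed): predict the noise on a camera-extrinsic sequence,
    conditioned jointly on the generated-video tokens and the scene tokens."""
    def __init__(self, dim=1024, pose_dim=12):  # one flattened 3x4 extrinsic per frame
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, dim)
        self.cond_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pose_out = nn.Linear(dim, pose_dim)

    def forward(self, noisy_poses, video_tokens, scene_tokens):
        cond = torch.cat([video_tokens, scene_tokens], dim=1)  # hybrid condition
        h = self.pose_in(noisy_poses)
        h, _ = self.cond_attn(h, cond, cond)
        return self.pose_out(h)  # predicted noise on the extrinsic sequence
```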
Community
TL;DR: We propose AdaViewPlanner, a framework that adapts pre-trained text-to-video models for automatic viewpoint planning in 4D scenes. Given 4D content and text prompts describing the scene context and desired camera motion, our model can generate coordinate-aligned camera pose sequences along with corresponding video visualizations. Leveraging the priors of video generation models, AdaViewPlanner demonstrates strong capability for smooth, diverse, instruction-following, and human-centric viewpoint planning in 4D scenes.
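For illustration only, a hypothetical call pattern under the same assumptions as the sketch above (the `AdaptiveSceneBranch` and `CameraExtrinsicDenoiser` classes are assumed to be in scope); it only shows the rough input/output shapes, with random tensors standing in for real T2V features and scene tokens.

```python
import torch

B, T, N, dim = 1, 16, 256, 1024  # batch, frames, tokens per modality, feature width
video_tokens = torch.randn(B, N, dim)  # stand-in for frozen T2V backbone features
scene_tokens = torch.randn(B, N, dim)  # viewpoint-agnostic 4D scene tokens

stage1 = AdaptiveSceneBranch(dim)
stage2 = CameraExtrinsicDenoiser(dim)

fused = stage1(video_tokens, scene_tokens)           # scene-conditioned video features
noisy_poses = torch.randn(B, T, 12)                  # one noisy 3x4 extrinsic per frame
pred = stage2(noisy_poses, fused, scene_tokens)      # -> (B, T, 12) denoising prediction
```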
The following similar papers were recommended by the Semantic Scholar API:
- Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation (2025)
- MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction (2025)
- 4D Driving Scene Generation With Stereo Forcing (2025)
- Generating Human Motion Videos using a Cascaded Text-to-Video Framework (2025)
- CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion (2025)