VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models
Abstract
VideoFrom3D synthesizes high-quality 3D scene videos using a combination of image and video diffusion models, achieving style consistency without requiring paired datasets.
In this paper, we propose VideoFrom3D, a novel framework for synthesizing high-quality 3D scene videos from coarse geometry, a camera trajectory, and a reference image. Our approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables. A straightforward approach to synthesizing a video from coarse geometry might condition a video diffusion model on geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes due to the difficulty of jointly modeling visual quality, motion, and temporal consistency. To address this, we propose a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, our framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates intermediate frames using a video diffusion model, enhanced by flow-based camera control and structural guidance. Notably, both modules operate without any paired dataset of 3D scene models and natural images, which is extremely difficult to obtain. Comprehensive experiments show that our method produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines.
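For readers who want the shape of the pipeline at a glance, below is a minimal Python sketch of how the two stages described in the abstract compose. All names, signatures, and the stub generators (`sag_generate`, `ggi_inbetween`) are illustrative assumptions made for this page, not the authors' released implementation.

```python
from typing import Callable, List, Sequence

def video_from_3d(
    anchor_structures: Sequence,   # geometry renders at sparse anchor cameras
    segment_structures: Sequence,  # geometry renders along each anchor-to-anchor segment
    reference_image,               # appearance/style reference image
    sag_generate: Callable,        # image-diffusion anchor-view generator (SAG stage)
    ggi_inbetween: Callable,       # video-diffusion generative inbetweener (GGI stage)
) -> List:
    # 1) SAG: produce high-quality, cross-view consistent anchor views from the
    #    rendered structure of the coarse geometry plus the reference image.
    anchor_views = sag_generate(anchor_structures, reference_image)

    # 2) GGI: interpolate the frames between each pair of consecutive anchors,
    #    guided by the geometry renders of that camera segment (the paper also
    #    adds flow-based camera control at this step).
    frames: List = []
    for start, end, structure in zip(anchor_views, anchor_views[1:], segment_structures):
        frames.extend(ggi_inbetween(start, end, structure))
    return frames
```

The point of the split is visible in the signatures: the image diffusion model only has to deliver a few consistent, high-quality anchors, while the video diffusion model only has to fill short, structure-guided gaps between them.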
Community
VideoFrom3D synthesizes high-quality 3D scene videos given coarse geometry, a camera trajectory, and a reference image. This approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables.
A straightforward approach to synthesizing such a video might condition a video diffusion model on geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes due to the difficulty of jointly modeling visual quality, motion, and temporal consistency. To address this, VideoFrom3D combines the complementary strengths of high-quality visuals from an image diffusion model and strong temporal consistency from a video diffusion model, producing photorealistic, style-consistent videos across diverse scenarios.
Project page: https://kimgeonung.github.io/VideoFrom3D
Code: https://github.com/KIMGEONUNG/VideoFrom3D
ArXiv: https://arxiv.org/abs/2509.17985
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control (2025)
- Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing (2025)
- GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors (2025)
- S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix (2025)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy (2025)
- SPATIALGEN: Layout-guided 3D Indoor Scene Generation (2025)
- Tinker: Diffusion's Gift to 3D-Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization (2025)