VGGT-X: When VGGT Meets Dense Novel View Synthesis
Abstract
VGGT-X addresses the VRAM and output-quality barriers that arise when scaling 3D Foundation Models to dense Novel View Synthesis, without relying on COLMAP.
We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in NVS powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders-of-magnitude speedups over the traditional pipeline and great potential for online NVS, but most of their validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: a dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, which incorporates a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment that enhances VGGT outputs, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. We further analyze the causes of the remaining gap with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/
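For intuition, here is a minimal sketch of the bounded-VRAM inference pattern that a memory-efficient implementation revolves around: gradient-free bf16 execution with immediate per-chunk CPU offload. The `vggt` callable and `frames` tensor are hypothetical stand-ins, not the authors' API; note also that VGGT attends across all frames jointly, so the real memory savings must come from restructuring the attention computation itself rather than from chunking the input as naively as shown here.

```python
# Sketch only: bounded-VRAM inference over a long image sequence.
# `vggt` is a hypothetical callable returning a dict of per-frame
# predictions (e.g., poses and point maps); not the authors' actual API.
import torch

def chunked_inference(vggt, frames, chunk_size=64, device="cuda"):
    """Run a VGGT-style model over many frames while capping peak VRAM."""
    outputs = []
    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        for start in range(0, len(frames), chunk_size):
            chunk = frames[start:start + chunk_size].to(device, non_blocking=True)
            pred = vggt(chunk)
            # Offload results to CPU right away so GPU memory is freed
            # before the next chunk is processed.
            outputs.append({k: v.float().cpu() for k, v in pred.items()})
            del pred, chunk
            torch.cuda.empty_cache()
    return outputs
```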
Community
TL;DR: We present VGGT-X to explore what happens when a 3D Foundation Model is applied to dense Novel View Synthesis (NVS). It incorporates a memory-efficient VGGT implementation, an adaptive global alignment for VGGT output enhancement, and high-quality 3DGS training practices; a sketch of the alignment primitive follows below.
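As a hedged illustration of the classic primitive that a global-alignment stage typically builds on, the sketch below computes the closed-form Umeyama similarity transform (scale, rotation, translation) registering one predicted point set to another. This is not VGGT-X's actual adaptive procedure, only the standard building block, and it assumes known point correspondences.

```python
# Sketch only: closed-form similarity alignment (Umeyama, 1991),
# the usual primitive underlying global alignment of point maps.
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # guard against reflections
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given `s, R, t`, a chunk's points and camera centers can be mapped into a common frame via `s * (R @ p) + t` before being used to initialize 3DGS training.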
Project Page: https://dekuliutesla.github.io/vggt-x.github.io/
GitHub Repository: https://github.com/Linketic/VGGT-X
arXiv: https://arxiv.org/abs/2509.25191
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views (2025)
- LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos (2025)
- Segmentation-Driven Initialization for Sparse-view 3D Gaussian Splatting (2025)
- Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images (2025)
- Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework (2025)
- Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds (2025)
- Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos (2025)