STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Abstract
STream3R reformulates 3D reconstruction as a decoder-only Transformer problem, using causal attention to efficiently process image sequences and outperform existing methods in both static and dynamic scenes.
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
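To make the causal-attention idea in the abstract concrete, here is a minimal NumPy sketch. It is not the STream3R implementation: the abstract does not describe the attention layout, so the sketch assumes a frame-level causal mask in which every token of a frame may attend to tokens of the same frame and of earlier frames. The function names (`frame_causal_mask`, `masked_attention`, `streaming_attention`) and the single-head setup are hypothetical and chosen only for illustration.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean (N, N) mask with N = num_frames * tokens_per_frame.

    mask[q, k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    (Assumed layout; the source does not specify it.)
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def streaming_attention(q, k, v, num_frames, tokens_per_frame):
    """Process one frame at a time against the keys/values seen so far.

    The slices k[:end] and v[:end] stand in for a key/value cache that
    grows by one frame per step.
    """
    outputs = []
    for t in range(num_frames):
        start, end = t * tokens_per_frame, (t + 1) * tokens_per_frame
        visible = np.ones((tokens_per_frame, end), dtype=bool)
        outputs.append(masked_attention(q[start:end], k[:end], v[:end], visible))
    return np.concatenate(outputs, axis=0)
```

Under this assumption, attending over the whole sequence with the mask gives the same result as processing frames one by one against a growing cache, which is what would let earlier frames be computed once and reused as new frames arrive.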
Community
TL;DR: STream3R reformulates dense 3D reconstruction into a sequential registration task with causal attention.
Project Page
Code
Epic work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Streaming 4D Visual Geometry Transformer (2025)
- LONG3R: Long Sequence Streaming 3D Reconstruction (2025)
- Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory (2025)
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second (2025)
- No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views (2025)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025)
- Dens3R: A Foundation Model for 3D Geometry Prediction (2025)