TUN3D: Towards Real-World Scene Understanding from Unposed Images
Abstract
TUN3D is a method for joint layout estimation and 3D object detection using multi-view images without depth sensors or ground-truth camera poses, achieving state-of-the-art performance on indoor scene understanding benchmarks.
Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
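To make the idea of a parametric wall representation concrete, here is a minimal, hypothetical sketch: the abstract does not specify TUN3D's actual parameterization, so this assumes a common one in which each wall is a 2D line segment in the floor plane plus a height, from which the full 3D wall quad can be recovered. The `Wall` class and its fields are illustrative, not the paper's API.

```python
# Hypothetical parametric wall representation (an assumption, not TUN3D's
# actual formulation): a wall is a 2D floor-plane segment plus a height.
from dataclasses import dataclass
import math


@dataclass
class Wall:
    x1: float  # segment start in the floor plane
    y1: float
    x2: float  # segment end in the floor plane
    y2: float
    height: float  # wall height above the floor (z = 0)

    def length(self) -> float:
        """Horizontal extent of the wall."""
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    def corners(self) -> list[tuple[float, float, float]]:
        """Four 3D corners of the wall quad, bottom edge first."""
        return [
            (self.x1, self.y1, 0.0),
            (self.x2, self.y2, 0.0),
            (self.x2, self.y2, self.height),
            (self.x1, self.y1, self.height),
        ]


wall = Wall(0.0, 0.0, 4.0, 3.0, height=2.5)
print(wall.length())      # 5.0
print(wall.corners()[2])  # (4.0, 3.0, 2.5)
```

A segment-plus-height parameterization like this keeps the layout head's output low-dimensional (five numbers per wall) while still yielding a full 3D surface, which is one plausible reason a parametric representation is effective for layout estimation.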
Community
TUN3D works with GT point clouds, posed images (with known camera poses), or fully unposed image sets (without poses or depths).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations (2025)
- Sparse Multiview Open-Vocabulary 3D Detection (2025)
- A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding (2025)
- You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation (2025)
- No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views (2025)
- SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views (2025)
- G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration (2025)