TAPIP3D: Tracking Any Point in Persistent 3D Geometry
Abstract
We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camera-stabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera motion is effectively canceled. TAPIP3D iteratively refines multi-frame 3D motion estimates within this stabilized representation, enabling robust tracking over extended periods. To manage the inherent irregularity of 3D point distributions, we propose a Local Pair Attention mechanism. This 3D contextualization strategy exploits spatial relationships in 3D, forming informative feature neighborhoods for precise 3D trajectory estimation. By replacing the conventional 2D square correlation neighborhoods used in prior 2D and 3D trackers with these 3D neighborhoods, our 3D-centric approach significantly outperforms existing 3D point tracking methods and, when accurate depth is available, even surpasses conventional 2D pixel trackers in 2D tracking accuracy. It supports inference in both camera coordinates (i.e., unstabilized) and world coordinates, and our results demonstrate that compensating for camera motion improves tracking performance across various 3D point tracking benchmarks. Project Page: https://tapip3d.github.io
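To make the two core ideas of the abstract concrete, below is a minimal PyTorch sketch of (1) lifting per-pixel features into a camera-stabilized world space using depth and camera poses, and (2) forming k-nearest-neighbor feature neighborhoods in 3D as an irregular analogue of a 2D square correlation window. This is not the authors' implementation; the function names, tensor shapes, and the value of k are illustrative assumptions.

```python
# Illustrative sketch only, not TAPIP3D's codebase.
import torch

def unproject_to_world(depth, intrinsics, cam_to_world):
    """Lift a depth map (H, W) into world-space 3D points (H*W, 3).

    depth:        (H, W) metric depth per pixel
    intrinsics:   (3, 3) camera intrinsic matrix K
    cam_to_world: (4, 4) camera-to-world extrinsic matrix
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Back-project pixels through K^{-1}, scaled by depth -> camera coords.
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    cam_pts = (torch.linalg.inv(intrinsics) @ pix.T).T * depth.reshape(-1, 1)
    # Apply the camera-to-world transform so camera motion is canceled:
    # static scene points keep the same world coordinates across frames.
    cam_h = torch.cat([cam_pts, torch.ones_like(cam_pts[:, :1])], dim=-1)
    return (cam_to_world @ cam_h.T).T[:, :3]

def knn_neighborhoods(query_xyz, cloud_xyz, cloud_feat, k=16):
    """Gather the k nearest world-space neighbors of each query point.

    query_xyz:  (Q, 3) current 3D track estimates
    cloud_xyz:  (N, 3) world-space positions of the lifted feature cloud
    cloud_feat: (N, C) features lifted from the 2D feature map
    Returns (Q, k, 3) relative offsets and (Q, k, C) neighbor features.
    """
    dists = torch.cdist(query_xyz, cloud_xyz)       # (Q, N) pairwise distances
    idx = dists.topk(k, largest=False).indices      # (Q, k) nearest indices
    rel = cloud_xyz[idx] - query_xyz[:, None, :]    # (Q, k, 3) offsets
    return rel, cloud_feat[idx]
```

A full tracker along these lines would attend over the (offset, feature) pairs per frame, in the spirit of the Local Pair Attention mechanism, and iteratively update `query_xyz` within the stabilized representation; the choice of k and the helper names here are hypothetical.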
Community
Long-term feed-forward 3D point tracking in persistent 3D point maps. Project page: https://tapip3d.github.io/
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World (2025)
- POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction (2025)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better (2025)
- Stereo Any Video: Temporally Consistent Stereo Matching (2025)
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos (2025)
- Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video (2025)
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction (2025)