Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images
Abstract
Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for practical use in autonomous navigation, this must be done as close to real time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to other single-task and joint-architecture methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, with a low memory footprint. All code for training and prediction can be found at this link: https://github.com/Malga-Vision/Co-SemDepth
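The abstract describes a joint architecture with a shared pipeline producing both a depth map and a semantic segmentation map from a single monocular frame. The snippet below is a minimal sketch of that general design pattern (a shared encoder with two decoder heads); it is not the Co-SemDepth implementation, and all layer sizes and names are hypothetical. See the linked repository for the actual architecture and training code.

```python
# Minimal sketch (NOT the authors' implementation): a shared-encoder,
# two-head network that jointly predicts a depth map and per-pixel
# class logits from one RGB image. Layer sizes are illustrative only.
import torch
import torch.nn as nn


class JointDepthSegNet(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        # Shared convolutional encoder (hypothetical, kept tiny for illustration)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Depth decoder head: upsamples back to input resolution, 1-channel output
        self.depth_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )
        # Segmentation decoder head: per-pixel class logits
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        feats = self.encoder(x)           # shared features computed once
        depth = self.depth_head(feats)    # (B, 1, H, W) depth prediction
        seg = self.seg_head(feats)        # (B, num_classes, H, W) class logits
        return depth, seg


if __name__ == "__main__":
    model = JointDepthSegNet(num_classes=8)
    rgb = torch.randn(1, 3, 256, 256)     # dummy aerial frame
    depth, seg = model(rgb)
    print(depth.shape, seg.shape)         # [1, 1, 256, 256], [1, 8, 256, 256]
```

Sharing the encoder is what lets both maps be produced in a single forward pass, which is why joint architectures of this kind can stay small and fast compared with running two separate networks.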
Community
A fast joint architecture for monocular depth estimation and semantic segmentation with a low number of parameters (5.2 M) that performs competitively on aerial datasets such as MidAir and AeroScapes
Code: https://github.com/Malga-Vision/Co-SemDepth
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding (2025)
- High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight (2025)
- Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion (2025)
- H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision (2025)
- LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar (2025)
- VGGT: Visual Geometry Grounded Transformer (2025)
- Distilling Monocular Foundation Model for Fine-grained Depth Completion (2025)