Commit 92b586b (verified), committed by pcuenq (HF staff) · 1 parent: fea2189

Model card update (#2)


- Model card update (e31c4f0f64dfb40c1c979376ed86838a48e1aff5)

Files changed (1)
  1. README.md +4 -2
README.md CHANGED

@@ -16,6 +16,10 @@ widget:
 
 # DepthPro: Monocular Depth Estimation
 
+![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
+
+This is the transformers version of DepthPro, a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. For the checkpoint compatible with the original codebase, please check [this repo](https://huggingface.co/apple/DepthPro).
+
 ## Table of Contents
 
 - [DepthPro: Monocular Depth Estimation](#depthpro-monocular-depth-estimation)
@@ -34,8 +38,6 @@ widget:
 
 ## Model Details
 
-![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
-
 DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
 
 The abstract from the paper is the following:
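
Since the updated card describes this repository as the transformers-compatible version of DepthPro, a short usage sketch may help readers of this commit. This is a minimal example, not part of the diff above; it assumes the `apple/DepthPro-hf` checkpoint id and the `DepthProForDepthEstimation` / `DepthProImageProcessorFast` classes that transformers provides for this model.

```python
import torch
import requests
from PIL import Image
from transformers import DepthProForDepthEstimation, DepthProImageProcessorFast

# Load an example image (any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Checkpoint id assumed here; substitute the repo this model card belongs to.
checkpoint = "apple/DepthPro-hf"
image_processor = DepthProImageProcessorFast.from_pretrained(checkpoint)
model = DepthProForDepthEstimation.from_pretrained(checkpoint)

# Preprocess, run the multi-scale ViT encoder and DPT-like fusion head,
# then post-process the prediction back to the original image resolution.
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

post_processed = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)]
)
depth = post_processed[0]["predicted_depth"]  # metric depth map, shape (H, W)
print(depth.shape, depth.min().item(), depth.max().item())
```

The `depth-estimation` pipeline wraps the same preprocessing, forward pass, and post-processing steps if a one-liner is preferred.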