Commit 92b586b (verified), committed by pcuenq (HF staff) · 1 parent: fea2189

Model card update (#2)


- Model card update (e31c4f0f64dfb40c1c979376ed86838a48e1aff5)

Files changed (1)
  1. README.md +4 -2
README.md CHANGED

@@ -16,6 +16,10 @@ widget:
 
 # DepthPro: Monocular Depth Estimation
 
+![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
+
+This is the transformers version of DepthPro, a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. For the checkpoint compatible with the original codebase, please check [this repo](https://huggingface.co/apple/DepthPro).
+
 ## Table of Contents
 
 - [DepthPro: Monocular Depth Estimation](#depthpro-monocular-depth-estimation)
@@ -34,8 +38,6 @@ widget:
 
 ## Model Details
 
-![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
-
 DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
 
 The abstract from the paper is the following:
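
Since the updated card describes this repository as the transformers-compatible version of DepthPro, a short usage sketch may help readers of this commit. This is a minimal example, not part of the diff above; it assumes the `apple/DepthPro-hf` checkpoint id and the `DepthProForDepthEstimation` / `DepthProImageProcessorFast` classes that transformers provides for this model.

```python
import torch
import requests
from PIL import Image
from transformers import DepthProForDepthEstimation, DepthProImageProcessorFast

# Load an example image (any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Checkpoint id assumed here; substitute the repo this model card belongs to.
checkpoint = "apple/DepthPro-hf"
image_processor = DepthProImageProcessorFast.from_pretrained(checkpoint)
model = DepthProForDepthEstimation.from_pretrained(checkpoint)

# Preprocess, run the multi-scale ViT encoder and DPT-like fusion head,
# then post-process the prediction back to the original image resolution.
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

post_processed = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)]
)
depth = post_processed[0]["predicted_depth"]  # metric depth map, shape (H, W)
print(depth.shape, depth.min().item(), depth.max().item())
```

The `depth-estimation` pipeline wraps the same preprocessing, forward pass, and post-processing steps if a one-liner is preferred.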