Model card update (#2)
- Model card update (e31c4f0f64dfb40c1c979376ed86838a48e1aff5)
README.md
CHANGED
@@ -16,6 +16,10 @@ widget:

# DepthPro: Monocular Depth Estimation

+![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
+
+This is the transformers version of DepthPro, a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. For the checkpoint compatible with the original codebase, please check [this repo](https://huggingface.co/apple/DepthPro).
+
## Table of Contents

- [DepthPro: Monocular Depth Estimation](#depthpro-monocular-depth-estimation)
@@ -34,8 +38,6 @@ widget:

## Model Details

-![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
-
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

The abstract from the paper is the following:
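
Since the card now points readers to the transformers port, a minimal usage sketch may help for orientation. It is not taken from this diff: the repository id `apple/DepthPro-hf`, the Auto classes, and the `post_process_depth_estimation` call follow the usual transformers depth-estimation API and should be checked against the card's own usage section.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

# Checkpoint id is assumed for illustration; substitute this repository's id.
checkpoint = "apple/DepthPro-hf"

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

# Any RGB image works; this COCO sample is only an example input.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the raw prediction back to the input resolution.
post_processed = image_processor.post_process_depth_estimation(
    outputs, target_sizes=[(image.height, image.width)]
)
predicted_depth = post_processed[0]["predicted_depth"]  # metric depth, torch.Tensor
print(predicted_depth.shape)
```

If only a rendered depth map is needed, the `pipeline("depth-estimation", model=checkpoint)` shortcut should give equivalent results with less code.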
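
The Model Details paragraph summarizes the multi-scale encoder at a high level. The snippet below is a conceptual sketch of that patching scheme only, using a dummy stand-in encoder and invented scales and patch size; it is not the DepthPro or transformers implementation, where the shared encoder is a Dinov2 ViT and the per-scale features are refined by a DPT-like fusion stage.

```python
import torch
import torch.nn.functional as F

def toy_encoder(patches):
    # Stand-in for the shared patch encoder (a Dinov2 ViT in the real model):
    # maps (N, 3, P, P) patches to low-resolution (N, 3, P/16, P/16) feature maps.
    return F.avg_pool2d(patches, kernel_size=16)

def multiscale_features(image, patch_size=96, scales=(1.0, 0.5, 0.25)):
    """Downsample the image to several scales, cut each scale into fixed-size
    patches, encode every patch with the same shared encoder, and stitch the
    patch features back into one feature map per scale (batch size 1 assumed)."""
    per_scale = []
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        _, _, h, w = scaled.shape
        h, w = h - h % patch_size, w - w % patch_size  # crop to a multiple of the patch size
        scaled = scaled[:, :, :h, :w]
        rows, cols = h // patch_size, w // patch_size

        # Split into non-overlapping patches and run them through the shared encoder.
        patches = scaled.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, patch_size, patch_size)
        feats = toy_encoder(patches)

        # Reassemble patch features into a single per-scale feature map.
        p = feats.shape[-1]
        feats = feats.reshape(1, rows, cols, 3, p, p).permute(0, 3, 1, 4, 2, 5)
        per_scale.append(feats.reshape(1, 3, rows * p, cols * p))
    return per_scale

# The per-scale maps would then be upsampled to a common resolution and fused
# (DPT-style) into a single high-resolution depth prediction.
features = multiscale_features(torch.rand(1, 3, 384, 384))
print([tuple(f.shape) for f in features])
```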