geetu040 committed
Commit 3ba7c8b · 1 Parent(s): 05a40b7

update model card

README.md CHANGED
@@ -1,77 +1,159 @@
  ---
  license: apple-ascl
  pipeline_tag: depth-estimation
  ---

  # DepthPro: Monocular Depth Estimation

- Install the required libraries:
- ```bash
- pip install -q numpy pillow torch torchvision
- pip install -q git+https://github.com/geetu040/transformers.git@depth-pro#egg=transformers
- ```

- Import the required libraries:
- ```py
- import requests
- from PIL import Image
- import torch
- import torch.nn as nn
- import torch.nn.functional as F
- from huggingface_hub import hf_hub_download
- import matplotlib.pyplot as plt
-
- # custom installation from this PR: https://github.com/huggingface/transformers/pull/34583
- # !pip install git+https://github.com/geetu040/transformers.git@depth-pro#egg=transformers
- from transformers import DepthProConfig, DepthProImageProcessorFast, DepthProForDepthEstimation
- ```

- Load the model and image processor:
- ```py
- checkpoint = "geetu040/DepthPro"
- revision = "main"
- image_processor = DepthProImageProcessorFast.from_pretrained(checkpoint, revision=revision)
- model = DepthProForDepthEstimation.from_pretrained(checkpoint, revision=revision)
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model = model.to(device)
- ```

- Inference:
- ```py
- # inference

- url = "https://huggingface.co/geetu040/DepthPro/resolve/main/assets/tiger.jpg"

  image = Image.open(requests.get(url, stream=True).raw)
- image = image.convert("RGB")

- # prepare image for the model
  inputs = image_processor(images=image, return_tensors="pt")
- inputs = {k: v.to(device) for k, v in inputs.items()}

  with torch.no_grad():
-     outputs = model(**inputs)

- # interpolate to original size
  post_processed_output = image_processor.post_process_depth_estimation(
-     outputs, target_sizes=[(image.height, image.width)],
  )

- # visualize the prediction
  depth = post_processed_output[0]["predicted_depth"]
  depth = (depth - depth.min()) / depth.max()
  depth = depth * 255.
  depth = depth.detach().cpu().numpy()
  depth = Image.fromarray(depth.astype("uint8"))

- # visualize the prediction
- fig, axes = plt.subplots(1, 2, figsize=(20, 20))
- axes[0].imshow(image)
- axes[0].set_title(f'Image {image.size}')
- axes[0].axis('off')
- axes[1].imshow(depth)
- axes[1].set_title(f'Depth {depth.size}')
- axes[1].axis('off')
- plt.subplots_adjust(wspace=0, hspace=0)
- plt.show()
  ```

  ---
+ library_name: transformers
  license: apple-ascl
+ tags:
+ - vision
+ - depth-estimation
  pipeline_tag: depth-estimation
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
+   example_title: Tiger
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
+   example_title: Teapot
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
+   example_title: Palace
  ---

  # DepthPro: Monocular Depth Estimation

+ ## Table of Contents

+ - [DepthPro: Monocular Depth Estimation](#depthpro-monocular-depth-estimation)
+   - [Table of Contents](#table-of-contents)
+   - [Model Details](#model-details)
+     - [Model Sources](#model-sources)
+   - [How to Get Started with the Model](#how-to-get-started-with-the-model)
+   - [Training Details](#training-details)
+     - [Training Data](#training-data)
+     - [Preprocessing](#preprocessing)
+     - [Training Hyperparameters](#training-hyperparameters)
+   - [Evaluation](#evaluation)
+     - [Model Architecture and Objective](#model-architecture-and-objective)
+   - [Citation](#citation)
+   - [Model Card Authors](#model-card-authors)

+ ## Model Details
+
+ ![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_teaser.png)
+
+ DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
+
+ The abstract from the paper is the following:
+
+ > We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.
+
+ This is the model card of a 🤗 [transformers](https://huggingface.co/docs/transformers/index) model that has been pushed to the Hub.

+ - **Developed by:** Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun.
+ - **Model type:** [DepthPro](https://huggingface.co/docs/transformers/main/en/model_doc/depth_pro)
+ - **License:** Apple-ASCL

+ ### Model Sources

+ <!-- Provide the basic links for the model. -->
+
+ - **HF Docs:** [DepthPro](https://huggingface.co/docs/transformers/main/en/model_doc/depth_pro)
+ - **Repository:** https://github.com/apple/ml-depth-pro
+ - **Paper:** https://arxiv.org/abs/2410.02073
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```python
+ import requests
+ from PIL import Image
+ import torch
+ from transformers import DepthProImageProcessorFast, DepthProForDepthEstimation
+
+ url = 'https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg'
  image = Image.open(requests.get(url, stream=True).raw)

+ image_processor = DepthProImageProcessorFast.from_pretrained("geetu040/DepthPro")
+ model = DepthProForDepthEstimation.from_pretrained("geetu040/DepthPro")
+
  inputs = image_processor(images=image, return_tensors="pt")

  with torch.no_grad():
+     outputs = model(**inputs)

  post_processed_output = image_processor.post_process_depth_estimation(
+     outputs, target_sizes=[(image.height, image.width)],
  )

+ fov = post_processed_output[0]["fov"]
  depth = post_processed_output[0]["predicted_depth"]
  depth = (depth - depth.min()) / depth.max()
  depth = depth * 255.
  depth = depth.detach().cpu().numpy()
  depth = Image.fromarray(depth.astype("uint8"))
+ ```
+
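+ The snippet above runs on CPU by default. A minimal sketch for GPU inference, reusing `image`, `image_processor`, and `model` from above:
+
+ ```python
+ import torch
+
+ # pick a device and move the model to it
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device)
+
+ # move the processed inputs to the same device as the model
+ inputs = image_processor(images=image, return_tensors="pt")
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+ ```
+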
+ ## Training Details

+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The DepthPro model was trained on the following datasets:
+
+ ![image/jpeg](assets/depth-pro-datasets.jpeg)
+
+ ### Preprocessing
+
+ Images go through the following preprocessing steps:
+ - rescaled by `1/255.`
+ - normalized with `mean=[0.5, 0.5, 0.5]` and `std=[0.5, 0.5, 0.5]`
+ - resized to `1536x1536` pixels
+
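+ For illustration, a rough manual equivalent of these steps in plain PyTorch (the `DepthProImageProcessorFast` used in the snippet above applies the corresponding transforms internally, so this sketch only makes the list concrete):
+
+ ```python
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ from PIL import Image
+
+ def preprocess(image: Image.Image) -> torch.Tensor:
+     x = torch.from_numpy(np.array(image.convert("RGB"))).permute(2, 0, 1).float()  # (3, H, W)
+     x = x / 255.0                # rescale to [0, 1]
+     x = (x - 0.5) / 0.5          # normalize with mean = std = 0.5 per channel
+     x = F.interpolate(x.unsqueeze(0), size=(1536, 1536), mode="bilinear", align_corners=False)  # resize
+     return x                     # (1, 3, 1536, 1536)
+ ```
+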
+ ### Training Hyperparameters
+
+ ![image/jpeg](assets/depth-pro-training-hyper-parameters.jpeg)
+
+ ## Evaluation
+
+ ![image/png](assets/depth-pro-results-depth.png)
+ ![image/png](assets/depth-pro-results-boundary.png)
+ ![image/png](assets/depth-pro-results-fov.png)
+
+ ### Model Architecture and Objective
+
+ ![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_pro_architecture.png)
+
+ The `DepthProForDepthEstimation` model uses a `DepthProEncoder` for encoding the input image and a `FeatureFusionStage` for fusing the output features from the encoder.
+
+ The `DepthProEncoder` further uses two encoders:
+ - `patch_encoder`
+   - The input image is scaled with multiple ratios, as specified in the `scaled_images_ratios` configuration.
+   - Each scaled image is split into smaller **patches** of size `patch_size` with overlapping areas determined by `scaled_images_overlap_ratios`.
+   - These patches are processed by the **`patch_encoder`**.
+ - `image_encoder`
+   - The input image is also rescaled to `patch_size` and processed by the **`image_encoder`**.
+
+ Both these encoders can be configured via `patch_model_config` and `image_model_config` respectively, each of which is a separate `Dinov2Model` by default.
+
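+ These knobs all live on the model configuration; a short sketch for inspecting them (attribute names follow the fields referenced above, see the `DepthProConfig` docs for the full list):
+
+ ```python
+ from transformers import DepthProConfig
+
+ config = DepthProConfig.from_pretrained("geetu040/DepthPro")
+
+ # ratios at which the input image is rescaled before being split into patches,
+ # and the overlap between neighbouring patches at each scale
+ print(config.scaled_images_ratios)
+ print(config.scaled_images_overlap_ratios)
+
+ # side length of the square patches fed to the patch encoder
+ print(config.patch_size)
+
+ # backbone configurations for the two encoders (Dinov2-style by default)
+ print(type(config.patch_model_config))
+ print(type(config.image_model_config))
+ ```
+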
+ Outputs from both encoders (`last_hidden_state`) and selected intermediate states (`hidden_states`) from the **`patch_encoder`** are fused by a `DPT`-based `FeatureFusionStage` for depth estimation.
+
+ The network is supplemented with a focal length estimation head. A small convolutional head ingests frozen features from the depth estimation network and task-specific features from a separate ViT image encoder to predict the horizontal angular field-of-view.
+
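+ The estimated field of view is also exposed by the post-processing step (the `fov` entry in the getting-started snippet). Under a standard pinhole model, and assuming `fov` is the horizontal field of view in degrees, it can be converted to a focal length in pixels with a small helper like this:
+
+ ```python
+ import math
+
+ def focal_length_px(fov_degrees: float, image_width: int) -> float:
+     """Pinhole relation: f = (W / 2) / tan(fov / 2)."""
+     return 0.5 * image_width / math.tan(0.5 * math.radians(fov_degrees))
+
+ # e.g. with the post-processed output from the snippet above:
+ # f_px = focal_length_px(float(fov), image.width)
+ ```
+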
+ ## Citation
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{bochkovskii2024depthprosharpmonocular,
+       title={Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
+       author={Aleksei Bochkovskii and Amaël Delaunoy and Hugo Germain and Marcel Santos and Yichao Zhou and Stephan R. Richter and Vladlen Koltun},
+       year={2024},
+       eprint={2410.02073},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2410.02073},
+ }
  ```
+
+ ## Model Card Authors
+
+ [Armaghan Shakir](https://huggingface.co/geetu040)
assets/architecture.jpg DELETED
Binary file (85 kB)
 
assets/depth-pro-datasets.jpeg ADDED
assets/depth-pro-results-boundary.png ADDED
assets/depth-pro-results-depth.png ADDED
assets/depth-pro-results-fov.png ADDED
assets/depth-pro-training-hyper-parameters.jpeg ADDED
assets/tiger.jpg DELETED
Binary file (433 kB)