
Commit 06fd797

Add ZoeDepth (huggingface#30136)
* First draft
* Add docs
* Clean up code
* Convert model
* Add image processor
* Convert Zoe_K
* More improvements
* Improve variable names and docstrings
* Improve variable names
* Improve variable names
* Replace nn.sequential
* More improvements
* Convert ZoeD_NK
* Fix most tests
* Verify pixel values
* Verify pixel values
* Add squeeze
* Update beit to support arbitrary window sizes
* Improve image processor
* Improve docstring
* Improve beit
* Improve model outputs
* Add figure
* Fix beit
* Update checkpoint
* Fix repo id
* Add _keys_to_ignore_on_load_unexpected
* More improvements
* Address comments
* Address comments
* Address comments
* Address comments
* Rename variable name
* Add backbone_hidden_size
* Vectorize
* Vectorize more
* Address comments
* Clarify docstring
* Remove backbone_hidden_size
* Fix image processor
* Remove print statements
* Remove print statement
* Add integration test
* Address comments
* Address comments
* Address comments
* Address comments
* Add requires_backends
* Clean up
* Simplify conversion script
* Simplify more
* Simplify more
* Simplify more
* Clean up
* Make sure beit is loaded correctly
* Address comment
* Address bin_configurations
* Use bin_configurations
* Convert models, add integration tests
* Fix doc test
* Address comments
* Unify regressor classes
* Clarify arguments
* Improve resize_image
* Add num_relative_features
* Address comment
* [run-slow]beit,data2vec,zoedepth
* [run-slow]beit,data2vec,zoedepth
* Address comments
* Address comment
* Address comment
* Replace nn.TransformerEncoderLayer and nn.TransformerEncoder
* Replace nn.MultiheadAttention
* Add attributes for patch transformer to config
* Add tests for ensure_multiple_of
* Update organization
* Add tests
* [run-slow] beit data2vec
* Update ruff
* [run-slow] beit data2vec
* Add comment
* Improve docstrings, add test
* Fix interpolate_pos_encoding
* Fix slow tests
* Add docstring
* Update src/transformers/models/zoedepth/image_processing_zoedepth.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Update src/transformers/models/zoedepth/image_processing_zoedepth.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)
* Improve tests and docstrings
* Use run_common_tests
* Improve docstrings
* Improve docstrings
* Improve tests
* Improve tests
* Remove print statements

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
1 parent 1082361 commit 06fd797

23 files changed, +3360 −76 lines changed

docs/source/en/_toctree.yml

+2
```diff
@@ -667,6 +667,8 @@
       title: ViTMSN
     - local: model_doc/yolos
       title: YOLOS
+    - local: model_doc/zoedepth
+      title: ZoeDepth
     title: Vision models
   - isExpanded: false
     sections:
```

docs/source/en/index.md

+1
```diff
@@ -343,5 +343,6 @@ Flax), PyTorch, and/or TensorFlow.
 | [XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2) ||||
 | [YOLOS](model_doc/yolos) ||||
 | [YOSO](model_doc/yoso) ||||
+| [ZoeDepth](model_doc/zoedepth) ||||
 
 <!-- End table-->
```

docs/source/en/model_doc/zoedepth.md

+108
@@ -0,0 +1,108 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# ZoeDepth

## Overview

The ZoeDepth model was proposed in [ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth](https://arxiv.org/abs/2302.12288) by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the [DPT](dpt) framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head with a novel bin adjustment design, called the metric bins module, is used for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

The abstract from the paper is the following:

*This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/zoedepth_architecture_bis.png"
alt="drawing" width="600"/>

<small> ZoeDepth architecture. Taken from the <a href="https://arxiv.org/abs/2302.12288">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/isl-org/ZoeDepth).

## Usage tips

- ZoeDepth is an absolute (also called metric) depth estimation model, unlike DPT which is a relative depth estimation model. This means that ZoeDepth is able to estimate depth in metric units like meters.

The easiest way to perform inference with ZoeDepth is by leveraging the [pipeline API](../main_classes/pipelines.md):

```python
from transformers import pipeline
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

pipe = pipeline(task="depth-estimation", model="Intel/zoedepth-nyu-kitti")
result = pipe(image)
depth = result["depth"]
```
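The pipeline returns a dictionary containing both a rendered depth image and the raw model prediction. A minimal sketch of inspecting that output (assuming the usual `depth`/`predicted_depth` keys of the depth-estimation pipeline):

```python
# "depth" is a PIL image rescaled for visualization, "predicted_depth" is the raw tensor
predicted_depth = result["predicted_depth"]
print(predicted_depth.shape)

# save the rendered depth map to disk
depth.save("zoedepth_depth.png")
```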

Alternatively, one can also perform inference using the model and image processor classes:

```python
from transformers import AutoImageProcessor, ZoeDepthForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("Intel/zoedepth-nyu-kitti")
model = ZoeDepthForDepthEstimation.from_pretrained("Intel/zoedepth-nyu-kitti")

# prepare image for the model
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
```
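Since ZoeDepth predicts metric depth, the interpolated prediction can be read directly as distances. A minimal sketch reusing `prediction` from the snippet above (assuming the values are expressed in meters):

```python
depth_map = prediction.squeeze()  # (height, width) tensor of metric depth values
height, width = depth_map.shape
center_depth = depth_map[height // 2, width // 2].item()
print(f"Estimated depth at the image center: {center_depth:.2f} m")
```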

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ZoeDepth.

- A demo notebook regarding inference with ZoeDepth models can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ZoeDepth). 🌎

## ZoeDepthConfig

[[autodoc]] ZoeDepthConfig

## ZoeDepthImageProcessor

[[autodoc]] ZoeDepthImageProcessor
    - preprocess

## ZoeDepthForDepthEstimation

[[autodoc]] ZoeDepthForDepthEstimation
    - forward

src/transformers/__init__.py

+14
```diff
@@ -807,6 +807,7 @@
     "models.xmod": ["XmodConfig"],
     "models.yolos": ["YolosConfig"],
     "models.yoso": ["YosoConfig"],
+    "models.zoedepth": ["ZoeDepthConfig"],
     "onnx": [],
     "pipelines": [
         "AudioClassificationPipeline",
@@ -1182,6 +1183,7 @@
     _import_structure["models.vitmatte"].append("VitMatteImageProcessor")
     _import_structure["models.vivit"].append("VivitImageProcessor")
     _import_structure["models.yolos"].extend(["YolosFeatureExtractor", "YolosImageProcessor"])
+    _import_structure["models.zoedepth"].append("ZoeDepthImageProcessor")
 
 try:
     if not is_torchvision_available():
@@ -3586,6 +3588,12 @@
             "YosoPreTrainedModel",
         ]
     )
+    _import_structure["models.zoedepth"].extend(
+        [
+            "ZoeDepthForDepthEstimation",
+            "ZoeDepthPreTrainedModel",
+        ]
+    )
     _import_structure["optimization"] = [
         "Adafactor",
         "AdamW",
@@ -5497,6 +5505,7 @@
     from .models.xmod import XmodConfig
     from .models.yolos import YolosConfig
     from .models.yoso import YosoConfig
+    from .models.zoedepth import ZoeDepthConfig
 
     # Pipelines
     from .pipelines import (
@@ -5872,6 +5881,7 @@
         from .models.vitmatte import VitMatteImageProcessor
         from .models.vivit import VivitImageProcessor
         from .models.yolos import YolosFeatureExtractor, YolosImageProcessor
+        from .models.zoedepth import ZoeDepthImageProcessor
 
     try:
         if not is_torchvision_available():
@@ -7798,6 +7808,10 @@
             YosoModel,
             YosoPreTrainedModel,
         )
+        from .models.zoedepth import (
+            ZoeDepthForDepthEstimation,
+            ZoeDepthPreTrainedModel,
+        )
 
     # Optimization
     from .optimization import (
```
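With these entries in the lazy import structure, the new classes become importable from the top-level `transformers` namespace. A minimal sketch (assuming a PyTorch installation, since the model classes live behind the torch backend check):

```python
from transformers import ZoeDepthConfig, ZoeDepthForDepthEstimation, ZoeDepthImageProcessor

# a default configuration; its model_type matches the auto-mapping key registered below
config = ZoeDepthConfig()
print(config.model_type)  # "zoedepth"
```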

src/transformers/image_utils.py

+5-5
```diff
@@ -409,22 +409,22 @@ def validate_preprocess_arguments(
 
     """
     if do_rescale and rescale_factor is None:
-        raise ValueError("rescale_factor must be specified if do_rescale is True.")
+        raise ValueError("`rescale_factor` must be specified if `do_rescale` is `True`.")
 
     if do_pad and size_divisibility is None:
         # Here, size_divisor might be passed as the value of size
         raise ValueError(
-            "Depending on moel, size_divisibility, size_divisor, pad_size or size must be specified if do_pad is True."
+            "Depending on the model, `size_divisibility`, `size_divisor`, `pad_size` or `size` must be specified if `do_pad` is `True`."
         )
 
     if do_normalize and (image_mean is None or image_std is None):
-        raise ValueError("image_mean and image_std must both be specified if do_normalize is True.")
+        raise ValueError("`image_mean` and `image_std` must both be specified if `do_normalize` is `True`.")
 
     if do_center_crop and crop_size is None:
-        raise ValueError("crop_size must be specified if do_center_crop is True.")
+        raise ValueError("`crop_size` must be specified if `do_center_crop` is `True`.")
 
     if do_resize and (size is None or resample is None):
-        raise ValueError("size and resample must be specified if do_resize is True.")
+        raise ValueError("`size` and `resample` must be specified if `do_resize` is `True`.")
 
 
 # In the future we can add a TF implementation here when we have TF models.
```
src/transformers/models/__init__.py

+1
```diff
@@ -263,4 +263,5 @@
     xmod,
     yolos,
     yoso,
+    zoedepth,
 )
```

src/transformers/models/auto/configuration_auto.py

+2
```diff
@@ -291,6 +291,7 @@
         ("xmod", "XmodConfig"),
         ("yolos", "YolosConfig"),
         ("yoso", "YosoConfig"),
+        ("zoedepth", "ZoeDepthConfig"),
     ]
 )
 
@@ -589,6 +590,7 @@
         ("xmod", "X-MOD"),
         ("yolos", "YOLOS"),
         ("yoso", "YOSO"),
+        ("zoedepth", "ZoeDepth"),
     ]
 )
```

src/transformers/models/auto/image_processing_auto.py

+1
```diff
@@ -142,6 +142,7 @@
         ("vitmatte", ("VitMatteImageProcessor",)),
         ("xclip", ("CLIPImageProcessor",)),
         ("yolos", ("YolosImageProcessor",)),
+        ("zoedepth", ("ZoeDepthImageProcessor",)),
     ]
 )
```

src/transformers/models/auto/modeling_auto.py

+1
```diff
@@ -792,6 +792,7 @@
         ("depth_anything", "DepthAnythingForDepthEstimation"),
         ("dpt", "DPTForDepthEstimation"),
         ("glpn", "GLPNForDepthEstimation"),
+        ("zoedepth", "ZoeDepthForDepthEstimation"),
     ]
 )
 MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
```
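Together with the config and image-processor registrations above, this mapping lets ZoeDepth checkpoints be resolved through the Auto API. A minimal sketch (using the `Intel/zoedepth-nyu-kitti` checkpoint referenced elsewhere in this commit):

```python
from transformers import AutoConfig, AutoImageProcessor, AutoModelForDepthEstimation

checkpoint = "Intel/zoedepth-nyu-kitti"
config = AutoConfig.from_pretrained(checkpoint)                    # resolves to ZoeDepthConfig
image_processor = AutoImageProcessor.from_pretrained(checkpoint)   # resolves to ZoeDepthImageProcessor
model = AutoModelForDepthEstimation.from_pretrained(checkpoint)    # resolves to ZoeDepthForDepthEstimation
```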

0 commit comments
