
Commit 587adfc

Add support for Depth Anything (huggingface#534)

* Add support for `DPTImageProcessor`
* Add support for depth anything model
* Update list of `depth_anything` models
* Update processor test model id

1 parent 4fb23f2 commit 587adfc

File tree: 6 files changed, +92 −1 lines changed

README.md (1 addition, 0 deletions)

@@ -287,6 +287,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.

docs/snippets/6_supported-models.snippet (1 addition, 0 deletions)

@@ -22,6 +22,7 @@
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.

scripts/supported_models.py (9 additions, 0 deletions)

@@ -408,6 +408,15 @@
             'Intel/dpt-large',
         ],
     },
+    'depth_anything': {
+        # Depth estimation
+        # NOTE: requires --task depth-estimation
+        'depth-estimation': [
+            'LiheYoung/depth-anything-small-hf',
+            'LiheYoung/depth-anything-base-hf',
+            'LiheYoung/depth-anything-large-hf',
+        ],
+    },
     'electra': {
         # Feature extraction
         'feature-extraction': [

src/models.js (11 additions, 0 deletions)

@@ -4027,6 +4027,16 @@ export class DPTModel extends DPTPreTrainedModel { }
 export class DPTForDepthEstimation extends DPTPreTrainedModel { }
 //////////////////////////////////////////////////
 
+//////////////////////////////////////////////////
+export class DepthAnythingPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * Depth Anything Model with a depth estimation head on top (consisting of 3 convolutional layers) e.g. for KITTI, NYUv2.
+ */
+export class DepthAnythingForDepthEstimation extends DepthAnythingPreTrainedModel { }
+//////////////////////////////////////////////////
+
 //////////////////////////////////////////////////
 export class GLPNPreTrainedModel extends PreTrainedModel { }

@@ -5391,6 +5401,7 @@ const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([
 
 const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
     ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
+    ['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
     ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
 ])
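The second hunk above is how Transformers.js resolves a config's `model_type` string to a model class: each task keeps a `Map` from `model_type` to a `[class name, class constructor]` pair, so registering Depth Anything is a one-line addition. A minimal, self-contained sketch of that registry pattern, using empty stand-in classes in place of the real model classes:

```javascript
// Stand-in classes; in the library these are the real model classes
// declared earlier in src/models.js.
class DPTForDepthEstimation { }
class DepthAnythingForDepthEstimation { }
class GLPNForDepthEstimation { }

// Same shape as MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES in the diff:
// model_type -> [class name, class constructor].
const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
    ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
    ['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
    ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
]);

// Hypothetical helper (not from the diff) showing the lookup step:
// given a config's model_type, return the constructor to instantiate.
function resolve(model_type) {
    const entry = MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES.get(model_type);
    if (!entry) throw new Error(`Unsupported model type: ${model_type}`);
    return entry[1];
}

console.log(resolve('depth_anything').name); // → "DepthAnythingForDepthEstimation"
```

With this entry in place, a checkpoint whose `config.json` declares `"model_type": "depth_anything"` maps onto `DepthAnythingForDepthEstimation` for the depth-estimation task.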

src/processors.js (50 additions, 1 deletion)

@@ -164,6 +164,29 @@ function validate_audio_inputs(audio, feature_extractor) {
     }
 }
 
+/**
+ * Helper function to constrain a value to be a multiple of a number.
+ * @param {number} val The value to constrain.
+ * @param {number} multiple The number to constrain to.
+ * @param {number} [minVal=0] The minimum value to constrain to.
+ * @param {number} [maxVal=null] The maximum value to constrain to.
+ * @returns {number} The constrained value.
+ * @private
+ */
+function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
+    let x = Math.round(val / multiple) * multiple;
+
+    if (maxVal !== null && x > maxVal) {
+        x = Math.floor(val / multiple) * multiple;
+    }
+
+    if (x < minVal) {
+        x = Math.ceil(val / multiple) * multiple;
+    }
+
+    return x;
+}
+
 /**
  * Base class for feature extractors.
  *

@@ -465,7 +488,31 @@ export class ImageFeatureExtractor extends FeatureExtractor {
 
         } else if (size !== undefined && size.width !== undefined && size.height !== undefined) {
             // If `width` and `height` are set, resize to those dimensions
-            return [size.width, size.height];
+
+            let newWidth = size.width;
+            let newHeight = size.height;
+
+            // Custom for DPT models
+            if (this.config.keep_aspect_ratio && this.config.ensure_multiple_of) {
+
+                // determine new height and width
+                let scale_height = size.height / srcHeight;
+                let scale_width = size.width / srcWidth;
+
+                // scale as little as possible
+                if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
+                    // fit width
+                    scale_height = scale_width;
+                } else {
+                    // fit height
+                    scale_width = scale_height;
+                }
+
+                newHeight = constraint_to_multiple_of(scale_height * srcHeight, this.config.ensure_multiple_of);
+                newWidth = constraint_to_multiple_of(scale_width * srcWidth, this.config.ensure_multiple_of);
+            }
+
+            return [newWidth, newHeight];
 
         } else if (this.size_divisibility !== undefined) {
             // Rounds the height and width down to the closest multiple of size_divisibility

@@ -699,6 +746,7 @@ export class SegformerFeatureExtractor extends ImageFeatureExtractor {
         return toReturn;
     }
 }
+export class DPTImageProcessor extends ImageFeatureExtractor { }
 export class BitImageProcessor extends ImageFeatureExtractor { }
 export class DPTFeatureExtractor extends ImageFeatureExtractor { }
 export class GLPNFeatureExtractor extends ImageFeatureExtractor { }

@@ -1881,6 +1929,7 @@ export class AutoProcessor {
         ConvNextImageProcessor,
         SegformerFeatureExtractor,
         BitImageProcessor,
+        DPTImageProcessor,
         DPTFeatureExtractor,
         GLPNFeatureExtractor,
         BeitFeatureExtractor,
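The resize logic added above can be exercised standalone: the processor keeps the aspect ratio by picking whichever scale factor is closer to 1, then snaps both sides to multiples of `ensure_multiple_of` (14 for Depth Anything). A self-contained sketch, with `get_resize_dims` as a hypothetical wrapper around the same arithmetic (the real code reads `size` and `ensure_multiple_of` from the processor config):

```javascript
// Snap `val` to the nearest multiple of `multiple`, staying within
// [minVal, maxVal] — same logic as the helper added in src/processors.js.
function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
    let x = Math.round(val / multiple) * multiple;
    if (maxVal !== null && x > maxVal) {
        x = Math.floor(val / multiple) * multiple;
    }
    if (x < minVal) {
        x = Math.ceil(val / multiple) * multiple;
    }
    return x;
}

// Hypothetical wrapper mirroring the keep_aspect_ratio + ensure_multiple_of
// branch: scale as little as possible, then snap both sides.
function get_resize_dims(srcWidth, srcHeight, size, ensure_multiple_of) {
    let scale_height = size.height / srcHeight;
    let scale_width = size.width / srcWidth;

    // Pick the scale factor closer to 1 so the image is distorted as
    // little as possible, then apply it to both dimensions.
    if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
        scale_height = scale_width; // fit width
    } else {
        scale_width = scale_height; // fit height
    }

    return [
        constraint_to_multiple_of(scale_width * srcWidth, ensure_multiple_of),
        constraint_to_multiple_of(scale_height * srcHeight, ensure_multiple_of),
    ];
}

// A 640x480 image with size {width: 518, height: 518} and
// ensure_multiple_of = 14 — the case exercised by the new processor test.
console.log(get_resize_dims(640, 480, { width: 518, height: 518 }, 14)); // → [ 686, 518 ]
```

Here the height scale (518/480 ≈ 1.079) is closer to 1 than the width scale (518/640 ≈ 0.809), so the height is fitted; 1.079 × 640 ≈ 690.7 then snaps down to 686, the nearest multiple of 14 — matching the `[1, 3, 518, 686]` pixel dims asserted in the new test below.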

tests/processors.test.js (20 additions, 0 deletions)

@@ -39,6 +39,7 @@ describe('Processors', () => {
         detr: 'facebook/detr-resnet-50',
         yolos: 'hustvl/yolos-small-300',
         dpt: 'Intel/dpt-hybrid-midas',
+        dpt_2: 'LiheYoung/depth-anything-small-hf',
         glpn: 'vinvino02/glpn-kitti',
         nougat: 'facebook/nougat-small',
         owlvit: 'google/owlvit-base-patch32',

@@ -407,6 +408,25 @@ describe('Processors', () => {
             compare(reshaped_input_sizes, [[224, 224]]);
         }
     }, MAX_TEST_EXECUTION_TIME);
+
+    // DPTImageProcessor
+    //  - tests ensure_multiple_of
+    //  - tests keep_aspect_ratio
+    it(MODELS.dpt_2, async () => {
+        const processor = await AutoProcessor.from_pretrained(m(MODELS.dpt_2))
+
+        {
+            const image = await load_image(TEST_IMAGES.cats);
+            const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image);
+
+            compare(pixel_values.dims, [1, 3, 518, 686]);
+            compare(avg(pixel_values.data), 0.30337387323379517);
+
+            compare(original_sizes, [[480, 640]]);
+            compare(reshaped_input_sizes, [[518, 686]]);
+        }
+    }, MAX_TEST_EXECUTION_TIME);
+
 });

 describe('Audio processors', () => {

0 commit comments