
Commit 587adfc

Add support for Depth Anything (huggingface#534)

* Add support for `DPTImageProcessor`
* Add support for depth anything model
* Update list of `depth_anything` models
* Update processor test model id

1 parent 4fb23f2 commit 587adfc

File tree: 6 files changed, +92 −1 lines changed

README.md (1 addition, 0 deletions)

@@ -287,6 +287,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.

docs/snippets/6_supported-models.snippet (1 addition, 0 deletions)

@@ -22,6 +22,7 @@
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
+1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.

scripts/supported_models.py (9 additions, 0 deletions)

@@ -408,6 +408,15 @@
             'Intel/dpt-large',
         ],
     },
+    'depth_anything': {
+        # Depth estimation
+        # NOTE: requires --task depth-estimation
+        'depth-estimation': [
+            'LiheYoung/depth-anything-small-hf',
+            'LiheYoung/depth-anything-base-hf',
+            'LiheYoung/depth-anything-large-hf',
+        ],
+    },
     'electra': {
         # Feature extraction
         'feature-extraction': [

src/models.js (11 additions, 0 deletions)

@@ -4027,6 +4027,16 @@ export class DPTModel extends DPTPreTrainedModel { }
 export class DPTForDepthEstimation extends DPTPreTrainedModel { }
 //////////////////////////////////////////////////
 
+//////////////////////////////////////////////////
+export class DepthAnythingPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * Depth Anything Model with a depth estimation head on top (consisting of 3 convolutional layers) e.g. for KITTI, NYUv2.
+ */
+export class DepthAnythingForDepthEstimation extends DepthAnythingPreTrainedModel { }
+//////////////////////////////////////////////////
+
 //////////////////////////////////////////////////
 export class GLPNPreTrainedModel extends PreTrainedModel { }

@@ -5391,6 +5401,7 @@ const MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = new Map([
 
 const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
     ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
+    ['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
     ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
 ])
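The second hunk above is how Transformers.js resolves a config's `model_type` string to a model class: each task keeps a `Map` from `model_type` to a `[class name, class constructor]` pair, so registering Depth Anything is a one-line addition. A minimal, self-contained sketch of that registry pattern, using empty stand-in classes in place of the real model classes:

```javascript
// Stand-in classes; in the library these are the real model classes
// declared earlier in src/models.js.
class DPTForDepthEstimation { }
class DepthAnythingForDepthEstimation { }
class GLPNForDepthEstimation { }

// Same shape as MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES in the diff:
// model_type -> [class name, class constructor].
const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
    ['dpt', ['DPTForDepthEstimation', DPTForDepthEstimation]],
    ['depth_anything', ['DepthAnythingForDepthEstimation', DepthAnythingForDepthEstimation]],
    ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
]);

// Hypothetical helper (not from the diff) showing the lookup step:
// given a config's model_type, return the constructor to instantiate.
function resolve(model_type) {
    const entry = MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES.get(model_type);
    if (!entry) throw new Error(`Unsupported model type: ${model_type}`);
    return entry[1];
}

console.log(resolve('depth_anything').name); // → "DepthAnythingForDepthEstimation"
```

With this entry in place, a checkpoint whose `config.json` declares `"model_type": "depth_anything"` maps onto `DepthAnythingForDepthEstimation` for the depth-estimation task.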

src/processors.js (50 additions, 1 deletion)

@@ -164,6 +164,29 @@ function validate_audio_inputs(audio, feature_extractor) {
     }
 }
 
+/**
+ * Helper function to constrain a value to be a multiple of a number.
+ * @param {number} val The value to constrain.
+ * @param {number} multiple The number to constrain to.
+ * @param {number} [minVal=0] The minimum value to constrain to.
+ * @param {number} [maxVal=null] The maximum value to constrain to.
+ * @returns {number} The constrained value.
+ * @private
+ */
+function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
+    let x = Math.round(val / multiple) * multiple;
+
+    if (maxVal !== null && x > maxVal) {
+        x = Math.floor(val / multiple) * multiple;
+    }
+
+    if (x < minVal) {
+        x = Math.ceil(val / multiple) * multiple;
+    }
+
+    return x;
+}
+
 /**
  * Base class for feature extractors.
  *

@@ -465,7 +488,31 @@ export class ImageFeatureExtractor extends FeatureExtractor {
 
         } else if (size !== undefined && size.width !== undefined && size.height !== undefined) {
             // If `width` and `height` are set, resize to those dimensions
-            return [size.width, size.height];
+
+            let newWidth = size.width;
+            let newHeight = size.height;
+
+            // Custom for DPT models
+            if (this.config.keep_aspect_ratio && this.config.ensure_multiple_of) {
+
+                // determine new height and width
+                let scale_height = size.height / srcHeight;
+                let scale_width = size.width / srcWidth;
+
+                // scale as little as possible
+                if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
+                    // fit width
+                    scale_height = scale_width;
+                } else {
+                    // fit height
+                    scale_width = scale_height;
+                }
+
+                newHeight = constraint_to_multiple_of(scale_height * srcHeight, this.config.ensure_multiple_of);
+                newWidth = constraint_to_multiple_of(scale_width * srcWidth, this.config.ensure_multiple_of);
+            }
+
+            return [newWidth, newHeight];
 
         } else if (this.size_divisibility !== undefined) {
             // Rounds the height and width down to the closest multiple of size_divisibility

@@ -699,6 +746,7 @@ export class SegformerFeatureExtractor extends ImageFeatureExtractor {
         return toReturn;
     }
 }
+export class DPTImageProcessor extends ImageFeatureExtractor { }
 export class BitImageProcessor extends ImageFeatureExtractor { }
 export class DPTFeatureExtractor extends ImageFeatureExtractor { }
 export class GLPNFeatureExtractor extends ImageFeatureExtractor { }

@@ -1881,6 +1929,7 @@ export class AutoProcessor {
         ConvNextImageProcessor,
         SegformerFeatureExtractor,
         BitImageProcessor,
+        DPTImageProcessor,
         DPTFeatureExtractor,
         GLPNFeatureExtractor,
         BeitFeatureExtractor,
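The resize logic added above can be exercised standalone: the processor keeps the aspect ratio by picking whichever scale factor is closer to 1, then snaps both sides to multiples of `ensure_multiple_of` (14 for Depth Anything). A self-contained sketch, with `get_resize_dims` as a hypothetical wrapper around the same arithmetic (the real code reads `size` and `ensure_multiple_of` from the processor config):

```javascript
// Snap `val` to the nearest multiple of `multiple`, staying within
// [minVal, maxVal] — same logic as the helper added in src/processors.js.
function constraint_to_multiple_of(val, multiple, minVal = 0, maxVal = null) {
    let x = Math.round(val / multiple) * multiple;
    if (maxVal !== null && x > maxVal) {
        x = Math.floor(val / multiple) * multiple;
    }
    if (x < minVal) {
        x = Math.ceil(val / multiple) * multiple;
    }
    return x;
}

// Hypothetical wrapper mirroring the keep_aspect_ratio + ensure_multiple_of
// branch: scale as little as possible, then snap both sides.
function get_resize_dims(srcWidth, srcHeight, size, ensure_multiple_of) {
    let scale_height = size.height / srcHeight;
    let scale_width = size.width / srcWidth;

    // Pick the scale factor closer to 1 so the image is distorted as
    // little as possible, then apply it to both dimensions.
    if (Math.abs(1 - scale_width) < Math.abs(1 - scale_height)) {
        scale_height = scale_width; // fit width
    } else {
        scale_width = scale_height; // fit height
    }

    return [
        constraint_to_multiple_of(scale_width * srcWidth, ensure_multiple_of),
        constraint_to_multiple_of(scale_height * srcHeight, ensure_multiple_of),
    ];
}

// A 640x480 image with size {width: 518, height: 518} and
// ensure_multiple_of = 14 — the case exercised by the new processor test.
console.log(get_resize_dims(640, 480, { width: 518, height: 518 }, 14)); // → [ 686, 518 ]
```

Here the height scale (518/480 ≈ 1.079) is closer to 1 than the width scale (518/640 ≈ 0.809), so the height is fitted; 1.079 × 640 ≈ 690.7 then snaps down to 686, the nearest multiple of 14 — matching the `[1, 3, 518, 686]` pixel dims asserted in the new test below.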

tests/processors.test.js (20 additions, 0 deletions)

@@ -39,6 +39,7 @@ describe('Processors', () => {
         detr: 'facebook/detr-resnet-50',
         yolos: 'hustvl/yolos-small-300',
         dpt: 'Intel/dpt-hybrid-midas',
+        dpt_2: 'LiheYoung/depth-anything-small-hf',
         glpn: 'vinvino02/glpn-kitti',
         nougat: 'facebook/nougat-small',
         owlvit: 'google/owlvit-base-patch32',

@@ -407,6 +408,25 @@ describe('Processors', () => {
             compare(reshaped_input_sizes, [[224, 224]]);
         }
     }, MAX_TEST_EXECUTION_TIME);
+
+    // DPTImageProcessor
+    //  - tests ensure_multiple_of
+    //  - tests keep_aspect_ratio
+    it(MODELS.dpt_2, async () => {
+        const processor = await AutoProcessor.from_pretrained(m(MODELS.dpt_2))
+
+        {
+            const image = await load_image(TEST_IMAGES.cats);
+            const { pixel_values, original_sizes, reshaped_input_sizes } = await processor(image);
+
+            compare(pixel_values.dims, [1, 3, 518, 686]);
+            compare(avg(pixel_values.data), 0.30337387323379517);
+
+            compare(original_sizes, [[480, 640]]);
+            compare(reshaped_input_sizes, [[518, 686]]);
+        }
+    }, MAX_TEST_EXECUTION_TIME);
+
 });

 describe('Audio processors', () => {

0 commit comments