Improve DinoV2 performance for classification tasks #4288

@kprokofi

Description

We have an existing implementation of DinoV2 (https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algo/classification/backbones/vision_transformer.py) in our repository, but it currently underperforms compared to vanilla CNNs on classification tasks, in both accuracy and efficiency. This issue aims to investigate and improve the fine-tuning process of DinoV2, focusing specifically on fine-tuning only the linear classification head. Full fine-tuning of the model can also be considered as a baseline for comparison.
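The "linear head only" setup amounts to freezing every backbone parameter and training just a linear classifier on top of the embeddings. A minimal sketch of that wiring is below; the `FrozenBackboneClassifier` class and the toy backbone are hypothetical stand-ins (the real OTX backbone lives in `vision_transformer.py`), kept only so the shapes and freezing logic are concrete:

```python
import torch
import torch.nn as nn


class FrozenBackboneClassifier(nn.Module):
    """Linear probing: frozen feature extractor + trainable linear head."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter so only the head receives gradients.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No backbone gradients are needed during linear probing.
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)


# Toy stand-in backbone emitting 384-dim embeddings
# (384 matches the DinoV2 ViT-S embedding dimension).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 384))
model = FrozenBackboneClassifier(backbone, embed_dim=384, num_classes=10)

# Only the head's weight and bias should remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Besides reducing compute per step, freezing the backbone also lets embeddings be precomputed once per dataset, which can make the head-only experiments in the tasks below much cheaper to iterate on.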

Tasks:

  1. Analyze Current Performance:
  • Evaluate the accuracy and efficiency of the current DinoV2 implementation.
  • Compare results with a "standard" CNN baseline (consider models already implemented in OTX, such as MobileNetV3, EfficientNet-B0, EfficientNetV2-s, etc.).
  2. Optimize Backbone and Linear Head Fine-Tuning:
  • Identify potential improvements in the fine-tuning setup (e.g., learning rate, optimizer, data augmentation, loss function, input size and embedding dimension, etc.).
  • Implement changes while keeping them compatible with the current repository design.
  3. Consider Full Fine-Tuning (Baseline):
  • Experiment with full fine-tuning of DinoV2 as a baseline to compare performance gains.
  4. Benchmark and Report Results:
  • Measure improvements in accuracy and efficiency. At least 4-5 diverse datasets with different numbers of images should be considered.
  • Provide a recipe (config) for an end-to-end working solution (from torch fine-tuning to an OpenVINO IR model).
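For the full fine-tuning baseline in task 3, one common starting point (a sketch of standard practice, not the OTX recipe) is discriminative learning rates: a small learning rate for the pretrained backbone and a larger one for the randomly initialized head. The tiny `backbone`/`head` modules and the learning-rate values here are illustrative assumptions only:

```python
import torch
import torch.nn as nn

# Hypothetical two-part model: pretrained backbone + fresh linear head.
backbone = nn.Linear(384, 384)
head = nn.Linear(384, 10)

# Lower learning rate for pretrained backbone weights, higher for the
# randomly initialized head; both groups share the same weight decay.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,
)
```

Per-group learning rates like this are one of the tuning knobs listed in task 2 and are straightforward to express in a recipe config, so the same mechanism can cover both the head-only and full fine-tuning experiments.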
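For the efficiency half of task 4, a simple and reproducible measurement is mean wall-clock inference latency in eager mode. The helper below is a sketch (function name, warmup/iteration counts, and the tiny demo model are all assumptions); the same harness can be pointed at the DinoV2 model and each CNN baseline for a like-for-like comparison:

```python
import time

import torch
import torch.nn as nn


def measure_latency(model: nn.Module, input_shape=(1, 3, 224, 224),
                    warmup: int = 3, iters: int = 10) -> float:
    """Return mean per-batch inference latency in milliseconds (CPU, eager)."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        # Warm up to exclude one-off costs (lazy init, allocator growth).
        for _ in range(warmup):
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0


# Tiny stand-in model just to demonstrate the harness.
latency_ms = measure_latency(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)))
```

For the OpenVINO IR side of the recipe, the same protocol (fixed input shape, warmup, averaged timing) should be applied to the converted model so the torch and IR numbers are comparable.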
