Improve DinoV2 performance for classification tasks #4288

@kprokofi

Description

We have an existing implementation of DinoV2 (https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algo/classification/backbones/vision_transformer.py) in our repository, but it currently underperforms compared to vanilla CNNs on classification tasks, in both accuracy and efficiency. This issue aims to investigate and improve the fine-tuning process of DinoV2, focusing specifically on fine-tuning only the linear classification head. Full fine-tuning of the model can also be considered as a baseline for comparison.
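The "linear head only" setup amounts to freezing every backbone parameter and training just a linear classifier on top of the embeddings. A minimal sketch of that wiring is below; the `FrozenBackboneClassifier` class and the toy backbone are hypothetical stand-ins (the real OTX backbone lives in `vision_transformer.py`), kept only so the shapes and freezing logic are concrete:

```python
import torch
import torch.nn as nn


class FrozenBackboneClassifier(nn.Module):
    """Linear probing: frozen feature extractor + trainable linear head."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter so only the head receives gradients.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No backbone gradients are needed during linear probing.
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)


# Toy stand-in backbone emitting 384-dim embeddings
# (384 matches the DinoV2 ViT-S embedding dimension).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 384))
model = FrozenBackboneClassifier(backbone, embed_dim=384, num_classes=10)

# Only the head's weight and bias should remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Besides reducing compute per step, freezing the backbone also lets embeddings be precomputed once per dataset, which can make the head-only experiments in the tasks below much cheaper to iterate on.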

Tasks:

  1. Analyze Current Performance:
  • Evaluate the accuracy and efficiency of the current DinoV2 implementation.
  • Compare results with a "standard" CNN baseline (consider models already implemented in OTX, such as MobileNetV3, EfficientNet-B0, EfficientNetV2-s, etc.).
  2. Optimize Backbone and Linear Head Fine-Tuning:
  • Identify potential improvements in the fine-tuning setup (e.g., learning rate, optimizer, data augmentation, loss function, input size and embedding dimension, etc.).
  • Implement changes while keeping them compatible with the current repository design.
  3. Consider Full Fine-Tuning (Baseline):
  • Experiment with full fine-tuning of DinoV2 as a baseline to compare performance gains.
  4. Benchmark and Report Results:
  • Measure improvements in accuracy and efficiency. At least 4-5 diverse datasets with different numbers of images should be considered.
  • Provide a recipe (config) for an end-to-end working solution (from torch fine-tuning to an OpenVINO IR model).
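For the full fine-tuning baseline in task 3, one common starting point (a sketch of standard practice, not the OTX recipe) is discriminative learning rates: a small learning rate for the pretrained backbone and a larger one for the randomly initialized head. The tiny `backbone`/`head` modules and the learning-rate values here are illustrative assumptions only:

```python
import torch
import torch.nn as nn

# Hypothetical two-part model: pretrained backbone + fresh linear head.
backbone = nn.Linear(384, 384)
head = nn.Linear(384, 10)

# Lower learning rate for pretrained backbone weights, higher for the
# randomly initialized head; both groups share the same weight decay.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,
)
```

Per-group learning rates like this are one of the tuning knobs listed in task 2 and are straightforward to express in a recipe config, so the same mechanism can cover both the head-only and full fine-tuning experiments.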
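For the efficiency half of task 4, a simple and reproducible measurement is mean wall-clock inference latency in eager mode. The helper below is a sketch (function name, warmup/iteration counts, and the tiny demo model are all assumptions); the same harness can be pointed at the DinoV2 model and each CNN baseline for a like-for-like comparison:

```python
import time

import torch
import torch.nn as nn


def measure_latency(model: nn.Module, input_shape=(1, 3, 224, 224),
                    warmup: int = 3, iters: int = 10) -> float:
    """Return mean per-batch inference latency in milliseconds (CPU, eager)."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        # Warm up to exclude one-off costs (lazy init, allocator growth).
        for _ in range(warmup):
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0


# Tiny stand-in model just to demonstrate the harness.
latency_ms = measure_latency(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)))
```

For the OpenVINO IR side of the recipe, the same protocol (fixed input shape, warmup, averaged timing) should be applied to the converted model so the torch and IR numbers are comparable.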
