Description
We have an existing implementation of DinoV2 (https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algo/classification/backbones/vision_transformer.py) in our repository, but it currently underperforms vanilla CNNs on classification tasks in both accuracy and efficiency. This issue aims to investigate and improve the fine-tuning process of DinoV2, focusing specifically on fine-tuning only the linear classification head. Full fine-tuning of the model can also be considered as a baseline for comparison.
Tasks:
- Analyze Current Performance:
- Evaluate the accuracy and efficiency of the current DinoV2 implementation.
- Compare results with a "standard" CNN baseline (consider models already implemented in OTX, such as MobileNetV3, EfficientNet-B0, EfficientNetV2-S, etc.).
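For the efficiency side of this comparison, a minimal latency probe could look like the sketch below. The toy model and all sizes are stand-ins (assumptions, not OTX code); in the real comparison the measured models would be the OTX DinoV2 classifier and the CNN baselines, and accuracy would be measured separately on the validation sets.

```python
import statistics
import time

import torch
from torch import nn


@torch.no_grad()
def median_latency_ms(model: nn.Module,
                      input_shape=(1, 3, 224, 224),
                      runs: int = 20) -> float:
    """Median single-image forward-pass latency on CPU, in milliseconds."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(3):  # warm-up iterations, excluded from timing
        model(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times)


# Toy stand-in model; replace with the actual DinoV2 / CNN classifiers.
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
latency = median_latency_ms(toy)
print(f"{latency:.2f} ms")
```

The same harness can be run over each candidate model to produce a comparable latency column in the benchmark table.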
- Optimize Backbone and Linear Head Fine-Tuning:
- Identify potential improvements in the fine-tuning setup (e.g., learning rate, optimizer, data augmentation, loss function, input size and embedding dimension, etc.).
- Implement changes while keeping them compatible with the current repository design.
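The linear-probing setup above can be sketched as follows. The backbone here is a tiny stand-in module and the dimensions are illustrative assumptions; in OTX the backbone would be the DinoV2 ViT from `vision_transformer.py`, and the wiring would follow the existing model classes rather than this hypothetical `LinearProbe`.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Frozen backbone + trainable linear classification head (sketch)."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze all backbone weights
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():  # no backbone gradients are needed
            feats = self.backbone(x)
        return self.head(feats)


# Stand-in backbone producing 384-dim features (DinoV2-S embedding size).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 384))
model = LinearProbe(backbone, embed_dim=384, num_classes=10)

# Only the head's parameters reach the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, weight_decay=1e-4,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 384 * 10 weights + 10 biases = 3850
```

With only the head trainable, the backbone's forward pass can also be cached per image, which makes sweeps over learning rate, loss function, or augmentation much cheaper.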
- Consider Full Fine-Tuning (Baseline):
- Experiment with full fine-tuning of DinoV2 as a baseline to compare performance gains.
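One common recipe for the full fine-tuning baseline is to train everything but give the pretrained backbone a much smaller learning rate than the freshly initialized head. The sketch below shows the parameter-group mechanics; the modules are stand-ins and the specific learning rates are assumptions, not OTX defaults.

```python
import torch
from torch import nn

# Stand-ins for the pretrained DinoV2 backbone and the new classifier head.
backbone = nn.Linear(384, 384)
head = nn.Linear(384, 10)

# Discriminative learning rates: gentle updates for pretrained weights,
# larger steps for the randomly initialized head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,
)
print([group["lr"] for group in optimizer.param_groups])
```

Layer-wise learning-rate decay across transformer blocks is a natural extension of the same parameter-group idea if the simple two-group split is not enough.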
- Benchmark and Report Results:
- Measure improvements in accuracy and efficiency on at least 4-5 diverse datasets with different numbers of images.
- Provide a recipe (config) for an end-to-end working solution (from torch fine-tuning to OpenVINO IR model export).