This project was done as part of the course Deep Learning, Advanced Course (DD2412) at KTH.
The aim of the project is to replicate the results of the paper "Masked Autoencoders Are Scalable Vision Learners".
Our approach involved constructing an asymmetric encoder-decoder architecture with a ViT-B/16 encoder and a lightweight decoder. After self-supervised pre-training, we evaluated the learned representations by training a linear classifier on frozen features (linear probing).
To keep the replication feasible within our limited computation resources, we restricted our experiments to the Imagenette dataset.
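The key step in MAE pre-training is masking a large fraction (75% in the paper) of image patches so that only the visible patches enter the encoder. A minimal NumPy sketch of this random-masking step, with all names our own (not the paper's reference code), might look like:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a small visible subset of patches.

    patches: (num_patches, dim) array of flattened image patches.
    Returns the visible patches, a binary mask (1 = masked), and the
    permutation needed to restore the original patch order.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    # Shuffle patch indices; the first n_keep stay visible.
    ids_shuffle = rng.permutation(n)
    ids_keep = ids_shuffle[:n_keep]
    visible = patches[ids_keep]
    mask = np.ones(n)
    mask[ids_keep] = 0  # 0 = visible, 1 = masked
    # ids_restore maps shuffled positions back to the original order;
    # the decoder uses it to re-insert mask tokens at the right places.
    ids_restore = np.argsort(ids_shuffle)
    return visible, mask, ids_restore

# A 224x224 image with 16x16 patches yields 196 patches (ViT-B/16).
patches = np.random.randn(196, 768)
visible, mask, ids_restore = random_masking(patches)
print(visible.shape)  # (49, 768): only 25% of patches enter the encoder
```

Because the encoder processes only the visible 25% of patches, pre-training is much cheaper than running a full ViT over every patch, which is what makes the asymmetric design scalable.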
Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/pdf/2111.06377.pdf