This repository contains the source code, models, and data files for the paper "Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks" (accepted at WACV 2020).
Please visit our project page for more details: https://sidgairo18.github.io/style
* Python3
* PyTorch (and other dependencies for PyTorch)
* Numpy
* OpenCV 3.3.0
* Visdom Line Plotter
* tqdm
* cuDNN / CUDA (for training on GPU)
These are all easily installable via pip, e.g., `pip install numpy`. Any reasonably recent version of these packages should work. It is recommended to use a Python virtual environment to set up the dependencies and the code.
Feature Extraction
For feature extraction using a pre-trained VGG network and PCA reduction, use the following repo.
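As a rough illustration of that step (not the linked repo's exact code), the sketch below extracts penultimate-layer features from a torchvision VGG-16 and reduces them with PCA; the layer choice and target dimensionality are assumptions:

```python
# Minimal sketch: pre-trained VGG-16 features followed by PCA reduction.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg16(pretrained=True).to(device).eval()
# Keep the classifier up to the penultimate FC layer (4096-d output).
fc_head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    h = torch.flatten(vgg.avgpool(vgg.features(x)), 1)  # conv trunk -> (1, 25088)
    return fc_head(h).squeeze(0).cpu().numpy()          # FC features -> (4096,)

# Hypothetical dataset paths, following the file format described below.
image_paths = ["images/class1/sample.jpg", "images/class2/sample.jpg"]
feats = np.stack([extract(p) for p in image_paths])
# Over a full dataset you would reduce to e.g. 512 dims; n_components may
# not exceed the number of samples, hence the min() in this toy example.
feats_reduced = PCA(n_components=min(512, len(feats))).fit_transform(feats)
```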
Clustering

Feature extraction is followed by KMeans clustering. The optimal number of clusters for each dataset is determined using the elbow method.
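A minimal sketch of this step, assuming `features` is the reduced feature matrix from the extraction step; the candidate range of k and the chosen elbow are illustrative:

```python
# Minimal sketch: KMeans over extracted features, with the elbow method
# (inertia vs. k) used to pick the cluster count for a dataset.
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(features, k_values):
    """Fit KMeans for each candidate k and return the inertia curve."""
    return [KMeans(n_clusters=k, random_state=0).fit(features).inertia_
            for k in k_values]

features = np.random.rand(1000, 512)  # stand-in for the real feature matrix
ks = range(2, 21)                     # candidate range is an assumption
inertias = elbow_inertias(features, ks)

# Inspect the curve (e.g., plot inertia vs. k) and pick k at the "elbow".
best_k = 10                           # hypothetical elbow for illustration
cluster_ids = KMeans(n_clusters=best_k, random_state=0).fit_predict(features)
# These cluster IDs become the class labels used in Stage 1 below.
```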
Stage 1

- We train a CNN augmented with a 256-dimensional bottleneck layer.
- Training proceeds for 30 epochs, minimizing a cross-entropy loss for multi-class classification.
- After 30 epochs the weights are saved and used later in Stage 2.
- During this stage, we simply use the cluster ID for each image as its class label.
- Hyperparameters: lr = 0.001, Adam optimizer, categorical cross-entropy loss (chosen empirically).
- The Python script for this stage is `classification_net.py` (a minimal sketch of the setup follows this list).
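A minimal sketch of the Stage 1 setup, assuming a VGG-16 backbone and an illustrative cluster count (the actual architecture lives in `classification_net.py`):

```python
# Minimal sketch of Stage 1: a CNN with a 256-d bottleneck trained to
# predict KMeans cluster IDs. The VGG-16 backbone and num_clusters=10
# are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class BottleneckClassifier(nn.Module):
    def __init__(self, num_clusters):
        super().__init__()
        backbone = models.vgg16(pretrained=True)
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        self.bottleneck = nn.Linear(512 * 7 * 7, 256)   # 256-d bottleneck layer
        self.classifier = nn.Linear(256, num_clusters)  # cluster-ID head

    def forward(self, x):
        h = torch.flatten(self.avgpool(self.features(x)), 1)
        emb = self.bottleneck(h)             # 256-d style embedding
        return self.classifier(emb), emb

model = BottleneckClassifier(num_clusters=10)  # k from the elbow step
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()              # categorical cross-entropy
# ...train for 30 epochs on (image, cluster_id) pairs, then save the
# weights for Stage 2, e.g., torch.save(model.state_dict(), "stage1.pth")
```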
Stage 2

- Stage 2 of the pipeline requires training a triplet ConvNet with a triplet loss (MarginRanking loss).
- This requires an anchor image, a positive sample, and a negative sample. (How these images are sampled is explained in Section 3.1.2 of the paper.)
- We train this triplet network for 50 epochs (Hyperparameters: lr = 0.001, SGD optimizer, MarginRanking Loss).
- The model weights from Stage 1 are loaded before Stage 2 training starts.
- The Python script for this part is `train.py`.
- For more information on the triplet network and embedding networks, take a look at the `networks.py` and `triplet_network.py` files; a minimal sketch of the triplet objective follows below.
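A minimal sketch of the Stage 2 objective using PyTorch's `MarginRankingLoss`; the stand-in embedder, margin, and batch shapes are illustrative assumptions (see `triplet_network.py` for the actual implementation):

```python
# Minimal sketch of the triplet objective via MarginRankingLoss.
# The tiny linear embedder, margin value, and random batch are stand-ins.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))  # stand-in embedder
criterion = nn.MarginRankingLoss(margin=1.0)  # margin is an assumed value
optimizer = torch.optim.SGD(embed.parameters(), lr=0.001)

anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
ea, ep, en = embed(anchor), embed(positive), embed(negative)
d_pos = torch.norm(ea - ep, p=2, dim=1)  # anchor-positive distance
d_neg = torch.norm(ea - en, p=2, dim=1)  # anchor-negative distance

# target = 1 asks that d_neg exceed d_pos by at least the margin.
target = torch.ones_like(d_pos)
loss = criterion(d_neg, d_pos, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```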
Note 1: The chosen bottleneck layer has 256 dimensions (experiments showed that using 256 dimensions instead of 128 makes little difference in performance).
Note 2: The parameters in the code may differ slightly from those mentioned in the paper, but are sufficient to reproduce the reported results (this is based on my unofficial implementation of the work and its public code).
- For Stage 1, run `python classification_net.py`
- For Stage 2, run `python train.py`
- For details on the data-loader and data text files see next section.
- The `classification_dataloader` expects 2 files: `filenames_filename` and `labels_filename`.
  - `filenames_filename` => A text file with each line containing a path to an image, e.g., `images/class1/sample.jpg`
  - `labels_filename` => A text file with each line containing 1 integer, the label index of the image.
- Similarly, the `triplet_dataloader` expects 2 files: `filenames_filename` and `triplets_filename`.
  - `filenames_filename` => A text file with each line containing a path to an image, e.g., `images/class1/sample.jpg`
  - `triplets_filename` => A text file with each line containing 3 integers, where integer i refers to the i-th image in `filenames_filename`. For a line with integers "a b c", the triplet is defined such that image a is more similar to image c than it is to image b.
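A minimal sketch of a PyTorch dataset that consumes files in this format (the repo's own `triplet_dataloader` may differ; the class name here is hypothetical):

```python
# Minimal sketch (hypothetical class, not the repo's triplet_dataloader):
# reads filenames_filename / triplets_filename in the format described above.
from PIL import Image
from torch.utils.data import Dataset

class TripletFileDataset(Dataset):
    def __init__(self, filenames_filename, triplets_filename, transform=None):
        with open(filenames_filename) as f:
            self.paths = [line.strip() for line in f]
        with open(triplets_filename) as f:
            # Each line "a b c": image a is more similar to image c than to b,
            # so a is the anchor, c the positive, and b the negative.
            self.triplets = [tuple(map(int, line.split())) for line in f]
        self.transform = transform

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, idx):
        a, b, c = self.triplets[idx]
        imgs = [Image.open(self.paths[i]).convert("RGB") for i in (a, c, b)]
        if self.transform is not None:
            imgs = [self.transform(im) for im in imgs]
        return tuple(imgs)  # (anchor, positive, negative)
```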
For information on the datasets and splits used, please refer to Section 4 of the paper and the supplementary material.
The datasets used are:
- BAM: Behance Artistic Media dataset. We use a subset of BAM with 121K images (sampled similarly to the Behance-Net-TT 110K set used in prior work), balanced across media and emotional styles, with a Train:Val:Test split of 80:5:15.
- AVA Style dataset: Train:Val:Test split 85:5:10
- Flickr: Train:Val:Test split 60:20:20
- Wikipaintings: Train:Val:Test split 85:5:10
- DeviantArt: Train:Val:Test split 85:5:10
- WallArt
- 1: To request access to the dataset, please visit the BAM website here.
- 2, 3, 4 can be downloaded from here.
- 5, 6 have not been released publicly yet due to licensing issues, but they can be easily recreated as described in the paper and accessed at the respective websites.
t-SNE for feature visualizations
- The dataset images themselves are not included in this repository.
- The text files in the data folder are just for reference. They may vary according to your own data files.
- To request access to the dataset, please visit the BAM website and refer to the notes under the Datasets section.
- Feel free to use this code for your own work, but please cite the paper if you use it in published work.
- In case of any bugs or errors, please be gracious enough to report an issue on this repo.
@InProceedings{Gairola_2020_WACV,
author = {Gairola, Siddhartha and Shah, Rajvi and Narayanan, P. J.},
title = {Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}
}
We distribute the source code under the MIT License.