PyTorch starter code for the Visual Dialog Challenge 2019.
If you use this code in your research, please consider citing:
@misc{desai2018visdialch,
author = {Karan Desai and Abhishek Das and Dhruv Batra and Devi Parikh},
title = {Visual Dialog Challenge Starter Code},
howpublished = {\url{https://github.com/batra-mlp-lab/visdial-challenge-starter-pytorch}},
year = {2018}
}
If you are a returning user (from Visual Dialog Challenge 2018), here are the key highlights of v2019 of this starter code:
- Almost a complete rewrite of v2018, which increased speed, readability, modularity and extensibility.
- Multi-GPU support - try out specifying GPU ids to train/evaluate scripts as: --gpu-ids 0 1 2 3
- Docker support - we provide a Dockerfile which can help you set up all the dependencies with ease.
- Stronger baseline - our Late Fusion Encoder is equipped with Bottom-up Top-Down attention. We also provide pre-extracted image features (links below).
- Minimal pre-processed data - no requirement to download tens of pre-processed data files anymore (typically referred to as visdial_data.h5 and visdial_params.json).
This starter code is implemented using PyTorch v1.0, and provides out-of-the-box support for CUDA 9 and cuDNN 7. There are two recommended ways to set up this codebase: Anaconda or Miniconda, and Docker.
- Install the Anaconda or Miniconda distribution based on Python 3+ from their downloads site.
- Clone this repository and create an environment:
git clone https://www.github.com/batra-mlp-lab/visdial-challenge-starter-pytorch
conda create -n visdialch python=3.6
# activate the environment and install all dependencies
conda activate visdialch
cd visdial-challenge-starter-pytorch/
pip install -r requirements.txt
# install this codebase as a package in development version
python setup.py develop
Note: Docker setup is necessary if you wish to extract image features using Detectron.
We provide a Dockerfile which creates a light-weight image with all the dependencies installed.
- Install nvidia-docker, which enables usage of GPUs from inside a container.
- Build the image as:
cd docker
docker build -t visdialch .
- Run this image in a container by setting user+group, attaching project root (this codebase) as a volume and setting shared memory size according to your requirements (depends on the memory usage of your model).
nvidia-docker run -u $(id -u):$(id -g) \
-v $PROJECT_ROOT:/workspace \
--shm-size 16G visdialch /bin/bash
We recommend this development workflow: attaching the codebase as a volume immediately reflects source code changes inside the container environment. We also recommend keeping all the source code for data loading, models and other utilities inside the visdialch directory. Since it is a setuptools-style package, it makes handling of absolute/relative imports and module resolution less painful. Scripts using visdialch can be created anywhere in the filesystem, as long as the current conda environment is active.
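One quick way to confirm that the development install is visible from an arbitrary working directory is to ask Python whether it can locate the package. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of this codebase.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if a top-level package can be located on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "visdialch" is the package installed by `python setup.py develop`.
    print("visdialch importable:", is_importable("visdialch"))
```

If this prints False, the most likely cause is that the conda environment in which the package was installed is not the active one.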
- Download the VisDial v1.0 dialog json files from here and keep them under the $PROJECT_ROOT/data directory, for default arguments to work effectively.
- Get the word counts for VisDial v1.0 train split here. They are used to build the vocabulary.
- We also provide pre-extracted image features of VisDial v1.0 images, using a Faster-RCNN pre-trained on Visual Genome. If you wish to extract your own image features, skip this step and download VisDial v1.0 images from here instead. Extracted features for v1.0 train, val and test are available for download at these links.
  - features_faster_rcnn_x101_train.h5: Bottom-up features of 36 proposals from images of train split.
  - features_faster_rcnn_x101_val.h5: Bottom-up features of 36 proposals from images of val split.
  - features_faster_rcnn_x101_test.h5: Bottom-up features of 36 proposals from images of test split.
- We also provide pre-extracted FC7 features from VGG16, although v2019 of this codebase does not use them anymore.
  - features_vgg16_fc7_train.h5: VGG16 FC7 features from images of train split.
  - features_vgg16_fc7_val.h5: VGG16 FC7 features from images of val split.
  - features_vgg16_fc7_test.h5: VGG16 FC7 features from images of test split.
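To illustrate how word counts can be turned into a vocabulary, here is a minimal sketch. It assumes the word counts file is a JSON object mapping each word to its count, and it uses a hypothetical frequency threshold (`min_count`) and hypothetical special tokens (`[PAD]`, `[UNK]`); the actual Vocabulary class in visdialch may differ in all of these.

```python
from typing import Dict, List

# Hypothetical special tokens; the names used by the codebase may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]


def build_vocabulary(word_counts: Dict[str, int], min_count: int = 5) -> Dict[str, int]:
    """Map each sufficiently frequent word to an integer index."""
    word2index = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    # Sort by descending count, then alphabetically, for a deterministic order.
    for word, count in sorted(word_counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_count and word not in word2index:
            word2index[word] = len(word2index)
    return word2index


def to_indices(words: List[str], word2index: Dict[str, int]) -> List[int]:
    """Convert tokens to indices, falling back to the unknown-word token."""
    unk = word2index["[UNK]"]
    return [word2index.get(word, unk) for word in words]


# Toy counts standing in for the downloaded file (assumed word -> count mapping).
counts = {"is": 120, "the": 300, "zebra": 2, "dog": 40}
vocab = build_vocabulary(counts, min_count=5)
print(to_indices(["the", "zebra", "is"], vocab))
```

Words below the threshold (here "zebra") are mapped to the unknown-word index, which keeps the vocabulary small and avoids learning embeddings for very rare words.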
This codebase supports both generative and discriminative decoding; read more here. For reference, we provide the Late Fusion Encoder from the Visual Dialog paper.
We provide a training script which accepts arguments as config files. The config file should contain arguments which are specific to a particular experiment, such as those defining model architecture, or optimization hyperparameters. Other arguments such as GPU ids, or number of CPU workers should be declared in the script and passed in as argparse-style arguments.
Train the baseline model provided in this repository as:
python train.py --config-yml configs/lf_disc_faster_rcnn_x101.yml --gpu-ids 0 1 # provide more ids for multi-GPU execution other args...
To extend this starter code, add your own encoder/decoder modules into their respective directories and include their names as choices in your config file. We have an --overfit flag, which can be useful for rapid debugging. It takes a batch of 5 examples and overfits the model on them.
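The split between config-file arguments and script-level arguments can be sketched with argparse as follows. This is an illustration only, not the parser from train.py: the flag names come from this README, while the default values and the helper names `build_parser` and `parse` are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of script-level arguments.")
    # Experiment-specific settings live in the YAML file, not on the command line.
    # The default path is an assumption for illustration.
    parser.add_argument("--config-yml", default="configs/lf_disc_faster_rcnn_x101.yml")
    # One or more GPU ids, e.g. `--gpu-ids 0 1 2 3` (default is an assumption).
    parser.add_argument("--gpu-ids", nargs="+", type=int, default=[0])
    # Boolean switch for rapid debugging on a handful of examples.
    parser.add_argument("--overfit", action="store_true")
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv when `argv` is None."""
    return build_parser().parse_args(argv)


args = parse(["--config-yml", "configs/lf_disc_faster_rcnn_x101.yml", "--gpu-ids", "0", "1"])
print(args.config_yml, args.gpu_ids, args.overfit)
```

Keeping hardware-related options out of the config file means the same experiment config can be reused unchanged on machines with different numbers of GPUs.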
This script will save model checkpoints at every epoch at the path specified by --save-dirpath. Refer to visdialch/utils/checkpointing.py for more details on how checkpointing is managed.
We use Tensorboard for logging training progress. Recommended: execute tensorboard --logdir /path/to/save_dir --port 8008 and visit localhost:8008 in the browser.
Evaluation of a trained model checkpoint can be done as follows:
python evaluate.py --config-yml /path/to/config.yml --load-pthpath /path/to/checkpoint.pth --split val --gpu-ids 0
This will generate an EvalAI submission file, and report metrics from the Visual Dialog paper (Mean reciprocal rank, R@{1, 5, 10}, Mean rank), and Normalized Discounted Cumulative Gain (NDCG), introduced in the first Visual Dialog Challenge (in 2018).
The metrics reported here would be the same as those reported through EvalAI by making a submission in val phase. To generate a submission file for test-std or test-challenge phase, replace --split val with --split test.
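For intuition about the rank-based metrics named above (R@k, mean rank and mean reciprocal rank), the following sketch computes them from a list of ground-truth answer ranks. It assumes ranks are 1-indexed (rank 1 means the ground-truth answer was scored highest); the function name `sparse_metrics` is hypothetical, and this is not the Metric implementation used by evaluate.py.

```python
from typing import Dict, Sequence


def sparse_metrics(gt_ranks: Sequence[int], ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute R@k, mean rank and MRR from 1-indexed ground-truth ranks."""
    if not gt_ranks:
        raise ValueError("gt_ranks must not be empty")
    n = len(gt_ranks)
    # R@k: fraction of questions whose ground-truth answer is ranked within the top k.
    metrics = {f"r@{k}": sum(rank <= k for rank in gt_ranks) / n for k in ks}
    # Mean rank: lower is better.
    metrics["mean"] = sum(gt_ranks) / n
    # Mean reciprocal rank: higher is better.
    metrics["mrr"] = sum(1.0 / rank for rank in gt_ranks) / n
    return metrics


ranks = [1, 2, 4, 10]
print(sparse_metrics(ranks))
```

With these four toy ranks, one in four answers is ranked first (R@1 of 0.25), three in four fall within the top 5, and all fall within the top 10.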
Performance on v1.0 test-std (trained on v1.0 train + val):
Model | R@1 | R@5 | R@10 | MeanR | MRR | NDCG |
---|---|---|---|---|---|---|
lf-disc-faster-rcnn-x101 | 0.4617 | 0.7780 | 0.8730 | 4.7545 | 0.6041 | 0.5162 |
lf-gen-faster-rcnn-x101 | 0.3620 | 0.5640 | 0.6340 | 19.4458 | 0.4657 | 0.5421 |
- This starter code began as a fork of batra-mlp-lab/visdial-rl. We thank the developers for doing most of the heavy-lifting.
- The Lua-torch codebase of Visual Dialog, at batra-mlp-lab/visdial, served as an important reference while developing this codebase.
- Some documentation and design strategies of the Metric, Reader and Vocabulary classes are inspired by AllenNLP. It is not a dependency because this codebase would use too little of it in its current state.