Paper: arXiv
ML Code Completeness Checklist:
- Specification of dependencies
- Training code
- Evaluation code
- Pre-trained models
- README file including table of results accompanied by precise commands to run/produce those results
Dependencies are listed in requirements.txt.
To install them:
$ pip install -r requirements.txt 
- Download the raw VQA 2.0 dataset from the official website.
Make sure that your data directory looks similar to the structure shown below (to use a different structure, change the paths in train.py).
- These instructions are from the LXMERT repo. Download the re-distributed JSON files:
mkdir -p data/vqa
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P data/vqa/
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/
For downloading FasterRCNN features, use these instructions:
mkdir -p data/mscoco_imgfeat
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip
If the links don't work, you can use the Google Drive link to get access. For more details, please refer to the LXMERT repo.
Set up the directory structure like this:
In /home/user/
+-- data
|   +-- lxmert
|   +-- mscoco_imgfeat
|   +-- vqa
+-- adaptive_transformer
+-- snap
.......
Create a directory named snap; checkpoints are stored there by default.
All of this structure can be changed but suitable modifications will be needed in train.py.
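Before starting a long run, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of this repo: the helper names `missing_dirs` and `create_snap` are hypothetical, and the root directory is assumed to be whatever folder holds data/, adaptive_transformer/ and snap/.

```python
from pathlib import Path

# Directory names copied from the tree above. The helpers are illustrative
# only and are not provided by train.py.
EXPECTED_DIRS = [
    "data/lxmert",
    "data/mscoco_imgfeat",
    "data/vqa",
    "adaptive_transformer",
    "snap",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def create_snap(root):
    """Create the snap/ checkpoint directory if it is not there yet."""
    snap = Path(root) / "snap"
    snap.mkdir(parents=True, exist_ok=True)
    return snap
```

Calling `missing_dirs("/home/user")` returns an empty list once everything is in place. If you changed the paths in train.py, adjust `EXPECTED_DIRS` accordingly.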
FasterRCNN features are loaded into RAM all at once, so you will need an instance with more than 48 GB of RAM. For training, I used a single NVIDIA P100 GPU.
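To see where the RAM requirement comes from, here is a rough back-of-the-envelope estimate. It rests on assumptions that are not stated in this README: 36 boxes per image (the "obj36" in the file names), 2048-dimensional float32 features per box, and the standard COCO split sizes of 82,783 (train2014) and 40,504 (val2014) images.

```python
# All constants below are assumptions made for this estimate.
BOXES_PER_IMAGE = 36        # "obj36" in the feature file names
FEATURE_DIM = 2048          # assumed size of each box feature vector
BYTES_PER_FLOAT32 = 4
NUM_IMAGES = {"train2014": 82_783, "val2014": 40_504}


def feature_bytes(num_images):
    """Bytes needed to hold the raw box features of num_images images."""
    return num_images * BOXES_PER_IMAGE * FEATURE_DIM * BYTES_PER_FLOAT32


def total_gib():
    """Total size of the train2014 and val2014 features in GiB."""
    total = sum(feature_bytes(n) for n in NUM_IMAGES.values())
    return total / 2**30
```

Under these assumptions the raw features alone come to roughly 34 GiB, before box coordinates, Python object overhead, and the model itself are counted, which is consistent with the recommendation above.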
Please download the pretrained models from this Google Drive link.
Alternatively, if you want to train (finetune) the model yourself, download the pretrained weights from here. Skip this step if you're using my weights.
$ git clone https://github.com/prajjwal1/adaptive_transformer
$ cd adaptive_transformer
$ python3 train.py --bs=128 --epochs=1 --sparse --tiny #test script
If this worked well, then you're ready to train.
Usage:
python train.py
    [--bs]            # Specify the batch size
    [--epochs]        # Specify the epochs
    [--tiny]          # Runs a test example (for debugging purposes)  
    [--adaptive]      # Uses Adaptive Attention Span
    [--sparse]        # Uses Entmax from Adaptively Sparse Transformers instead of softmax
    [--layerdrop]     # Enables layerdrop
    [--load_model]    # Resume training by specifying a checkpoint
    [--test]          # Dumps a JSON file for submission to VQA servers.
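The flags above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not a copy of the parser in train.py: the types, the `None` defaults, and the `build_parser` name are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in the usage text."""
    parser = argparse.ArgumentParser(description="Sketch of the train.py flags")
    parser.add_argument("--bs", type=int, default=None, help="batch size")
    parser.add_argument("--epochs", type=int, default=None, help="number of epochs")
    parser.add_argument("--tiny", action="store_true", help="run a small test example")
    parser.add_argument("--adaptive", action="store_true", help="use Adaptive Attention Span")
    parser.add_argument("--sparse", action="store_true", help="use entmax instead of softmax")
    parser.add_argument("--layerdrop", action="store_true", help="enable layerdrop")
    parser.add_argument("--load_model", type=str, default=None, help="checkpoint to resume from")
    parser.add_argument("--test", action="store_true", help="dump a JSON file for the VQA servers")
    return parser
```

For example, `build_parser().parse_args(["--bs=128", "--epochs=1", "--sparse", "--tiny"])` yields a namespace with `bs=128`, `sparse=True` and `adaptive=False`.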
Further customization is possible by modifying the params and config dicts in train.py.
They look like this:
params = {
    "adapt_span_enabled": args.adaptive,
    "attn_span": 1024,
    "adapt_span_loss_coeff": 0.000005,
    "adapt_span_ramp": 32,
    "adapt_span_init": 0.002,
    "adapt_span_cache": True,
    "nb_heads": 12,
    "bs": args.bs,
    "mask_size": [20, 36],
    "sparse_enabled": args.sparse,
    "num_attention_heads": 4,
    "layer_sizes": {"lang": 9, "cross": 5, "vision": 5},
    "from_scratch": False,
    "layerdrop_enabled": args.layerdrop,
    "layerdrop_num_layers": 1,
}
config = {
    "adaptive_enable": args.adaptive,
    "sparse_enable": args.sparse,
    "measure_flops": False,
    "load_model": args.load_model,
}
Please check the params dict when starting training to see the configuration in use. It should match the configuration that was used to train the model you load.
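One way to catch a mismatch early is to compare the keys that change the architecture before loading a checkpoint. The helper below is a hypothetical sketch: this README does not say whether checkpoints store their params, so `saved` is assumed to be a dict you recorded yourself at training time.

```python
# Keys from the params dict above that affect which layers and attention
# variants are built. The selection is an assumption of this sketch.
ARCHITECTURE_KEYS = (
    "adapt_span_enabled",
    "sparse_enabled",
    "layerdrop_enabled",
    "layer_sizes",
)


def config_mismatches(current, saved, keys=ARCHITECTURE_KEYS):
    """Return {key: (current_value, saved_value)} for every differing key."""
    return {
        key: (current.get(key), saved.get(key))
        for key in keys
        if current.get(key) != saved.get(key)
    }
```

An empty result means the two configurations agree on these keys; anything else lists exactly what differs.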
Remove the tiny flag to train on the whole dataset.
python train.py --bs=128 --epochs=1 --adaptive --tiny
By default, the attention spans of each layer are printed so that you can track them.
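For intuition about what a learned span means, the Adaptive Attention Span paper masks attention weights with a soft ramp, m_z(x) = min(max((R + z - x) / R, 0), 1), where x is the distance to the attended position, z the learned span, and R the ramp length. The sketch below implements only that formula in plain Python. It is not this repo's implementation, and mapping R to `adapt_span_ramp` is an assumption based on the parameter name.

```python
def soft_mask(distance, span, ramp):
    """Soft masking function m_z(x) = min(max((R + z - x) / R, 0), 1)."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def mask_row(span, ramp, max_span):
    """Mask values for every distance 0 .. max_span - 1."""
    return [soft_mask(x, span, ramp) for x in range(max_span)]
```

With `span=10` and `ramp=32`, positions up to distance 10 keep their full weight, the mask then decays linearly, and everything at distance 42 or more is masked out.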
If the sparse flag is enabled, softmax is replaced with entmax to compute the probability distribution of the attention weights.
python train.py --bs=128 --epochs=1 --sparse --tiny
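To illustrate the difference from softmax, the sketch below compares softmax with sparsemax, which is the alpha = 2 member of the entmax family and has a simple closed form. It is used here purely for illustration: the repo relies on the entmax implementation credited at the end of this README, and the alpha value it uses may differ.

```python
import math


def softmax(scores):
    """Standard softmax: every output is strictly positive."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2): low scores get exactly zero."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # Position j stays in the support while 1 + j * z_j > sum of top-j.
        if 1 + j * zj > cumsum:
            tau = (cumsum - 1) / j
    return [max(s - tau, 0.0) for s in scores]
```

Both functions return a distribution that sums to 1, but sparsemax can assign exactly zero probability to low-scoring positions, which is what makes sparse attention maps easier to inspect.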
python train.py --bs=128 --epochs=1 --layerdrop --tiny
Specify the following as per use case in train.py:
- params['layerdrop_num_layers']  # Number of layers to drop
- params['layer_sizes']  # Number of layers you require
NOTE: The number of layers in params['layer_sizes'] has to match the number of layers in the model checkpoint. The default learn.load method is not suitable for pruning during inference, because it loads all the layers. Please refer to this fairseq issue to perform pruning during inference.
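The idea of skipping layers during training can be sketched as follows. This is a simplified illustration and an assumption about how `layerdrop_num_layers` might be used: a fixed number of randomly chosen layers is skipped on each training pass, whereas the original LayerDrop method drops each layer independently with some probability. The function name and signature are hypothetical.

```python
import random


def forward_with_layerdrop(layers, x, num_drop, training=True, rng=random):
    """Apply layers in order, skipping num_drop random ones while training."""
    if training and num_drop > 0:
        dropped = set(rng.sample(range(len(layers)), num_drop))
    else:
        dropped = set()  # evaluation uses every layer
    for index, layer in enumerate(layers):
        if index not in dropped:
            x = layer(x)
    return x
```

Here `layers` is any list of callables, so the same pattern applies to a stack of transformer blocks.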
To load a model trained with adaptive or sparse or layerdrop flag:
python train.py --bs=128 --epochs=1 --adaptive --tiny --load_model=adaptive_6910
python train.py --bs=128 --epochs=1 --sparse --tiny --load_model=sparse_7
python train.py --bs=128 --epochs=1 --layerdrop --load_model=layerdrop_1066_ldrop_1 --tiny
python train.py --bs=128 --test --adaptive --load_model=adaptive_6910
When the test flag is passed, only inference is performed on the test set. Ground truths for the VQA test set are not publicly available. This command dumps a JSON file in the /snap directory. Submit the JSON file through the EvalAI competition page.
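The VQA evaluation servers expect a JSON list with one `question_id` and `answer` entry per question. The sketch below shows that format; the helper names and the output path are hypothetical, and the file written by train.py may be produced differently.

```python
import json


def to_submission(predictions):
    """Convert {question_id: answer} into the list format used for VQA results."""
    return [
        {"question_id": int(qid), "answer": str(answer)}
        for qid, answer in predictions.items()
    ]


def dump_submission(predictions, path):
    """Write the predictions to path as a JSON list."""
    with open(path, "w") as f:
        json.dump(to_submission(predictions), f)
```

For example, `dump_submission({1: "yes", 2: "2"}, "snap/test_predict.json")` would write a two-entry list, where the file name is only a placeholder.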
- dataset: Contains the standard PyTorch dataset class for VQA.
- models: Contains the implementation of the adaptive mechanisms and LXMERT.
- nbs: Probably the most interesting part. I used these notebooks to develop this codebase. Use them to understand my workflow, the attention methods I used, and how attention works in this context.
- optimizers: Implementation of the LAMB and Lookahead optimizers.
- pretrain: Utility tools.
- train.py: Specifies how training and testing are carried out. You'd probably want to modify this to adapt it to your work.
- learner.py: Implements a Learner class to control all functionalities of this codebase.
- run_train.sh: You can modify this to set up hardware-specific training (optional).
- run_test.sh: Set of tests (for me).
Please refer to nbs/inference.ipynb to load your trained model, obtain predictions and visualize the results.
These results can be reproduced by using the scripts I provided above and using the same params and config dict values.
Our model achieves the following performance on the VQA 2.0 benchmark:
| Model                                 | test-dev | test-std |
|---------------------------------------|----------|----------|
| LXMERT                                |          |          |
| w/ softmax                            | 72.42    | 72.54    |
| w/ Adaptive Attention Span            | 71.62    | 71.72    |
| w/ Adaptive Sparse                    | 71.73    | 71.97    |
| w/ Layerdrop (10-6-6, p=1)            | 66.4     | 66.72    |
| w/ Layerdrop (10-6-6, p=0)            | 66.35    | 66.57    |
| w/ Layerdrop (9-5-5, p=1)             | 66.51    | 66.81    |
| w/ Adaptive Attention Span and Entmax | 63.07    | 63.33    |
If you use this work in any form, please cite the paper:
@inproceedings{bhargava-2020-adaptive,
    title = "Adaptive Transformers for Learning Multimodal Representations",
    author = "Bhargava, Prajjwal",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-srw.1",
    doi = "10.18653/v1/2020.acl-srw.1",
    pages = "1--7",
    abstract = "The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study attention spans, sparse, and structured dropout methods to help understand how their attention mechanism extends for vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.",
}
- Code for the LXMERT model was adapted from the LXMERT repo.
- The entmax autograd function implementation was adapted from the entmax repo.