This repository contains the code and dataset for our NeurIPS'20 paper:

Learning Representations from Audio-Visual Spatial Alignment. Pedro Morgado*, Yi Li*, Nuno Vasconcelos. Advances in Neural Information Processing Systems (NeurIPS), 2020.
Requirements are listed in `environment.yml`.
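A typical setup, assuming `environment.yml` defines a conda environment (the environment name below is hypothetical; check the file for the actual one):

```bash
# Create and activate the conda environment from environment.yml.
# "avsa" is a hypothetical name; check environment.yml for the real one.
conda env create -f environment.yml
conda activate avsa
```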
YouTube IDs of the videos in the YT-360 dataset are provided in `datasets/assets/yt360/[train|test].txt`, and segment timestamps in `datasets/assets/yt360/segments.txt`.
Please use your favorite YouTube dataset downloader to download the videos (e.g., link), and split them into 10s clips.
The dataset should be stored in `data/yt360/video` and `data/yt360/audio` with filenames `{YOUTUBE_ID}-{SEGMENT_START_TIME}.{EXTENSION}`.
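The exact commands depend on your tooling; as a minimal sketch, assuming each line of `segments.txt` holds a YouTube ID and a segment start time in seconds, and assuming `.mp4` video and `.wav` audio extensions, the 10s clips could be cut with ffmpeg:

```bash
# Minimal sketch. Assumes segments.txt lines are "<YOUTUBE_ID> <SEGMENT_START_TIME>"
# and that full videos were downloaded to downloads/<YOUTUBE_ID>.mp4 (both assumptions).
mkdir -p data/yt360/video data/yt360/audio
while read -r yid start; do
  src="downloads/${yid}.mp4"
  # 10s video-only clip
  ffmpeg -ss "$start" -t 10 -i "$src" -an "data/yt360/video/${yid}-${start}.mp4"
  # matching 10s audio-only clip
  ffmpeg -ss "$start" -t 10 -i "$src" -vn "data/yt360/audio/${yid}-${start}.wav"
done < datasets/assets/yt360/segments.txt
```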
The pre-extracted segmentation maps can be downloaded from here and extracted to `data/yt360/segmentation/`.
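For example, assuming the download is a gzipped tar archive (the filename below is hypothetical):

```bash
# Hypothetical archive name; adjust to the actual download.
mkdir -p data/yt360/segmentation
tar -xzf yt360-segmentation.tar.gz -C data/yt360/segmentation/
```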
If you experience issues downloading or processing the dataset, please email the authors at {pmaravil, yil898}@eng.ucsd.edu for assistance.
The AVSA model that yields the top performance (trained from `configs/main/avsa/Cur-Loc4-TransfD2.yaml`) is available here.
`python main-video-ssl.py [--quiet] cfg`

Training configs `cfg` for the following models are provided:
- AVC training (instance discrimination): `configs/main/avsa/InstDisc.yaml`
- AVSA training: `configs/main/avsa/Cur-Loc4-TransfD2.yaml`
- AVSA training w/o curriculum: `configs/main/avsa/NoCur-Loc4-TransfD2.yaml`
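For example, to run AVSA pre-training with the curriculum (the configuration behind the released top-performing model):

```bash
python main-video-ssl.py configs/main/avsa/Cur-Loc4-TransfD2.yaml
```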
Four downstream tasks are supported: Binary audio-visual correspondence (AVC-Bin), binary audio-visual spatial alignment (AVSA-Bin), video action recognition (on UCF/HMDB), and audio-visual semantic segmentation.
`python eval-action-recg.py [--quiet] cfg model_cfg`

Evaluation configs `cfg` for the UCF and HMDB datasets are provided:
- UCF: `configs/benchmark/ucf/ucf-8at16-fold[1|2|3].yaml`
- HMDB: `configs/benchmark/hmdb/hmdb-8at16-fold[1|2|3].yaml`
`model_cfg` is the training config of the model to evaluate, e.g. `configs/main/avsa/Cur-Loc4-TransfD2.yaml` for AVSA pre-training.
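For example, to evaluate the AVSA pre-trained model on fold 1 of UCF:

```bash
python eval-action-recg.py configs/benchmark/ucf/ucf-8at16-fold1.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml
```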
`python eval-audiovisual-segm.py [--quiet] cfg model_cfg`

Evaluation configs `cfg` for three settings are provided:
- Visual segmentation: `configs/benchmark/segmentation/yt360-fpn-4crop-head-vonly.yaml`
- Visual+audio segmentation: `configs/benchmark/segmentation/yt360-fpn-4crop-head-audio.yaml`
- Visual+audio segmentation with context: `configs/benchmark/segmentation/yt360-fpn-4crop-head-audio-ctx.yaml`
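For example, to evaluate visual+audio segmentation with the AVSA pre-trained model:

```bash
python eval-audiovisual-segm.py configs/benchmark/segmentation/yt360-fpn-4crop-head-audio.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml
```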
`python eval-avc.py [--quiet] cfg model_cfg`

Evaluation configs `cfg` for two settings are provided:
- With transformer: `configs/benchmark/avc/avc-transf-[1|4]crop.yaml`
- Without transformer: `configs/benchmark/avc/avc-notransf-[1|4]crop.yaml`
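For example, to evaluate binary AVC with the transformer and 4 crops:

```bash
python eval-avc.py configs/benchmark/avc/avc-transf-4crop.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml
```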
`python eval-avsa.py [--quiet] cfg model_cfg`

Evaluation configs `cfg` for two settings are provided:
- With transformer: `configs/benchmark/avsa/avsa-transf-[1|4]crop.yaml`
- Without transformer: `configs/benchmark/avsa/avsa-notransf-[1|4]crop.yaml`
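For example, to evaluate binary AVSA with the transformer and 4 crops:

```bash
python eval-avsa.py configs/benchmark/avsa/avsa-transf-4crop.yaml configs/main/avsa/Cur-Loc4-TransfD2.yaml
```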
Please cite our work if you find it helpful for your research:
@article{morgado2020learning,
  title={Learning Representations from Audio-Visual Spatial Alignment},
  author={Morgado, Pedro and Li, Yi and Vasconcelos, Nuno},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}
This work was partially funded by NSF award IIS-1924937 and NVIDIA GPU donations. We also acknowledge and thank the use of the Nautilus platform for some of the experiments in this paper.