Code repository for the paper:
STAViS: Spatio-Temporal AudioVisual Saliency Network
Antigoni Tsiami,
Petros Koutras,
Petros Maragos
CVPR 2020
[paper][supp][arxiv][project page]
If you use this code or the trained models, please cite the following:
@InProceedings{Tsiami_2020_CVPR,
author = {Tsiami, Antigoni and Koutras, Petros and Maragos, Petros},
title = {STAViS: Spatio-Temporal AudioVisual Saliency Network},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
- Python 3 : tested with 3.5
- PyTorch : tested with versions from 0.4 to 1.4.0
- GPU support (ideally multiple) : tested with 4 NVIDIA GTX1080Ti or RTX2080Ti
- MATLAB (for computing the evaluation metrics) : tested with R2015a or newer
You can install the required Python packages with the command:
pip install -r requirements.txt --user
For the training and evaluation of the STAViS network we have employed 5 publicly available datasets with eye-tracking annotations. We encourage those interested to visit the original sources and cite the appropriate references if they use these data.
For easy reproduction of the STAViS results, we provide the extracted video frames and audio clips as well as the preprocessed ground truth saliency maps.
Please visit the project page to download the pre-trained models as well as the data and the related files.
You can also run the following script, which downloads and extracts all the required material:
bash fetch_data.sh
The commands below assume the data directory structure created by the script fetch_data.sh:
.
STAViS/
  data/
    video_frames/
      .../ (directories of datasets names)
    video_audio/
      .../ (directories of datasets names)
    annotations/
      .../ (directories of datasets names)
    fold_lists/
      *.txt (lists of datasets splits)
    pretrained_models/
      stavis_visual_only/
        visual_split1_save_60.pth
        visual_split2_save_60.pth
        visual_split3_save_60.pth
      stavis_audiovisual/
        audiovisual_split1_save_60.pth
        audiovisual_split2_save_60.pth
        audiovisual_split3_save_60.pth
      resnet-50-kinetics.pth
      soundnet8.pth
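Before training or testing, it can help to confirm that the download produced the expected layout. The following is a minimal sketch and is not part of the repository: the helper name `missing_paths` is hypothetical, and it assumes that the listed folders (including `pretrained_models/`) sit under `data/` in the project root.

```python
from pathlib import Path

# Entries taken from the directory listing; their nesting under data/ is an assumption.
EXPECTED = [
    "data/video_frames",
    "data/video_audio",
    "data/annotations",
    "data/fold_lists",
    "data/pretrained_models/stavis_visual_only",
    "data/pretrained_models/stavis_audiovisual",
    "data/pretrained_models/resnet-50-kinetics.pth",
    "data/pretrained_models/soundnet8.pth",
]


def missing_paths(root):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the project root (the STAViS/ folder).
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Data layout looks complete.")
```

If any entry is reported as missing, re-running the download script or fetching the files from the project page is the likely fix.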
Confirm all options for the STAViS parameters:
python main.py -h
If you use fewer than our default 4 GPUs, you should modify the options --gpu_devices 0,1,2,3 --batch_size 128 --n_threads 12 accordingly.
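One possible way to adapt these options is to scale them linearly with the number of GPUs. This is only a sketch under that assumption: the helper `scaled_options` is hypothetical, and the README does not prescribe linear scaling, so the resulting values may still need tuning for your hardware.

```python
# Defaults quoted in this README for a 4-GPU machine.
DEFAULT_GPUS = 4
DEFAULT_BATCH_SIZE = 128
DEFAULT_N_THREADS = 12


def scaled_options(n_gpus):
    """Build the option string, scaling the defaults linearly (an assumption)."""
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    devices = ",".join(str(i) for i in range(n_gpus))
    batch_size = DEFAULT_BATCH_SIZE * n_gpus // DEFAULT_GPUS
    n_threads = max(1, DEFAULT_N_THREADS * n_gpus // DEFAULT_GPUS)
    return "--gpu_devices {} --batch_size {} --n_threads {}".format(
        devices, batch_size, n_threads
    )


# Example: options for a machine with 2 GPUs.
print(scaled_options(2))
```

For 2 GPUs this prints `--gpu_devices 0,1 --batch_size 64 --n_threads 6`. Note that a smaller batch size can change training behaviour, so results may differ from those obtained with the defaults.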
- Train STAViS audiovisual models for all splits and produce the resulting saliency maps for the test sets:
bash run_all_splits_audiovisual_train_test.sh
- Produce saliency maps for all splits' test sets using our trained STAViS audiovisual models:
bash run_all_splits_audiovisual_test.sh
- Produce saliency maps for all splits' test sets using our trained STAViS visual only models (for comparisons):
bash run_all_splits_visual_only_test.sh
For the computation of the different measures employed in the evaluation, we used MATLAB functions from (https://github.com/cvzoya/saliency/tree/master/code_forMetrics):
git clone https://github.com/cvzoya/saliency.git
mv saliency/code_forMetrics ./eval_code/
rm -rf saliency
The main evaluation script is compute_all_databases.sh, which takes two arguments: the full path of the project root and the path where the network predictions are saved.
For example, if the project root folder is /home/test/STAViS and the experiment's predictions are saved at experiments/audiovisual_train_test/:
sh compute_all_databases.sh /home/test/STAViS experiments/audiovisual_train_test/
This script creates 6 scripts, one for each database, each containing the individual MATLAB experiments for the evaluation of each video. (Note that, if you are familiar with a framework like Grid Engine, these scripts can be run in parallel to save computation time.) The results per video and split are saved in the experiment folder, under the name results_per_video. After all evaluations have finished, the results are gathered per database, and a final result for each metric is printed on the screen and written to a file called final_results_$databasename.txt
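The gathering step can be illustrated with a small sketch. It is not the repository's implementation (the actual evaluation runs in MATLAB), and it assumes that the final result for a metric is the plain mean over the videos of a database. The function name `mean_per_metric`, the in-memory data layout, and the numbers are all hypothetical.

```python
from collections import defaultdict


def mean_per_metric(per_video):
    """Average each metric over all videos of one database."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_video.values():
        for metric, value in scores.items():
            totals[metric] += value
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}


# Made-up scores for two hypothetical videos, for illustration only.
example = {
    "video_a": {"CC": 0.50, "NSS": 2.0},
    "video_b": {"CC": 0.70, "NSS": 3.0},
}

for metric, value in sorted(mean_per_metric(example).items()):
    print("{}: {:.3f}".format(metric, value))
```

Counting per metric rather than per video keeps the average well defined even if a metric is missing for some video.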
Our code is released under the MIT license.
Please contact Antigoni Tsiami at antsiami@cs.ntua.gr in case you have any questions or suggestions.