This repository provides the PyTorch implementation for the ECCV2022 paper "Audio-Visual Segmentation" . This paper proposes the audio-visual segmentation (AVS) problem and the AVSBench dataset accordingly. [Project Page] [arXiv]
Recently, we expanded the AVS task to include one more challenging setting, i.e., the fully-supervised audio-visual semantic segmentation (AVSS) that requires generating semantic masks of the sounding objects. Accordingly, we collected a new AVSBench-semantic dataset. Please refer to our arXiv paper ''Audio-Visual Segmentation with Semantics'' for more details. [Online Benchmark]
- (2023.1.31) The AVSBench-semantic dataset has been released, you can download it from our official benchmark website. Please refer to our arXiv paper for more details of this dataset.
- (2022.10.18) We have completed the collection and annotation of AVSBench-semantic. Compared to original AVSBench dataset, it contains ~7k more multi-source videos covering 70 categories, and the ground truths are provided in the form of multi-label semantic maps (labels of origianl AVSBench dataset are also updated). We will release it as soon as possible.
- (2022.7.13) We are preparing the AVSBench-semantic dataset that will pay more attention to multi-source situation and provide semantic annotations.
The AVSBench dataset is first proposed in our ECCV paper. It contains a Single-source and a Multi-sources subset. Ground truths of these two subsets are binary segmentation maps indicating pixels of the sounding objects. Recently, we collected a new Semantic-labels subset that provides semantic segmentation maps as labels. We add it to the original AVSBench dataset as the third subset. For convenience, we denote the original AVSBench dataset as AVSBench-object, and the newly added Semantic-labels subset as AVSBench-semantic.
AVSBench-object is used for the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3), while AVSBench-semantic is used for the Audio-Visual Semantic Segmentation (AVSS).
The updated AVSBench dataset is available at http://www.avlbench.opennlplab.cn/download. You may request the dataset by mail at opennlplab@gmail.com.. We will reply as soon as we receive the application.
These downloaded data should be placed to the directory avsbench_data
.
The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from here and placed to the directory pretrained_backbones
.
Notice: please update the path of data and pretrained backbone in avs_s4/config.py
, avs_ms3/config.py
, and avss/config.py
accordingly.
- Train AVS Model
cd avs_scripts/avs_s4
bash train.sh
- Test AVS Model
cd avs_scripts/avs_s4
bash test.sh
- Train AVS Model
cd avs_scripts/avs_ms3
bash train.sh
- Test AVS Model
cd avs_scripts/avs_ms3
bash test.sh
- Train AVS model
cd avs_scripts/avss
bash train.sh
- Test AVS model
cd avs_scripts/avss
bash test.sh
Notably, the AVSS setting can be viewed as an independent task by the research community, i.e., the audio-visual semantic segmentation task. The pretrained AVSS models are available at here.
If you use this dataset or code, please consider cite following papers:
@inproceedings{zhou2022avs,
title = {Audio-Visual Segmentation},
author = {Zhou, Jinxing and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
booktitle = {European Conference on Computer Vision},
year = {2022}
}
@article{zhou2023avss,
title={Audio-Visual Segmentation with Semantics},
author={Zhou, Jinxing and Shen, Xuyang and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
journal={arXiv preprint arXiv:2301.13190},
year={2023},
}
This project is released under the Apache 2.0 license as found in the LICENSE file.