Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee and Kwanghoon Sohn
The official PyTorch implementation of ACVIS.
Install the dependencies:

```bash
conda create --name acvis python=3.8 -y
conda activate acvis
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -U opencv-python
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install -r requirements.txt
pip install timm
```

Compile the deformable attention CUDA ops used by the pixel decoder:

```bash
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
```

Download and unzip the AVISeg datasets and put them in `./datasets`.
Download and unzip the pre-trained backbones from OneDrive and put them in `./pre_models`.
Download the following checkpoints and put them in `./checkpoints`.
| Backbone | Pre-trained Datasets | mAP | HOTA | FSLA | Model Weight |
|---|---|---|---|---|---|
| ResNet-50 | ImageNet | 42.01 | 62.04 | 42.43 | ACVIS_R50_IN.pth |
| ResNet-50 | ImageNet & COCO | 46.64 | 65.02 | 46.72 | ACVIS_R50_COCO.pth |
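Before launching training or evaluation, it can help to confirm that the directory layout above is in place. Below is a minimal sketch using only the Python standard library; the required paths come from the setup steps above and the checkpoint names from the table (newer releases may use different filenames):

```python
from pathlib import Path

# Expected layout from the setup steps above; checkpoint names are taken
# from the table and may change with future releases.
REQUIRED = [
    "datasets",                      # AVISeg datasets
    "pre_models",                    # pre-trained backbones
    "checkpoints/ACVIS_R50_IN.pth",
    "checkpoints/ACVIS_R50_COCO.pth",
]

def check_layout(root="."):
    """Return the required paths that are missing under root."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout OK")
```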
Training:

```bash
python train_net.py --num-gpus 2 --config-file configs/acvis/acvis_saoc.yaml
```

Evaluation:

```bash
python train_net.py --config-file configs/acvis/acvis_saoc.yaml --eval-only MODEL.WEIGHTS checkpoints/ACVIS_R50_COCO.pth
```

Demo:

```bash
python demo_video/demo.py --config-file configs/acvis/acvis_saoc.yaml --opts MODEL.WEIGHTS checkpoints/ACVIS_R50_COCO.pth
```
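To run the demo over several clips in one go, the command above can be wrapped in a small driver script. This is a sketch using only the standard library; `--input` is an assumed flag and `demo_videos/` a hypothetical folder, so check `demo_video/demo.py` for its actual interface:

```python
import subprocess
from pathlib import Path

CONFIG = "configs/acvis/acvis_saoc.yaml"
WEIGHTS = "checkpoints/ACVIS_R50_COCO.pth"

def demo_cmd(video):
    # Mirrors the demo command above; "--input" is an *assumed* flag of
    # demo_video/demo.py -- adjust to the script's real interface.
    return ["python", "demo_video/demo.py",
            "--config-file", CONFIG,
            "--input", str(video),
            "--opts", "MODEL.WEIGHTS", WEIGHTS]

def run_all(video_dir="demo_videos"):
    # Run the demo on every .mp4 under video_dir, stopping on failure.
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(demo_cmd(video), check=True)
```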
If you find our work useful, please consider citing:

```bibtex
@misc{seo2025acvis,
      title={Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation},
      author={Jinbae Seo and Hyeongjun Kwon and Kwonyoung Kim and Jiyoung Lee and Kwanghoon Sohn},
      year={2025},
      eprint={2509.22740},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2509.22740},
}
```
Our implementation is based on Detectron2, Mask2Former, VITA, and AVIS. Thanks for their great work.
