Audio-Visual Class-Incremental Learning

We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition, and propose AV-CIL, a method for this setting. [paper]

[Figure: AV-CIL]

Environment

We conduct experiments with Python 3.8.13 and PyTorch 1.13.0.

To set up the environment, simply run

pip install -r requirements.txt

Datasets

AVE

The original AVE dataset can be downloaded via this link.

Please put the downloaded AVE videos in ./raw_data/AVE/videos/.

Kinetics-Sounds

The original Kinetics dataset can be downloaded via this link. After downloading it, please use our provided video id list (here) to extract the Kinetics-Sounds subset used in our experiments (see the sketch below).

Please put the downloaded videos in ./raw_data/kinetics-sounds/videos/.
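As an illustration, here is a minimal sketch of the id-list filtering, assuming the list is a plain-text file with one video id per line and that downloaded videos are named by their id; the file and folder names below are hypothetical, so adjust them to your local layout:

import shutil
from pathlib import Path

# Hypothetical names: replace with the actual id-list file and download folder.
ids = set(Path('ks_video_ids.txt').read_text().split())
src = Path('./raw_data/kinetics/videos')
dst = Path('./raw_data/kinetics-sounds/videos')
dst.mkdir(parents=True, exist_ok=True)

for video in src.glob('*.mp4'):
    # keep only the videos whose id appears in the Kinetics-Sounds list
    if video.stem in ids:
        shutil.copy(video, dst / video.name)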

VGGSound100

The original VGGSound dataset can be downloaded via this link. After downloading it, please use our provided video id list (here) to extract the VGGSound100 subset used in our experiments (the same filtering sketch as above applies).

Please put the downloaded videos in ./raw_data/VGGSound/videos/.

Extract audio and frames

After downloading the datasets to the folders above, please run the following command to extract the audio and frames:

sh extract_audios_frames.sh 'dataset'

where 'dataset' is one of [AVE, ksounds, VGGSound_100].
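For reference, this kind of extraction typically boils down to two ffmpeg calls per video; the sketch below is illustrative only, and the sampling rate, frame rate, and paths are assumptions rather than the script's exact settings:

# extract the audio track as mono WAV (16 kHz is an assumed rate)
ffmpeg -i input_video.mp4 -vn -ac 1 -ar 16000 audio.wav
# dump video frames at 1 fps (the script's actual rate may differ)
ffmpeg -i input_video.mp4 -vf fps=1 frames/frame_%04d.jpg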

Pre-trained models

For the audio encoder, please download the pre-trained AudioMAE and put it in ./model/pretrained/.

Feature extraction

For the pre-trained audio feature extraction, please run

sh extract_pretrained_features.sh 'dataset'

where 'dataset' is one of [AVE, ksounds, VGGSound_100].

For the AudioMAE runtime environment, we follow the official implementation and use timm==0.3.2, which needs a fix to work with PyTorch 1.8.1+.
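The fix referenced by the official MAE/AudioMAE code is a small patch to timm/models/layers/helpers.py, whose import of the removed torch._six module breaks on newer PyTorch; the commonly applied replacement is a version check:

import torch
TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

# fall back to torch._six only on old PyTorch; use the stdlib otherwise
if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    from torch._six import container_abcs
else:
    import collections.abc as container_abcs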

(Optional) Use our extracted features directly

We have also released the pre-trained features, so you can use them directly instead of pre-processing and extracting them from the raw data: AVE, Kinetics-Sounds [part-1, part-2, part-3], VGGSound100 [part-1, part-2, part-3, part-4, part-5, part-6].

For Kinetics-Sounds and VGGSound100, please download all the parts and concatenate them before unzipping.
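For example, assuming the downloaded parts are named kinetics_sounds_features.zip.part-1 through part-3 (the actual file names may differ):

cat kinetics_sounds_features.zip.part-* > kinetics_sounds_features.zip
unzip kinetics_sounds_features.zip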

After obtaining the pre-trained audio and visual features, please put them in ./data/'dataset'/audio_pretrained_feature/ and ./data/'dataset'/visual_pretrained_feature/, respectively.

Training & Evaluation

For the vanilla fine-tuning strategy, please run

sh run_incremental_fine_tuning.sh 'dataset' 'modality'

where 'dataset' is one of [AVE, ksounds, VGGSound_100] and 'modality' is one of [audio, visual, audio-visual].
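For example, to fine-tune on AVE with both modalities:

sh run_incremental_fine_tuning.sh AVE audio-visual

The same 'dataset' and 'modality' arguments apply to the scripts below.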

For the upper bound, please run

sh run_incremental_upper_bound.sh 'dataset' 'modality'

For LwF, please run

sh run_incremental_lwf.sh 'dataset' 'modality'

For iCaRL, please run

sh run_incremental_icarl.sh 'dataset' 'modality' 'classifier'

where 'classifier' is one of [NME, FC].
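For example, to run iCaRL on Kinetics-Sounds with the NME classifier:

sh run_incremental_icarl.sh ksounds audio-visual NME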

For SS-IL, please run

sh run_incremental_ssil.sh 'dataset' 'modality'

For AFC, please run

sh run_incremental_afc.sh 'dataset' 'modality' 'classifier'

where 'classifier' is one of [NME, LSC].

For our AV-CIL, please run

sh run_incremental_ours.sh 'dataset'
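For example, on VGGSound100:

sh run_incremental_ours.sh VGGSound_100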

Citation

If you find this work useful, please consider citing it.

@inproceedings{pian2023audio,
  title={Audio-Visual Class-Incremental Learning},
  author={Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7799--7811},
  year={2023}
}
