Weakly Supervised Sound Separation via Bi-Modal Semantic Similarity (ICLR 2024)

Official PyTorch Implementation

We propose a weakly supervised learning framework for conditional audio separation from natural mixtures (i.e., when no single-source sounds are available). In particular, we leverage bi-modal semantic similarity (from a pre-trained CLAP model) to generate weak supervision for fine-grained source separation without access to single-source sounds.

Tanvir Mahmud*†, Saeed Amizadeh†, Kazuhito Koishida, Diana Marculescu

In ICLR 2024. (* Work done in part during an internship at Microsoft Corporation, Redmond, USA, † equal contribution)

WebDemo | OpenReview | arXiv

(Left) The proposed conditional audio separation framework. (Right) The comparison of our framework and the mix-and-separate baseline in unsupervised and semi-supervised settings.

Setting Up Environments

    conda create -n bisep python==3.9.12
    conda activate bisep
    pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
    pip install -r requirements.txt
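
After installation, it can help to confirm which package versions actually landed in the environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is illustrative and is not part of this repository.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version string, or None if absent}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Versions pinned by the pip command in this section.
    expected = {
        "torch": "1.11.0+cu113",
        "torchvision": "0.12.0+cu113",
        "torchaudio": "0.11.0",
    }
    found = installed_versions(expected)
    for name, want in expected.items():
        got = found[name]
        # Compare only the part before any local tag such as "+cu113".
        ok = got is not None and got.startswith(want.split("+")[0])
        print(f"{name}: expected {want}, found {got} [{'ok' if ok else 'check'}]")
```

A `None` entry means the package is not installed in the active environment, which usually indicates that `conda activate bisep` was skipped.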

Dataset Preparation

AudioCaps Dataset

The AudioCaps dataset can be downloaded from AudioCaps. We provide AudioCaps captions and parsed sound source phrases here. Download this file and put it into the data/audiocaps/annotations folder. We also provide the script used to parse the sound sources from AudioCaps captions in preprocessing/audiocaps/parser.py. Then, use the preprocessing/audiocaps/download_audios.py script to download all audio files, and put the downloaded audio into the data/audiocaps/audio directory. Afterwards, prepare the train and test split ids using preprocessing/audiocaps/create_split.py, and put the csv files in the data/audiocaps/annotations directory.

VGGSound Dataset

Download the audio from the VGGSound dataset and put it in the data/vggsound/audio directory. We also provide a sample download script in preprocessing/vggsound/download_audios.py. Then, prepare the annotation files test.csv and train.csv using preprocessing/vggsound/create_index_files.py. Afterwards, prepare the test compositions using preprocessing/vggsound/create_test_composition.py. Put all annotations in the data/vggsound/annotations directory.

Music Dataset

Download the videos from the MUSIC dataset and put them in the data/music/video directory. Then, extract audio and frames using extract_audios.py and extract_frames.py, provided in the preprocessing/music directory. Afterwards, prepare test.csv and train.csv using preprocessing/music/create_index_files.py. Finally, prepare the test compositions file test_sep_2.csv using preprocessing/music/create_test_composition.py. Put all annotations in the data/music/annotations directory.
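
For readers who want to see what the audio-extraction step amounts to, here is a minimal sketch built on the `ffmpeg` command-line tool. It is not the repository's extract_audios.py. The `.mp4` extension, the per-class subfolder layout under the video directory, mono output, and the 16 kHz default sample rate are all assumptions made for illustration; check the provided script for the settings actually used.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_audio_cmd(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that writes a mono WAV at `sample_rate` Hz."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # one audio channel (assumed: mono)
        "-ar", str(sample_rate),   # output sample rate in Hz (assumed: 16000)
        str(wav_path),
    ]


def extract_all(video_dir, audio_dir, sample_rate=16000):
    """Mirror <video_dir>/<class>/<id>.mp4 to <audio_dir>/<class>/<id>.wav."""
    for video in sorted(Path(video_dir).rglob("*.mp4")):
        wav = Path(audio_dir) / video.relative_to(video_dir).with_suffix(".wav")
        wav.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(ffmpeg_extract_audio_cmd(video, wav, sample_rate), check=True)
```

Calling `extract_all("data/music/video", "data/music/audio")` would then populate the audio directory, provided `ffmpeg` is on the PATH.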

The data directory should look like this:

    Data_Directory/
    ├── AudioCaps/
    │    ├── annotations/
    │    │   ├── parsed_all_caps.json
    │    │   ├── train_ids.csv
    │    │   └── test_sep2_ids.csv
    │    └── audio/
    │        ├── __0Fp4K-2Ew_60.wav
    │        ├── __8O7tZPwsI_20.wav
    │        └── __LerxtZ9ac_0.wav
    │
    ├── MUSIC/
    │    ├── annotations/
    │    │   ├── train.csv
    │    │   ├── test.csv
    │    │   └── test_sep_2.csv
    │    └── audio/
    │        ├── accordion
    │        │   ├── -DlGdZNAsxA.wav
    │        │   └── _jPFkOkNjuo.wav
    │        ├── acoustic_guitar
    │
    ├── VGGSound/
    │    ├── annotations/
    │    │   ├── train.csv
    │    │   ├── test.csv
    │    │   └── test_sep_2.csv
    │    └── audio/
    │        ├── accordion
    │        │   ├── -DlGdZNAsxA.wav
    │        │   └── _jPFkOkNjuo.wav
    │        ├── acoustic_guitar
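
Before launching training, the layout in the tree above can be checked programmatically. The following is a minimal sketch, not part of the repository: the function name `check_layout` is hypothetical, and the directory and file names are copied from the tree, so adjust them if your local folders use different names or casing.

```python
from pathlib import Path

# Expected annotation files per dataset, copied from the directory tree.
EXPECTED_ANNOTATIONS = {
    "AudioCaps": ["parsed_all_caps.json", "train_ids.csv", "test_sep2_ids.csv"],
    "MUSIC": ["train.csv", "test.csv", "test_sep_2.csv"],
    "VGGSound": ["train.csv", "test.csv", "test_sep_2.csv"],
}


def check_layout(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []
    for dataset, files in EXPECTED_ANNOTATIONS.items():
        for name in files:
            path = root / dataset / "annotations" / name
            if not path.is_file():
                problems.append(f"missing annotation file: {path}")
        audio_dir = root / dataset / "audio"
        # rglob covers both the flat layout and the per-class subfolder layout.
        if not audio_dir.is_dir() or not any(audio_dir.rglob("*.wav")):
            problems.append(f"no .wav files under: {audio_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_layout("Data_Directory"):
        print(problem)
```

An empty result means every listed annotation file exists and each audio folder contains at least one .wav file; it does not verify file contents.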

Training Script

Here, a sample training script is provided for the AudioCaps dataset. We also provide scripts for other datasets in the scripts/ directory.

    python main.py --id Proposed_AC --mode train --list_train data/annotations/train_ids.csv \
                    --list_test data/annotations/test_sep2_ids.csv --audio_dir data/audio \
                    --cond_layer sca --num_cond_blocks 1 --num_res_layers 1 --num_head 8 \
                    --cond_dim 768 --num_downs 7 --num_channels 32 --num_mix 2 --audLen 131070 \
                    --audRate 16000 --workers 4 --batch_size 16 --lr 1e-4 --num_epoch 200 \
                    --lr_step 15 --disp_iter 20 --ckpt outputs --multiprocessing_distributed \
                    --ngpu 8 --recons_weight 5 --dist-url tcp://localhost:12341 \
                    --warmup_epochs 1 --eval_epoch 2 --n_sources 3 \
                    --parsed_sources_path data/annotations/parsed_all_caps.json
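
As a quick sanity check on the audio settings in the command above, the clip duration follows directly from the two flags. This short sketch assumes, based on the flag names only, that `--audLen` is the clip length in samples and `--audRate` is the sample rate in Hz; the helper name `clip_seconds` is illustrative.

```python
def clip_seconds(num_samples, sample_rate):
    """Duration in seconds of a clip with `num_samples` samples at `sample_rate` Hz."""
    return num_samples / sample_rate


# Values taken from --audLen 131070 and --audRate 16000.
print(clip_seconds(131070, 16000))  # 8.191875
```

Under that assumption, each training example covers roughly 8.2 seconds of audio.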

Test Script

Here, a sample test script is provided for the AudioCaps dataset. We also provide scripts for other datasets in the scripts/ directory.

    python main.py --id Proposed_AC --mode test --list_train data/annotations/train_ids.csv \
                    --list_test data/annotations/test_sep2_ids.csv --audio_dir data/audio_16k \
                    --cond_layer sca --num_cond_blocks 1 --num_res_layers 1 --num_head 8 \
                    --cond_dim 768 --num_downs 7 --num_channels 32 --num_mix 2 --audLen 131070 \
                    --audRate 16000 --workers 4 --batch_size 16 --lr 1e-4 --num_epoch 200 \
                    --lr_step 15 --disp_iter 20 --ckpt outputs --multiprocessing_distributed \
                    --ngpu 8 --recons_weight 5 --dist-url tcp://localhost:12341 \
                    --warmup_epochs 1 --eval_epoch 2 --n_sources 3 \
                    --parsed_sources_path data/annotations/parsed_all_caps.json

Demo

Pre-trained Models

Please download the pretrained models from model_weights and put them in the ./pretrained_weights directory.

You can simply run the demo without setting up the dataset.

    python demo.py  --cond_layer sca --num_cond_blocks 1 --num_res_layers 1 --num_head 8 \
                    --cond_dim 768 --num_downs 7 --num_channels 32 --audLen 131070 \
                    --audRate 16000 --workers 4 --multiprocessing_distributed \
                    --ngpu 1 --dist-url tcp://localhost:12342 --samples_dir demo_samples \
                    --load pretrained_weights/model_weights.pth.tar
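
The demo command above reads two local inputs: the checkpoint passed to `--load` and the folder passed to `--samples_dir`. A small pre-flight check such as the sketch below can catch a missing download early. The function name `demo_preflight` is hypothetical and not part of the repository; the default paths are copied from the command.

```python
from pathlib import Path


def demo_preflight(weights="pretrained_weights/model_weights.pth.tar",
                   samples_dir="demo_samples"):
    """Return a list of missing inputs referenced by the demo command."""
    problems = []
    if not Path(weights).is_file():
        problems.append(f"weights not found: {weights}")
    samples = Path(samples_dir)
    if not samples.is_dir():
        problems.append(f"samples directory not found: {samples_dir}")
    elif not any(samples.iterdir()):
        problems.append(f"samples directory is empty: {samples_dir}")
    return problems


if __name__ == "__main__":
    issues = demo_preflight()
    print("ready" if not issues else "\n".join(issues))
```

The check only confirms that the paths exist; it does not load or validate the checkpoint.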

Citing

Please cite our paper if you find this repository useful.

@inproceedings{mahmud2024weakly,
    title={Weakly-supervised Audio Separation via Bi-modal Semantic Similarity},
    author={Mahmud, Tanvir and Amizadeh, Saeed and Koishida, Kazuhito and Marculescu, Diana},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=4N97bz1sP6}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Acknowledgement

Our code is based on the implementations of SoP, CLAP, and CLIPSep. We used pre-trained audio-language grounding models from CLAP. We thank the authors for sharing their code. If you use our code, please also cite their work.
