This repository contains the PyTorch implementation of our ECCV-22 paper and the associated datasets:
Active Audio-Visual Separation of Dynamic Sound Sources
Sagnik Majumder and Kristen Grauman
The University of Texas at Austin, Meta AI Research
Project website: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation
Move2Hear: Active Audio-Visual Source Separation
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple time-varying audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target.
This code has been tested with python 3.6.13, habitat-api 0.1.4, habitat-sim 0.1.4 and torch 1.4.0. Additional python package requirements will be made available in requirements.txt.
First, install the required versions of habitat-api, habitat-sim and torch inside a conda environment.
Next, install the remaining dependencies either by running
pip3 install -r requirements.txt
or by parsing requirements.txt to get the names and versions of the individual dependencies and installing them one by one.
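The second option, parsing requirements.txt, can be sketched as follows. This is a minimal illustration that assumes the file uses plain `name==version` pins. The function name `parse_requirements` and the sample package names are hypothetical and not part of this repository.

```python
def parse_requirements(text):
    """Return (name, version) pairs from requirements-style text.

    Only plain `name==version` pins are split; any other line is
    returned unchanged as the name, with version None.
    """
    pins = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pins.append((name.strip(), version.strip()))
        else:
            pins.append((line, None))
    return pins


# Made-up sample content, only to show the output shape.
sample = "example-package==1.0.0\n# a comment\nanother-package\n"
for name, version in parse_requirements(sample):
    spec = name if version is None else "{}=={}".format(name, version)
    print("pip3 install", spec)
```

To use it on the real file, read requirements.txt and pass its contents to the function. Each printed line is a command that can be run on its own.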
Download the project-specific datasets from this link, extract the tar.gz and put it under the project root. The extracted data directory should contain 3 types of data:
- audio_data: the pre-processed and pre-normalized raw monaural audio waveforms for training and evaluation
- passive_datasets: the dataset (audio source and receiver pair spatial attributes) for pre-training of passive separators
- active_datasets: the dataset (episode specification) for training of Move2Hear policies
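A quick way to confirm the extraction worked is to check for the three subdirectories listed above. The sketch below assumes the extracted directory is named data and sits under the project root. The helper name `missing_data_subdirs` is hypothetical.

```python
import os

# Subdirectory names taken from the list above.
EXPECTED_SUBDIRS = ("audio_data", "passive_datasets", "active_datasets")


def missing_data_subdirs(data_dir="data"):
    """Return the expected subdirectories that are absent from data_dir."""
    return [name for name in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(data_dir, name))]
```

An empty list from `missing_data_subdirs()` means all three data types are in place.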
Make a directory named sound_spaces and place it in the directory that contains the project root. Download the SoundSpaces Matterport3D binaural RIRs and metadata, and extract them into directories named sound_spaces/binaural_rirs/mp3d and sound_spaces/metadata/mp3d, respectively.
Download the Matterport3D dataset, and cache the observations relevant for the SoundSpaces simulator using this script from the SoundSpaces repository. Use a resolution of 128 x 128 for both the RGB and depth sensors. Place the cached observations for all scenes (.pkl files) in sound_spaces/scene_observations. Also, copy or symlink the scene directories under habitat_data/v1/tasks/mp3d_with_semantics of the downloaded MP3D dataset folder, which contain .glb, .house, .navmesh and .ply files, into data/scene_datasets/mp3d.
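The symlink step for the scene directories can be sketched as below. This is an illustrative helper, not a script from this repository: the name `link_scene_dirs` is hypothetical, and the source path depends on where the MP3D dataset was downloaded.

```python
import os


def link_scene_dirs(src_root, dst_root):
    """Symlink every scene directory in src_root into dst_root.

    Entries that already exist in dst_root are left untouched.
    Returns the names of the directories that were linked.
    """
    os.makedirs(dst_root, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(src_root)):
        src = os.path.abspath(os.path.join(src_root, name))
        dst = os.path.join(dst_root, name)
        # Only link directories, and never overwrite an existing entry.
        if os.path.isdir(src) and not os.path.lexists(dst):
            os.symlink(src, dst, target_is_directory=True)
            linked.append(name)
    return linked
```

Run from the project root, the call would take the habitat_data/v1/tasks/mp3d_with_semantics folder of the downloaded dataset as `src_root` and data/scene_datasets/mp3d as `dst_root`.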
For further information about the structure of the associated datasets, refer to audio_separation/config/default.py and the task configs.
CUDA_VISIBLE_DEVICES=0 python3 main.py --exp-config audio_separation/config/pretrain_passive.yaml --model-dir runs/passive_pretrain --run-type train NUM_PROCESSES 1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u -m torch.distributed.launch --use_env --nproc_per_node 8 main.py --exp-config audio_separation/config/train/nearTarget.yaml --model-dir runs/active_train/near_target --run-type train NUM_PROCESSES 14
First, link the checkpoints using scripts/search_for_checkpoint_thru_validation/link_ckpts_for_val.ipynb to search for the best checkpoint on the basis of validation.
Then, run
CUDA_VISIBLE_DEVICES=0 python3 main.py --exp-config audio_separation/config/val/nearTarget.yaml --model-dir runs_val/active/near_target --run-type eval NUM_PROCESSES 1
Search for the best checkpoint using scripts/search_for_checkpoint_thru_validation/find_bestCkpt_lowestValSTFTLoss.ipynb.
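The selection rule implied by the notebook name (lowest validation STFT loss) amounts to an argmin over checkpoints. The sketch below shows only that rule. How the notebook actually reads the validation results is not described here, so the input mapping and the name `best_checkpoint` are assumptions.

```python
def best_checkpoint(val_losses):
    """Return (checkpoint_name, loss) for the lowest validation STFT loss.

    val_losses is a hypothetical mapping from checkpoint name to its
    validation loss, collected by whatever means is convenient.
    """
    if not val_losses:
        raise ValueError("no validation results given")
    name = min(val_losses, key=val_losses.get)
    return name, val_losses[name]
```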
For unheard sounds, use config/val/nearTarget_unheard.yaml and the corresponding validation directory.
First, copy the best checkpoint from the search to runs_test/active/near_target.
Then, run
CUDA_VISIBLE_DEVICES=0 python3 main.py --exp-config audio_separation/config/test/nearTarget.yaml --model-dir runs_test/active/near_target --run-type eval NUM_PROCESSES 1
Compute the test metric (STFT l2 loss or SI-SDR) values using scripts/separated_audio_quality/compute_separation_qualtiy.ipynb.
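For readers unfamiliar with SI-SDR, the sketch below computes the standard scale-invariant signal-to-distortion ratio for a pair of waveforms in plain Python. It is an illustration of the metric's common definition, assuming zero-mean signals of equal length. It is not taken from the notebook, which may differ in details such as normalization or batching.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal is all zeros")
    # Project the estimate onto the reference to get the scaled target.
    scale = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [scale * r for r in reference]
    # Everything in the estimate that is not the scaled target is distortion.
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    if noise_energy == 0:
        return math.inf  # perfect reconstruction up to scale
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / noise_energy)
```

Because the target is rescaled by the projection, multiplying the estimate by any nonzero constant leaves the value unchanged. Higher values indicate better separation.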
For unheard sounds, use config/test/nearTarget_unheard.yaml and the corresponding test directory.
Download model checkpoints from this link.
@inproceedings{majumder2022active,
title={Active audio-visual separation of dynamic sound sources},
author={Majumder, Sagnik and Grauman, Kristen},
booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXIX},
pages={551--569},
year={2022},
organization={Springer}
}
This project is released under the MIT license, as found in the LICENSE file.