Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation (ECCV 2020)
Hang Zhou*, Xudong Xu*, Dahua Lin, Xiaogang Wang, and Ziwei Liu.
We propose to integrate the tasks of stereophonic audio generation and audio source separation into a unified framework, Sep-Stereo, which leverages widely available mono audio to facilitate the training of stereophonic audio generation. Moreover, we design the Associative Pyramid Network (APNet), which better associates visual features with audio features through a learned Associative-Conv operation, improving performance on both tasks.
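For intuition, the following is a minimal PyTorch sketch of one way such a visual-audio association can be realized: each visual feature vector is treated as a 1x1 convolution kernel and slid over the audio feature map, so every visual location yields one associated channel. The function name `associative_conv` and all tensor shapes here are illustrative assumptions, not the exact API of this repository.

```python
import torch
import torch.nn.functional as F

def associative_conv(audio_feat, visual_feat):
    """Sketch of an Associative-Conv-style operation.

    audio_feat:  (B, C, H, W)  audio feature map (e.g., from a spectrogram encoder)
    visual_feat: (B, C, K)     K visual feature vectors of dimension C
    returns:     (B, K, H, W)  one association map per visual vector
    """
    B, C, H, W = audio_feat.shape
    K = visual_feat.shape[2]
    # Turn the visual vectors into per-sample 1x1 conv kernels: (B*K, C, 1, 1).
    kernels = visual_feat.permute(0, 2, 1).reshape(B * K, C, 1, 1)
    # A grouped convolution applies each sample's own kernels to its own
    # audio feature map in a single call.
    out = F.conv2d(audio_feat.reshape(1, B * C, H, W), kernels, groups=B)
    return out.reshape(B, K, H, W)

# Example: 16 visual vectors associated with a 64-channel audio feature map.
audio = torch.randn(2, 64, 128, 128)
visual = torch.randn(2, 64, 16)
print(associative_conv(audio, visual).shape)  # torch.Size([2, 16, 128, 128])
```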
- Python 3.6 is used. Basic requirements are listed in 'requirements.txt':
pip install -r requirements.txt
FAIR-Play can be accessed here. YT-Music can be accessed here.
MUSIC21 can be accessed here. As illustrated in our supplementary material, it is best to choose the instrument categories that also appear in the stereo dataset, such as cello, trumpet, and piano.
All the training and testing bash scripts can be found in './scripts'. Before training, please prepare the stereo data following the instructions in FAIR-Play. For the MUSIC21 dataset, please split the videos into 10-second clips (see the sketch below) and organize the data split as in './data/dummy_MUSIC_split'.
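Below is a minimal sketch for cutting videos into 10-second clips, assuming ffmpeg/ffprobe are on your PATH; the paths and clip-naming scheme are placeholders to adapt to your own layout and split files.

```python
import subprocess
from pathlib import Path

def split_video(src, dst_dir, clip_len=10):
    """Cut one video into consecutive clips of clip_len seconds."""
    src, dst_dir = Path(src), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    # Probe the total duration (in seconds) with ffprobe.
    duration = float(subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", str(src)]))
    # Drop any trailing remainder shorter than clip_len.
    for i, start in enumerate(range(0, int(duration) - clip_len + 1, clip_len)):
        out = dst_dir / f"{src.stem}_{i:04d}.mp4"
        # Stream-copy to avoid re-encoding; note that seeking with -ss before
        # -i snaps to keyframes, which is usually acceptable for data prep.
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-i", str(src),
                        "-t", str(clip_len), "-c", "copy", str(out)], check=True)

split_video("path/to/video.mp4", "path/to/clips")
```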
This software is released under the CC-BY-4.0 license.
@inproceedings{zhou2020sep,
title={Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation},
author={Zhou, Hang and Xu, Xudong and Lin, Dahua and Wang, Xiaogang and Liu, Ziwei},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2020}
}
The structure of this codebase is borrowed from 2.5D Visual Sound.