Code is now available under the video-physics-sound-diffusion folder!
- metadata
- RGB frames
- audio
- pre-trained residual prediction model
- extracted sound physics and residual parameters
- extracted visual features
- extracted physics latents
- pre-trained diffusion model
For questions or help, please open an issue or contact suk4 at uw dot edu
- python=3.9
- pytorch=1.12.0
- torchaudio=0.12.0
- librosa=0.9.2
- auraloss=0.2.2
- einops=0.4.1
- einops-exts=0.0.3
- numpy=1.22.3
- opencv-python=4.6.0.66
- Download the Greatest Hits dataset videos and metadata (txt files).
- Use video_to_frames.py to extract RGB frames from the videos and save them as 224x224 images. Processed RGB frames are available on Google Drive (zip file name: rgb).
- Use video_to_wav.py to extract impact sound segments from videos. Extracted audio files are available in Google Drive (zip file name: audio_data).
- Use extract_physics_params.py to extract physics parameters from the audio and save the frequency, power, decay rate, ground-truth audio, and reconstructed audio as a pickle file (a sketch of this step is shown after this list).
- Use train_test_split_process_video_data.py to segment the video frames and save train/test meta files. Processed meta files are in Google Drive (segmented_video_data_split).
- Note: we mainly use the subset annotated with the hit action and static reaction, which gives ten representative classes of materials (glass, wood, ceramic, metal, cloth, plastic, drywall, carpet, paper, and rock) and about 10k impact sounds in total. You could also try using all sounds available in the dataset; although those annotations are noisy, we find that the physics + residual combination can still reconstruct the audio reasonably well.
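For reference, here is a minimal sketch of how per-mode physics parameters (frequency, power, decay rate) could be estimated from one impact clip. It relies on simple spectral-peak picking and a log-linear decay fit, and is only an illustration, not the actual extract_physics_params.py; the mode count and helper name below are assumptions.

```python
import numpy as np
import librosa

N_MODES = 16   # assumed number of modes to keep (hypothetical)
N_FFT = 1024
HOP = 256

def estimate_mode_params(wav_path, sr=22050):
    """Return a list of {freq, power, decay_rate} dicts for one impact clip."""
    y, sr = librosa.load(wav_path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))   # (freq, frames)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)

    # Take the highest-energy frame as the (approximate) impact onset.
    onset = int(np.argmax(S.sum(axis=0)))
    spectrum = S[:, onset]
    mode_bins = np.argsort(spectrum)[-N_MODES:]        # strongest spectral peaks

    params = []
    for b in mode_bins:
        env = np.maximum(S[b, onset:], 1e-8)           # magnitude envelope of this bin
        if env.size < 2:
            continue
        t = np.arange(env.size) * HOP / sr
        # Model the envelope as A * exp(-d * t): fit a line to the log-magnitude.
        slope, _ = np.polyfit(t, np.log(env), 1)
        params.append({"freq": float(freqs[b]),
                       "power": float(spectrum[b] ** 2),
                       "decay_rate": float(-slope)})
    return params
```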
- Check `sound_residual.yaml` and change the data root or other settings if needed.
- Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/sound_residual_train.py --configs configs/sound_residual.yaml`.
- Once training is done, change the `resume_path` in `sound_residual.yaml` to your model path (or use the pre-trained model here), then run `CUDA_VISIBLE_DEVICES=0 python tools/sound_residual_infer.py --cfg configs/sound_residual.yaml` to save both the physics and the predicted residual parameters as pickle files. A sketch of reconstructing a sound from these parameters follows this list.
- TODO: Add a Jupyter notebook to demonstrate how to reconstruct the sound.
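Until the notebook is added, here is a minimal sketch of resynthesizing only the physics (modal) part of an impact sound from the saved parameters, as a sum of exponentially decaying sinusoids. The predicted residual would be added on top; its exact parameterization is not shown here, and the helper below is a hypothetical illustration rather than the repo's code.

```python
import numpy as np

def synthesize_modes(modes, sr=22050, duration=1.0):
    """modes: list of {'freq' (Hz), 'power', 'decay_rate' (1/s)} dicts."""
    t = np.arange(int(sr * duration)) / sr
    y = np.zeros_like(t)
    for m in modes:
        amp = np.sqrt(m["power"])
        y += amp * np.exp(-m["decay_rate"] * t) * np.sin(2 * np.pi * m["freq"] * t)
    # Peak-normalize to avoid clipping when saving to a wav file.
    return y / (np.max(np.abs(y)) + 1e-8)
```

The resulting waveform can then be written to disk, for example with `torchaudio.save` after converting it to a float tensor.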
- You must obtain the audio physics and residual parameters before training the diffusion model.
- We use visual features extracted from a pre-trained ResNet-50 + TSM classifier. We provide two types of features: 1) the features before the classifier layer, available here, and 2) the lower-dimensional logits, available here.
- Check `great_hits_spec_diff.yaml` and change the data root or other settings if needed.
- Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/train.py --cfg configs/great_hits_spec_diff.yaml`. A generic sketch of a conditional diffusion training step is shown after this list.
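For intuition, below is a generic sketch of one conditional denoising-diffusion training step on a spectrogram, conditioned on a physics latent and a visual feature. It is not the actual model trained by `tools/train.py`; the `denoiser` module, the noise schedule, and all tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, spec, physics_latent, visual_feat):
    """spec: (B, 1, F, T_spec) spectrogram of the impact sound (assumed shape)."""
    b = spec.shape[0]
    t = torch.randint(0, T, (b,), device=spec.device)
    noise = torch.randn_like(spec)
    a_bar = alphas_cumprod.to(spec.device)[t].view(b, 1, 1, 1)
    noisy = a_bar.sqrt() * spec + (1 - a_bar).sqrt() * noise
    # The denoiser predicts the injected noise given the timestep and the
    # physics + visual conditioning signals.
    pred = denoiser(noisy, t, physics_latent, visual_feat)
    return F.mse_loss(pred, noise)
```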
- Step 0: Change the `resume_path` in `great_hits_spec_diff.yaml` to your model path.
- Step 1: Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/extract_latents.py --cfg configs/great_hits_spec_diff.yaml` to extract physics latents and save them as pickle files.
- Step 2: Under the `video-physics-sound-diffusion` directory, run `CUDA_VISIBLE_DEVICES=0 python tools/query_latents.py --cfg configs/great_hits_spec_diff.yaml`, which uses the test visual features to query the closest physics latents in the training set (a sketch of this query is shown after this list).
- Step 3: Run `CUDA_VISIBLE_DEVICES=0 python tools/generate_samples.py --configs configs/great_hits_spec_diff.yaml` to generate the wave files.
- Using a pre-trained model: first download the processed data and place it under the data_root you use in the config file. Also download the model weights and place them under the logs folder. Then run Step 3 to generate samples.
- TODO: Add a Jupyter notebook for an easier demo.
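As a reference for Step 2, here is a minimal sketch of the nearest-neighbor query: a test clip's visual feature retrieves the closest training-set physics latent by cosine similarity. The actual `tools/query_latents.py` may differ; the function name and shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def query_closest_latents(test_vis, train_vis, train_latents):
    """
    test_vis:      (N_test, D)  visual features of test clips
    train_vis:     (N_train, D) visual features of training clips
    train_latents: (N_train, L) physics latents extracted in Step 1
    returns:       (N_test, L)  physics latent of the nearest training clip
    """
    test_n = F.normalize(test_vis, dim=-1)
    train_n = F.normalize(train_vis, dim=-1)
    sim = test_n @ train_n.t()            # cosine similarity, (N_test, N_train)
    nearest = sim.argmax(dim=-1)          # index of the best match per test clip
    return train_latents[nearest]
```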
If you find this repo useful for your research, please consider citing the paper:
@inproceedings{su2023physics,
  title={Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos},
  author={Su, Kun and Qian, Kaizhi and Shlizerman, Eli and Torralba, Antonio and Gan, Chuang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={9749--9759},
  year={2023}
}
Part of the code is borrowed from the following repo and we would like to thank the authors for their contribution.
We would like to thank the authors of the Greatest Hits dataset for making this dataset possible. We would like to thank Vinayak Agarwal for his suggestions on estimating physics mode parameters from raw audio. We would like to thank the authors of DiffImpact for inspiring us to use physics-based sound synthesis to design physics priors as a conditional signal that guides the deep generative model to synthesize impact sounds from videos.