Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible.
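The adaptive conditioning described above can be pictured as mapping a source/edited-video similarity score to a mask level: small visual edits keep more of the source audio, large edits give the model more freedom. The sketch below is purely illustrative — the function name, the linear thresholding scheme, and the score range are our assumptions, not the released code:

```python
def select_mask_level(similarity, max_level=4):
    """Map a source/edited-video similarity score in [0, 1] to a mask
    level in {1, ..., max_level}.

    High similarity (a small edit) -> low mask level, preserving more of
    the source audio structure; low similarity -> high mask level.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # Evenly partition [0, 1]: the most similar pairs get level 1.
    level = max_level - int(similarity * max_level)
    return max(1, min(level, max_level))
```

Any monotone mapping from similarity to mask level would serve the same purpose; the even partition here is just the simplest choice.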
The setup process is the same as in MMAudio.
Pretrained models:
The pretrained model is available at https://huggingface.co/masato-a-ishii/CoherentAVEdit/tree/main
In the following, we assume the pretrained model is stored in ./weights/
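One way to fetch the weights into that directory is via the Hugging Face CLI (requires `pip install huggingface_hub`; any other download method works equally well):

```shell
# Download the pretrained model into ./weights/
huggingface-cli download masato-a-ishii/CoherentAVEdit --local-dir ./weights
```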
Before running the script, first apply a video editing method of your choice to the source video and store the edited video somewhere accessible.
```shell
python edit_single_video.py --duration=<duration> --video=<path-to-video> --prompt="your prompt" --mask_level=<mask-level> --output=<path-to-output-dir>
```

First, you need to prepare the following:
- CSV_FILE: a CSV file used in AvED-Bench. Please refer to https://github.com/GenjiB/AVED/tree/main
- SOURCE_DIR: a directory containing the source videos of AvED-Bench.
- EDITED_VIDEO_DIR: a directory containing the edited videos.
Then, run the following.
```shell
# Hyperparameters for editing
MASK_LEVEL=-1     # -1 enables adaptive conditioning
MAX_MASK_LEVEL=4  # l_max in the paper

# Create a score file
python compute_IB_scores.py --csv-file $CSV_FILE --source-dir $SOURCE_DIR --target-dir $EDITED_VIDEO_DIR --output-csv <path-to-score-file>

# Audio editing
python edit_multiple_videos_for_benchmark.py --video_dir $EDITED_VIDEO_DIR --csv_file $CSV_FILE --score_file <path-to-score-file> --output <path-to-output-dir> --mask_level $MASK_LEVEL --min_mask_level 1 --max_mask_level $MAX_MASK_LEVEL
```

If you find this work useful, please cite:

```bibtex
@article{ishii2025coherent,
  title={Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits},
  author={Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2512.07209},
  year={2025}
}
```

This repository is based on MMAudio.
Many thanks to:
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model and the VAE architecture
- BigVGAN
- Synchformer
- EDM2 for the magnitude-preserving VAE network architecture
- ImageBind (by Meta / FAIR) for computing ImageBind scores
  - We made the following two modifications:
    - set `num_crops` in `data.load_and_transform_video_data` to 1
    - added a `reduce` argument to `models.imagebind_model.forward` so that the caller can choose whether the returned feature tensor is averaged over the temporal dimension
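For reference, the effect of the added `reduce` flag can be sketched in plain Python. The real change lives inside ImageBind's PyTorch code; this standalone function only mirrors the semantics (the function name and list-based representation are ours):

```python
def forward_features(feats, reduce=True):
    """feats: a (time, dim) list of per-frame feature vectors.

    With reduce=True the features are averaged over the temporal
    dimension, yielding a single (dim,) vector; with reduce=False the
    full per-frame sequence is returned unchanged.
    """
    if not reduce:
        return feats
    t = len(feats)
    dim = len(feats[0])
    return [sum(frame[d] for frame in feats) / t for d in range(dim)]
```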