AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high-quality, diverse, and instruction-relevant audio. Beyond generation, audio editing is equally important but has received comparatively little attention. Audio editing faces two primary challenges: executing precise edits and preserving the unedited sections. While LDM-based workflows have effectively addressed these challenges in image processing, similar approaches have scarcely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-suppression methods, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits.

🚀 Features

Pre-trained Model: Auffusion
We use the pre-trained model Auffusion for audio editing tasks.
Auffusion Repository | Model Download Link
Null-text Inversion: Ensures preservation of unedited audio portions during the editing process.
EOT-suppression: Enhances the model's ability to preserve original audio features and improves its editing capability.
Multiple audio editing operations: Supports Add, Delete, and Replace.
Easy integration with other TTA models: Plug and play with existing TTA diffuser-based models.

📀 Installation

Clone the repository:

```
git clone https://github.com/YuuhangJia/AudioEditor.git
cd AudioEditor
```

Install the required dependencies:
```
pip install -r requirements.txt
```
Set up additional environment variables or configurations (if any):
```
export YOUR_ENV_VAR=your_value
```

⭐ Usage

We will release the code for the main methods proposed in our paper immediately after its acceptance.

1️⃣ Delete

To run a simple deletion example on an audio file, use the following command:

```
python main.py --prompt "After a gunshot, there was a burst of dog barking" \
               --audio_path "audio_examples/input_audios/After a gunshot, there was a burst of dog barking.wav" \
               --token_indices "[[10,11]]" \
               --alpha "[1.,]" --cross_retain_steps "[.2,]"
```
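The `--token_indices` values in the example commands appear to count positions in the tokenized prompt, with position 0 taken by a start-of-text token. The sketch below is only an approximation under that assumption. It splits the prompt into words and punctuation marks instead of calling the model's real tokenizer, and `approximate_token_positions` is a hypothetical helper that is not part of this repository. A real subword tokenizer may split rare words into several tokens, which would shift the indices.

```python
import re


def approximate_token_positions(prompt):
    """Map assumed token positions to words or punctuation marks.

    Assumption: position 0 holds a start-of-text token, so the first
    word of the prompt sits at position 1. This is a word-level
    approximation, not the tokenizer used by the model.
    """
    # Words and standalone punctuation marks each count as one piece.
    pieces = re.findall(r"\w+|[^\w\s]", prompt.lower())
    return {index + 1: piece for index, piece in enumerate(pieces)}


positions = approximate_token_positions(
    "After a gunshot, there was a burst of dog barking"
)
for index, piece in positions.items():
    print(index, piece)
```

Under this approximation, positions 10 and 11 correspond to "dog" and "barking", which is consistent with the indices passed in the deletion command.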

2️⃣ Replace

To run a simple replacement example on an audio file, use the following command:

```
python main.py --prompt "After a thunder, there was a burst of dog barking" \
               --audio_path "audio_examples/input_audios/After a gunshot, there was a burst of dog barking.wav" \
               --token_indices "[[3]]" \
               --alpha "[-0.001,]" --cross_retain_steps "[.2,]"
```

3️⃣ Add

To run a simple addition example on an audio file, use the following command:

```
python main.py --prompt "A woman is giving a speech amid applause" \
               --audio_path "audio_examples/input_audios/A woman is giving a speech.wav" \
               --token_indices "[[7,8]]" \
               --alpha "[-0.001,]" --cross_retain_steps "[.2,]"
```
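The list-valued flags in the commands above are passed as strings such as `"[[7,8]]"` or `"[.2,]"`. One plausible way to turn such strings into Python lists is `ast.literal_eval`. How `main.py` actually parses them is not shown here, so `parse_list_flag` below is a hypothetical helper used only for illustration.

```python
import ast


def parse_list_flag(raw):
    """Parse a list-valued flag string into a Python list.

    Assumption: the flag value is a valid Python list literal.
    literal_eval only evaluates literals, so it does not run code.
    """
    value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {raw!r}")
    return value


token_indices = parse_list_flag("[[7,8]]")
alpha = parse_list_flag("[-0.001,]")
cross_retain_steps = parse_list_flag("[.2,]")
print(token_indices, alpha, cross_retain_steps)
```

The trailing comma in values like `"[1.,]"` is valid Python syntax, so each of these strings parses to a one-element list.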

📐 Quantitative comparison

Objective Evaluation Results

| Edit model | Edit type | CLAP ↑ (overall quality) | IS ↑ (overall quality) | FD ↓ (vs. Regenerated_wavs) | FAD ↓ (vs. Regenerated_wavs) | KL ↓ (vs. Regenerated_wavs) | FD ↓ (vs. Original_wavs) | FAD ↓ (vs. Original_wavs) | KL ↓ (vs. Original_wavs) |
|---|---|---|---|---|---|---|---|---|---|
| Original_wavs | add | 51.4% | 5.64 | 44.71 | 5.28 | 1.78 | - | - | - |
| | delete | 51.5% | 4.26 | 51.82 | 6.16 | 1.85 | - | - | - |
| | replace | 41.6% | 4.41 | 69.92 | 7.88 | 4.56 | - | - | - |
| | Average | 48.2% | 4.77 | 55.48 | 6.45 | 2.73 | - | - | - |
| Regenerated_wavs | add | 59.7% | 5.96 | - | - | - | 44.71 | 5.28 | 1.36 |
| | delete | 59.1% | 4.47 | - | - | - | 51.82 | 6.16 | 2.39 |
| | replace | 58.9% | 5.13 | - | - | - | 69.92 | 7.88 | 4.09 |
| | Average | 59.2% | 5.19 | - | - | - | 55.48 | 6.45 | 2.61 |
| SDEdit (baseline) | add | 58.4% | 6.36 | 27.89 | 2.74 | 0.79 | 36.74 | 3.08 | 1.08 |
| | delete | 53.3% | 5.31 | 55.12 | 6.65 | 1.78 | 40.43 | 6.95 | 0.88 |
| | replace | 58.6% | 4.99 | 29.76 | 3.24 | 0.80 | 55.21 | 7.00 | 3.40 |
| | Average | 56.8% | 5.55 | 37.59 | 4.21* | 1.12* | 44.13* | 5.68* | 1.79 |
| AudioEditor (ours) | add | 59.4% | 6.16 | 27.83 | 2.41 | 0.85 | 40.00 | 3.52 | 1.27 |
| | delete | 54.1% | 4.75 | 52.56 | 5.02 | 1.54 | 37.16 | 4.91 | 1.05 |
| | replace | 58.1% | 5.14 | 28.80 | 3.34 | 0.79 | 59.46 | 7.52 | 3.73 |
| | Average | 57.6%* | 5.19* | 37.63* | 3.27 | 1.07 | 43.48 | 4.95 | 1.93* |

* indicates a suboptimal value, which for certain metrics may nonetheless be more desirable than the optimal one.

🤝🏻 Contact

Should you have any questions, please contact 2120240729@mail.nankai.edu.cn

📚 Citation

Coming soon.

🐍 License

The code in this repository is licensed under the MIT License for academic and other non-commercial uses.

🙏 Acknowledgment:

This code is based on the P2P, Null-text, SuppressEOT, and Auffusion repositories.
