Text-to-Audio with Latent Diffusion Model

This is the quicktour for training a text-to-audio model with the popular and powerful generative model: Latent Diffusion Model. Specially, this recipe is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper "AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models". You can check the last part of AUDIT demos to see same text-to-audio examples.

We train this latent diffusion model in two stages:

In the first stage, we aims to obtain a high-quality VAE (called AutoencoderKL in Amphion), in order that we can project the input mel-spectrograms to an efficient, low-dimensional latent space. Specially, we train the VAE with GAN loss to improve the reconstruction quality.
In the second stage, we aims to obtain a text-controllable diffusion model (called AudioLDM in Amphion). We use U-Net architecture diffusion model, and use T5 encoder as text encoder.

There are four stages in total for training the text-to-audio model:

Data preparation and processing
Train the VAE model
Train the latent diffusion model
Inference

NOTE: You need to run every command of this recipe in the Amphion root path:
cd Amphion

Overview

# Train the VAE model
sh egs/tta/autoencoderkl/run_train.sh

# Train the latent diffusion model
sh egs/tta/audioldm/run_train.sh

# Inference
sh egs/tta/audioldm/run_inference.sh

1. Data preparation and processing

Dataset Download

We take AudioCaps as an example, AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information. You can download the dataset here.

Data Processing

Download AudioCaps dataset to [Your path to save tta dataset] and modify preprocess.processed_dir in egs/tta/.../exp_config.json.

{
  "dataset": [
    "AudioCaps"
  ],
  "preprocess": {
    // Specify the output root path to save the processed data 
    "processed_dir": "[Your path to save tta dataset]",
    ...
  }
}

The folder structure of your downloaded data should be similar to:

.../[Your path to save tta dataset]
┣ AudioCpas
┃   ┣ wav
┃   ┃   ┣ ---1_cCGK4M_0_10000.wav
┃   ┃   ┣ ---lTs1dxhU_30000_40000.wav
┃   ┃   ┣ ...

Then you may process the data to mel-specgram and save it as .npy format. If you use the data we provide, we have processed all the wav data.
Generate a json file to save the metadata, the json file is like:

[
    {
        "Dataset": "AudioCaps",
        "Uid": "---1_cCGK4M_0_10000",
        "Caption": "Idling car, train blows horn and passes"
    },
    {
        "Dataset": "AudioCaps",
        "Uid": "---lTs1dxhU_30000_40000",
        "Caption": "A racing vehicle engine is heard passing by"
    },
    ...
]

Finally, the folder structure is like:

.../[Your path to save tta dataset]
┣ AudioCpas
┃   ┣ wav
┃   ┃   ┣ ---1_cCGK4M_0_10000.wav
┃   ┃   ┣ ---lTs1dxhU_30000_40000.wav
┃   ┃   ┣ ...
┃   ┣ mel
┃   ┃   ┣ ---1_cCGK4M_0_10000.npy
┃   ┃   ┣ ---lTs1dxhU_30000_40000.npy
┃   ┃   ┣ ...
┃   ┣ train.json
┃   ┣ valid.json
┃   ┣ ...

2. Training the VAE Model

The first stage model is a VAE model trained with GAN loss (called AutoencoderKL in Amphion), run the follow commands:

sh egs/tta/autoencoderkl/run_train.sh

3. Training the Latent Diffusion Model

The second stage model is a condition diffusion model with a T5 text encoder (called AudioLDM in Amphion), run the following commands:

sh egs/tta/audioldm/run_train.sh

4. Inference

Now you can generate audio with your pre-trained latent diffusion model, run the following commands and modify the text argument.

sh egs/tta/audioldm/run_inference.sh \
--text "A man is whistling"

Citations

@article{wang2023audit,
  title={AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models},
  author={Wang, Yuancheng and Ju, Zeqian and Tan, Xu and He, Lei and Wu, Zhizheng and Bian, Jiang and Zhao, Sheng},
  journal={NeurIPS 2023},
  year={2023}
}

@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RECIPE.md

RECIPE.md

Text-to-Audio with Latent Diffusion Model

Overview

1. Data preparation and processing

Dataset Download

Data Processing

2. Training the VAE Model

3. Training the Latent Diffusion Model

4. Inference

Citations

Files

RECIPE.md

Latest commit

History

RECIPE.md

File metadata and controls

Text-to-Audio with Latent Diffusion Model

Overview

1. Data preparation and processing

Dataset Download

Data Processing

2. Training the VAE Model

3. Training the Latent Diffusion Model

4. Inference

Citations