We provide a beginner recipe demonstrating how to train a cutting-edge TTA model. Specifically, it is designed as a latent diffusion model, like AudioLDM, Make-an-Audio, and AUDIT.
Amphion currently supports a latent-diffusion-based text-to-audio model:
Similar to AUDIT, we implement it with two-stage training:

- Training the VAE, which is called `AutoencoderKL` in Amphion.
- Training the conditional latent diffusion model, which is called `AudioLDM` in Amphion.
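To make the two stages concrete, here is a minimal NumPy sketch of the data flow they train: stage 1 compresses a mel spectrogram into a low-dimensional latent (and reconstructs it), and stage 2 adds noise to that latent and learns to predict the noise given a text condition. All shapes, weights, and the placeholder "model" below are illustrative assumptions, not Amphion's actual `AutoencoderKL` or `AudioLDM` implementations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; Amphion's real configs differ.
mel_bins, frames, latent_dim = 80, 64, 8

# --- Stage 1: VAE (AutoencoderKL-style) compresses mel frames to latents ---
enc_w = rng.standard_normal((mel_bins, latent_dim)) * 0.05  # fake encoder weights
dec_w = rng.standard_normal((latent_dim, mel_bins)) * 0.05  # fake decoder weights

mel = rng.standard_normal((frames, mel_bins))        # stand-in mel spectrogram
latent = mel @ enc_w                                 # encode: (frames, latent_dim)
recon = latent @ dec_w                               # decode back to mel space
recon_loss = float(np.mean((mel - recon) ** 2))      # stage-1 reconstruction objective

# --- Stage 2: diffusion in latent space, conditioned on text ---
t = 0.5                                              # a diffusion timestep in [0, 1]
noise = rng.standard_normal(latent.shape)
noisy_latent = np.sqrt(1 - t) * latent + np.sqrt(t) * noise  # forward (noising) process

# A real model predicts `noise` from (noisy_latent, t, text_embedding);
# the stage-2 loss is the MSE between predicted and true noise.
predicted_noise = noisy_latent                       # placeholder "model"
diffusion_loss = float(np.mean((predicted_noise - noise) ** 2))

print(latent.shape, noisy_latent.shape)
```

Because the diffusion model only ever sees the compact latents, not raw spectrograms, the second stage is much cheaper to train, which is the motivation for this two-stage design.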