Unconditional model generates okay-quality fake human voice but fails on music #80
Description
Hi, I've been playing with this diffusion model library for a few days. It's great to have a library that lets ordinary users train on audio data with limited resources.
I have a question about the training data and the output. I fed the unconditional model Mozilla's Common Voice dataset, using only one language (about 15k clips). I resampled the clips to 44.1 kHz and padded them to 2^18 samples per file if they were shorter. The unconditional results were okay: I could at least tell it was a human speaking, although the words were never actually intelligible.
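For reference, my preprocessing looks roughly like this (a minimal sketch using torchaudio; the mono mixdown and crop-to-length steps are simplifications of what my actual script does):

```python
import torch
import torch.nn.functional as F
import torchaudio

TARGET_SR = 44_100
TARGET_LEN = 2**18  # per-file length for the speech runs

def load_clip(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)          # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)      # mix down to mono
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    if wav.shape[-1] < TARGET_LEN:
        # right-pad short clips with silence up to the target length
        wav = F.pad(wav, (0, TARGET_LEN - wav.shape[-1]))
    else:
        # crop long clips to the target length
        wav = wav[..., :TARGET_LEN]
    return wav
```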
But when I replace the training data with music (mostly solo piano, same sample rate but 2^17 samples per input tensor), the model does not generate outputs that sound like piano; they are mostly noise.
I used the same layer configuration for both models, and tried lowering the downsampling factors and increasing the number of attention heads (roughly sketched below), but saw no significant difference. Any tips on why this happens?
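To be concrete, these are the kinds of knobs I varied between runs (parameter names are illustrative, from memory, not the library's exact arguments):

```python
# Illustrative only -- paraphrasing the hyperparameters I changed,
# not the library's actual constructor signature.
model_config = dict(
    sample_rate=44_100,
    length=2**17,                    # 2**18 for the speech runs
    channels=[128, 256, 512, 1024],  # per-stage widths, kept the same in both runs
    factors=[4, 4, 4, 2],            # downsampling factors; tried lowering,
                                     # e.g. to [2, 2, 2, 2]
    attention_heads=8,               # tried increasing, e.g. to 16
)
```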