Hello,
I have a question about the current pitch models, specifically the differences between Reflow and DDPM. With the latest update, Reflow seems to have become the default and recommended setting for training acoustic and variance models. Reflow is very fast (faster than DDPM), but that speed appears to come at the cost of quality.
I've conducted multiple experiments with my dataset of three speakers (a soprano, a mezzo-soprano, and a tenor), each with approximately three hours of Japanese singing data, using the multispeaker method. Unfortunately, the experiments using Reflow for the pitch models have been inconsistent in my experience. The speakers are all very expressive and stylized in their singing, which is rarely reflected in the results. I've tried different batch sizes, maximum steps, step sizes, and switched between L1 and L2 loss functions, but none of these adjustments have produced the desired results. Specifically, I find that Reflow does not accurately replicate the singers' styles. The resulting F0 is relatively flat, with little variation or randomness, and the singing style feels "safe" with minimal vibrato, even when the singer uses vibrato frequently.
In contrast, my experiments with DDPM have yielded much clearer and more accurate results that replicate the singers' styles better. DDPM seems to learn the fine detail of the pitch curves more carefully than Reflow does.
My question is: what could explain this difference in results between the two diffusion types? Might DDPM be better suited to highly stylized and random singing, especially with L2 loss, which penalizes large outliers more heavily? Is Reflow better suited to singing that is less random?
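To make the comparison concrete, here is a minimal sketch of how I understand the two training objectives and the two loss functions. It is a toy in pure Python, not the repository's actual code: the function names, the toy F0 contour, and the fixed `alpha_bar` value are all assumptions made for illustration.

```python
import math
import random

def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error; large deviations are penalized quadratically."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def reflow_training_pair(data, noise, t):
    """Rectified flow: x_t lies on the straight line from noise (t=0) to
    data (t=1); the regression target is the constant velocity data - noise."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

def ddpm_training_pair(data, noise, alpha_bar):
    """DDPM: x_t mixes data and noise according to the cumulative schedule
    value alpha_bar; the regression target is the injected noise itself."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    x_t = [a * d + b * n for n, d in zip(noise, data)]
    return x_t, list(noise)

def euler_sample(velocity_fn, noise, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical toy F0 contour (in semitones) with a vibrato-like wobble.
random.seed(0)
f0 = [60.0 + 0.5 * math.sin(2.0 * math.pi * k / 8.0) for k in range(16)]
noise = [random.gauss(0.0, 1.0) for _ in f0]

x_t_flow, v_target = reflow_training_pair(f0, noise, t=0.5)
x_t_ddpm, eps_target = ddpm_training_pair(f0, noise, alpha_bar=0.5)

# With a perfect velocity predictor, a few Euler steps already recover the data.
recovered = euler_sample(lambda x, t: v_target, noise, steps=4)
print(l1_loss(recovered, f0), l2_loss(recovered, f0))
```

In a real model the velocity or noise would come from a trained network rather than from the known target, so any error in the learned velocity field carries through the few large Euler steps. Whether that accounts for the flat F0 I am seeing is exactly what I am unsure about.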
Thank you in advance.