Replies: 2 comments 1 reply
-
What do you mean by "incorporate stable diffusion"? Stable Diffusion is an architecture for generating images. If you mean text conditioning, that is already possible by providing text embeddings during training and sampling. Whisper is for speech recognition and this repo is for generation; the two are not directly compatible unless you have something more specific in mind.
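To illustrate what "providing text embeddings during training and sampling" can look like, here is a toy sketch in plain NumPy. Everything in it (`embed_text`, `denoise_step`, `sample`) is an illustrative stand-in, not this repo's API: a real system would use a trained text encoder (e.g. CLIP or T5) and a trained denoising network, but the flow of the conditioning vector through the sampling loop is the same.

```python
import numpy as np

def embed_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real text encoder: hash prompt bytes into a unit vector."""
    vec = np.zeros(dim)
    for i, byte in enumerate(prompt.encode("utf-8")):
        vec[i % dim] += byte
    return vec / (np.linalg.norm(vec) + 1e-8)

def denoise_step(x: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Toy 'denoiser': nudge the sample toward the conditioning embedding.
    A trained model would instead predict noise given (x, timestep, cond)."""
    return x + 0.3 * (cond - x)

def sample(prompt: str, steps: int = 20, dim: int = 8, seed: int = 0) -> np.ndarray:
    """Start from Gaussian noise and iteratively denoise, conditioned on text."""
    rng = np.random.default_rng(seed)
    cond = embed_text(prompt, dim)     # conditioning is fixed for the whole loop
    x = rng.standard_normal(dim)       # pure noise at the start
    for _ in range(steps):
        x = denoise_step(x, cond)
    return x
```

The key point is that the text embedding is computed once and fed into every denoising step; swapping the prompt changes the target the sampler converges toward.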
-
Thanks for the quick response, Flavio. Just thinking out loud, but consider the Whisper framework (speech to text). I found this, which uses diffusion. It seems like a big part of why these models perform so poorly for TTS is that they don't have much training data.
-
At first glance, https://github.com/openai/whisper seems unrelated.
I'm interested in how stable diffusion and audio could help make timbre transfer easier.
It seems like breaking audio into words and aligning them could be part of the puzzle for doing timbre transfer effectively.
Does this project intend to incorporate stable diffusion down the track? Or is that not relevant/applicable?