Fix/speecht5 bug #28481
Conversation
**Fix bug in SpeechT5 speech decoder prenet's forward method**

- Removed the redundant `repeat` operation on `speaker_embeddings` in the forward method. This line erroneously duplicated the embeddings, leading to an incorrect input size for concatenation and to performance issues.
- Maintained the original functionality of the method, ensuring the integrity of the speech decoder prenet's forward pass remains intact.
- Resolves a critical bug affecting the model's handling of speaker embeddings.

**Refactor SpeechT5 text-to-speech integration tests**

- Updated SpeechT5ForTextToSpeechIntegrationTests to accommodate the variability in sequence lengths caused by dropout in the speech decoder pre-net, making the tests robust against random variation in generated speech.
- Removed hardcoded dimensions in test assertions, replacing them with dynamic checks based on model configuration and seed settings so the tests remain valid across runs and configurations.
- Added new test cases that validate the shapes of generated spectrograms and waveforms, using seed settings for consistent, predictable behavior.
- Fixed existing test cases where incorrect assumptions about output shapes could lead to errors.
To reproduce the issue addressed by this PR, follow the Hugging Face SpeechT5 fine-tuning tutorial: during training, a concatenation error is raised due to mismatched tensor dimensions.
Hi @NimaYaqmuri, thanks for raising the issue and taking care of the fix!
The proposed solution is great, but it overlooks one configuration! I left comments to address this.
To pass all the checks, you should also make sure that the code conforms to transformers' style (`make style`).
Again, many thanks for this PR!
Also cc @sanchit-gandhi, I believe it will fix the issue raised in this forum discussion.
```python
speaker_embeddings = speaker_embeddings.unsqueeze(1)
speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
speaker_embeddings = speaker_embeddings.repeat(inputs_embeds.size(0), 1, 1)
```
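For context on why the removed line was harmful: after `expand` has already broadcast the embeddings across the sequence dimension, the extra `repeat` multiplies the batch dimension a second time, so the subsequent concatenation with `inputs_embeds` fails. A minimal shape-only reproduction (the batch size 2, sequence length 10, and hidden size 512 are illustrative, not taken from the PR):

```python
import torch

inputs_embeds = torch.zeros(2, 10, 512)      # (batch, seq_len, hidden)
speaker_embeddings = torch.zeros(2, 512)     # one embedding per sample

x = speaker_embeddings.unsqueeze(1)          # (2, 1, 512)
x = x.expand(-1, inputs_embeds.size(1), -1)  # (2, 10, 512): already batch-aligned
y = x.repeat(inputs_embeds.size(0), 1, 1)    # (4, 10, 512): batch dim doubled

# torch.cat([inputs_embeds, y], dim=-1) now raises, since the batch
# dimensions disagree (2 vs 4)
assert y.size(0) == 2 * inputs_embeds.size(0)
```

Dropping the `repeat` line leaves `x` at `(2, 10, 512)`, which concatenates cleanly with `inputs_embeds` along the feature dimension.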
The way I see it, there are two possible situations:

1. As many speaker embeddings as samples (i.e. batch size = number of speaker embeddings)
2. One-to-many speaker embeddings (i.e. one speaker embedding for every sample of the batch)

Your proposed solution addresses situation 1, and the previous solution addressed situation 2.
To be complete, we should have code that addresses both situations (and that throws an error in other cases).
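The two situations above can be sketched as a small helper (a hypothetical illustration of the dimension handling being requested, not the exact code merged in this PR):

```python
import torch

def align_speaker_embeddings(speaker_embeddings: torch.Tensor, batch_size: int) -> torch.Tensor:
    # Situation 2: one embedding shared by every sample -> repeat along the batch dim
    if speaker_embeddings.size(0) == 1:
        return speaker_embeddings.repeat(batch_size, 1)
    # Situation 1: already one embedding per sample -> nothing to do
    if speaker_embeddings.size(0) == batch_size:
        return speaker_embeddings
    # Any other count is ambiguous and should fail loudly
    raise ValueError(
        f"Expected 1 or {batch_size} speaker embeddings, got {speaker_embeddings.size(0)}"
    )
```

For example, a single `(1, 512)` embedding with a batch of 3 becomes `(3, 512)`, while a mismatched `(2, 512)` input raises a `ValueError` instead of failing later inside the prenet.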
```diff
 inputs = processor(
     text=input_text, padding="max_length", max_length=128, return_tensors="pt"
 ).to(torch_device)
-speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)
+speaker_embeddings = torch.zeros((1, 512), device=torch_device)
 # Generate spectrograms
 set_seed(555)  # Ensure deterministic behavior
 spectrograms, spectrogram_lengths = model.generate_speech(
     input_ids=inputs["input_ids"],
     speaker_embeddings=speaker_embeddings,
     attention_mask=inputs["attention_mask"],
     return_output_lengths=True,
 )
```
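For readers unfamiliar with `return_output_lengths=True`, the batched output can be sanity-checked roughly as follows (a sketch with hypothetical per-sample lengths and the usual 80 mel bins; the exact return format is assumed, not quoted from the PR):

```python
import torch

num_mel_bins = 80
lengths = [62, 45]  # hypothetical per-sample spectrogram lengths
# Batched spectrograms are padded to the longest sample in the batch
spectrograms = torch.zeros(len(lengths), max(lengths), num_mel_bins)

assert spectrograms.size(0) == len(lengths)             # one spectrogram per sample
assert spectrograms.size(1) == max(lengths)             # padded to the longest length
assert all(l <= spectrograms.size(1) for l in lengths)  # every sample fits in the padding
```

Checks of this dynamic kind are what the PR substitutes for the previously hardcoded dimensions.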
Following my previous comment, we should also test the one-to-many situation!
Thanks for your advice! I'm going to tweak the tests accordingly.
- Refined the `generate` and `generate_speech` functions in the SpeechT5 class to robustly handle two scenarios for speaker embeddings: matching the batch size (one embedding per sample) and one-to-many (a single embedding for all samples in the batch).
- Added logic to repeat the speaker embedding when a single embedding is provided for multiple samples, and to raise a ValueError for any mismatched dimensions.
- Added corresponding test cases to validate both scenarios, ensuring complete coverage for diverse speaker embedding situations.
Hello @ylacombe, @sanchit-gandhi and @ArthurZucker, I've implemented your suggestions and updated the SpeechT5 text-to-speech class. My focus was on improving how we handle speaker embeddings in two key scenarios: one embedding per sample (matching the batch size) and one-to-many (a single embedding shared by all samples in the batch).

Additionally, I've adapted the handling of speaker embedding dimensions outside the main model classes, aligning with the approach used in Microsoft's original SpeechT5 implementation. This decision avoids altering the SpeechT5 speech decoder pre-net's forward method and maintains consistency with the existing model structure. Please let me know if there are other scenarios or considerations I should account for. Your feedback is greatly appreciated. Thank you for your guidance, Nima Yaqmuri
There also appears to be a related issue, Issue #28189, which this PR's speaker embedding updates could potentially address.
Alongside the issue highlighted by @ylacombe in the Hugging Face thread "Audio Course Unit 6: Unable to Train SpeechT5", my PR also aims to resolve a similar problem outlined in "SpeechT5 Text-to-Speech Fine-Tuning Runtime Error", which seems to be related to the same root cause.
Thanks @NimaYaqmuri for iterating on this! This looks great to me!
The next step is to get a core maintainer approval; let's ask @amyeroberts for a review!
"he tells us that at this festive season of the year with christmas and rosebeaf looming before us", | ||
] | ||
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device) | ||
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device) |
Could be nice to test it with different speaker embeddings (e.g. random + seed)!
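This suggestion can be implemented with a fixed seed so the random embeddings stay reproducible across test runs; a minimal sketch (the seed value 555 mirrors the one used elsewhere in these tests, the rest is illustrative):

```python
import torch

torch.manual_seed(555)                    # fixed seed keeps the test deterministic
speaker_embeddings = torch.randn(2, 512)  # random embeddings instead of all-zeros

# Re-seeding reproduces exactly the same embeddings on a later run
torch.manual_seed(555)
assert torch.equal(speaker_embeddings, torch.randn(2, 512))
```

Random non-zero embeddings exercise the speaker-conditioning path more realistically than `torch.zeros`, while the seed keeps assertions on generated shapes stable.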
+1
Beautiful - thanks for working on this!
Happy to merge once all SpeechT5 slow tests have been run and confirmed to pass.
"he tells us that at this festive season of the year with christmas and rosebeaf looming before us", | ||
] | ||
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device) | ||
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Hi @ylacombe and @amyeroberts, thanks a lot for your valuable feedback and approvals! I've implemented the suggested changes, introducing randomized speaker embeddings with a fixed seed in the integration tests. Additionally, I've run all the relevant slow tests for SpeechT5. I appreciate your support and guidance throughout this process! Best.
Thanks again @NimaYaqmuri for this great contribution!
* Fix bug in SpeechT5 speech decoder prenet's forward method: removed the redundant `repeat` operation on `speaker_embeddings`, which erroneously duplicated the embeddings and produced an incorrect input size for concatenation, while keeping the prenet's forward pass otherwise intact.
* Refactor SpeechT5 text-to-speech integration tests: updated SpeechT5ForTextToSpeechIntegrationTests to tolerate variability in sequence lengths caused by dropout in the speech decoder pre-net, replaced hardcoded dimensions in assertions with dynamic checks based on model configuration and seed settings, added new test cases validating the shapes of generated spectrograms and waveforms, and fixed existing cases that made incorrect assumptions about output shapes.
* Enhance handling of speaker embeddings in SpeechT5: refined the `generate` and `generate_speech` functions to handle both one-embedding-per-sample and one-to-many speaker embeddings, repeating a single embedding across the batch and raising a ValueError for mismatched dimensions, with test cases covering both scenarios.
* Improve test robustness with randomized speaker embeddings.
What does this PR do?
Fixes a Critical Issue in SpeechT5 Speech Decoder Prenet and Enhances Test Suite
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ylacombe
@sanchit-gandhi
@Spycsh
Key Changes:

Critical Bug Fix in Speech Decoder Prenet: I discovered that the `repeat` operation in the speech decoder prenet's forward method was mistakenly duplicating the `speaker_embeddings` tensor. This erroneous behavior, likely an oversight in a previous contribution, resulted in incorrect tensor dimensions for concatenation, raising errors and halting the training process.

Refined Testing Approach: Alongside this fix, I have updated the SpeechT5ForTextToSpeechIntegrationTests accordingly.