
Fix/speecht5 bug #28481

Merged
merged 9 commits into huggingface:main from fix/speecht5-bug on Jan 16, 2024

Conversation

NimaYaqmuri
Contributor

What does this PR do?

Fixes a Critical Issue in SpeechT5 Speech Decoder Prenet and Enhances Test Suite

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ylacombe
@sanchit-gandhi
@Spycsh

Key Changes:

  • Critical Bug Fix in Speech Decoder Prenet: I discovered that the repeat operation in the speech decoder prenet's forward method was mistakenly duplicating the speaker_embeddings tensor. This erroneous behavior, likely an oversight in a previous contribution, resulted in incorrect tensor dimensions for concatenation, raising errors and halting the training process (a sketch of the fix follows this list).

  • Refined Testing Approach: Alongside this fix, I have updated the SpeechT5ForTextToSpeechIntegrationTests. These updates include:

    • Adaptability to Variability in Sequence Lengths: Modifications to handle variability due to dropout in the speech decoder pre-net, ensuring test reliability against random variations.
    • Dynamic Dimension Checks: Replacement of hardcoded dimensions with dynamic checks based on the model's configuration and seed settings, ensuring test validity across various scenarios.
    • New and Improved Test Cases: Introduction of new test cases for validation of spectrogram and waveform shapes, addressing potential issues in speech generation and vocoder processing.
    • Correction of Misassumptions in Tests: Adjustment of existing test cases where previous assumptions about output shapes led to inaccuracies. This includes considering varying batch sizes in tests, which were not adequately addressed before, possibly due to an oversight in considering the speaker embeddings' shape (initially 1x512) in batch scenarios.
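For context, a minimal, self-contained sketch of the corrected concatenation is below. The helper name prenet_concat is hypothetical; the normalize call mirrors the real prenet, but treat the details as an illustration rather than the exact merged code.

import torch

def prenet_concat(inputs_embeds: torch.Tensor, speaker_embeddings: torch.Tensor) -> torch.Tensor:
    # inputs_embeds: (batch, seq_len, hidden); speaker_embeddings: (batch, 512)
    speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings)
    speaker_embeddings = speaker_embeddings.unsqueeze(1)  # (batch, 1, 512)
    speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
    # The removed buggy line was:
    #     speaker_embeddings = speaker_embeddings.repeat(inputs_embeds.size(0), 1, 1)
    # It multiplied the batch dimension by batch_size a second time, yielding a
    # (batch**2, seq_len, 512) tensor, so the concatenation below raised an error
    # whenever batch_size > 1 (which is why training, unlike single-sample
    # inference, failed).
    return torch.cat([inputs_embeds, speaker_embeddings], dim=-1)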

- Removed redundant `repeat` operation on speaker_embeddings in the forward method. This line was erroneously duplicating the embeddings, leading to incorrect input size for concatenation and performance issues.
- Maintained original functionality of the method, ensuring the integrity of the speech decoder prenet's forward pass remains intact.
- This change resolves a critical bug affecting the model's performance in handling speaker embeddings.
- Updated SpeechT5ForTextToSpeechIntegrationTests to accommodate the variability in sequence lengths due to dropout in the speech decoder pre-net. This change ensures that our tests are robust against random variations in generated speech, enhancing the reliability of our test suite.
- Removed hardcoded dimensions in test assertions. Replaced with dynamic checks based on model configuration and seed settings, ensuring tests remain valid across different runs and configurations.
- Added new test cases to thoroughly validate the shapes of generated spectrograms and waveforms. These tests leverage seed settings to ensure consistent and predictable behavior in testing, addressing potential issues in speech generation and vocoder processing.
- Fixed existing test cases where incorrect assumptions about output shapes led to potential errors.
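As an illustration, a dynamic shape check can look like the sketch below. This is only a sketch: self, model, vocoder, inputs, and speaker_embeddings are assumed to come from the surrounding test, and the 256x factor matches the default SpeechT5HifiGan upsampling but is an assumption here, not a quote from the PR.

spectrogram = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
)
# The mel-bin count comes from the config instead of a hardcoded 80.
self.assertEqual(spectrogram.shape[-1], model.config.num_mel_bins)
waveform = vocoder(spectrogram)
# The waveform length is tied to the number of spectrogram frames by the
# vocoder's total upsampling factor (256 with the default upsample_rates).
self.assertEqual(waveform.shape[-1], spectrogram.shape[0] * 256)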
@NimaYaqmuri
Contributor Author

To observe the issue addressed by this PR, follow the Hugging Face SpeechT5 fine-tuning tutorial: during training, you will encounter a concatenation error caused by incorrect tensor dimensions.

Contributor

@ylacombe ylacombe left a comment


Hi @NimaYaqmuri, thanks for raising the issue and taking care of the fix!

The proposed solution is great, but it overlooks one configuration! I left comments to address this!

To pass all the checks, you should also make sure the code conforms to transformers' style (make style).

Again, many thanks for this PR!

Also cc @sanchit-gandhi, I believe it will fix this forum discussion

Comment on lines -698 to -700
speaker_embeddings = speaker_embeddings.unsqueeze(1)
speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
speaker_embeddings = speaker_embeddings.repeat(inputs_embeds.size(0), 1, 1)
Contributor


The way I see it, there are two possible situations:

  1. As many speaker embeddings as samples (i.e. batch size = number of speaker embeddings)
  2. One-to-many (i.e. one speaker embedding for every sample of the batch)

Your proposed solution addresses situation 1, and the previous solution addressed situation 2.

To be complete, we should have code that addresses both situations! (and that throws an error in other cases)
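A minimal sketch of that kind of handling, as a hypothetical standalone helper (not the exact code that ended up merged):

import torch

def broadcast_speaker_embeddings(speaker_embeddings: torch.Tensor, batch_size: int) -> torch.Tensor:
    # Situation 2: one embedding for the whole batch -> repeat it per sample.
    if speaker_embeddings.size(0) == 1:
        return speaker_embeddings.repeat(batch_size, 1)
    # Situation 1: one embedding per sample -> use as-is.
    if speaker_embeddings.size(0) == batch_size:
        return speaker_embeddings
    # Anything else is ambiguous, so fail loudly.
    raise ValueError(
        f"Expected 1 or {batch_size} speaker embeddings, got {speaker_embeddings.size(0)}."
    )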

Comment on lines 1082 to 1094
inputs = processor(
    text=input_text, padding="max_length", max_length=128, return_tensors="pt"
).to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)

speaker_embeddings = torch.zeros((1, 512), device=torch_device)
# Generate spectrograms
set_seed(555)  # Ensure deterministic behavior
spectrograms, spectrogram_lengths = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
    attention_mask=inputs["attention_mask"],
    return_output_lengths=True,
)
Contributor


Following my previous comment, we should also test the one-to-many situation!

@NimaYaqmuri
Contributor Author

Thanks for your advice! I'm going to tweak generate_speech to handle these two cases.

NimaYaqmuri and others added 3 commits January 12, 2024 20:33
- Refined the generate and generate_speech functions in the SpeechT5 class to robustly handle two scenarios for speaker embeddings: matching the batch size (one embedding per sample) and one-to-many (a single embedding for all samples in the batch).
- The update includes logic to repeat the speaker embedding when a single embedding is provided for multiple samples, and a ValueError is raised for any mismatched dimensions.
- Also added corresponding test cases to validate both scenarios, ensuring complete coverage and functionality for diverse speaker embedding situations.
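For readers following along, a usage-level sketch of the one-to-many case after this change (the checkpoint name follows the standard SpeechT5 docs; the exact call is illustrative, not a quote from the PR):

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, set_seed

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text=["first sentence", "second sentence"], padding=True, return_tensors="pt")
set_seed(555)
# A single (1, 512) embedding is now broadcast across the batch of two samples;
# previously this raised a dimension-mismatch error during concatenation.
speaker_embeddings = torch.randn((1, 512))
spectrograms, lengths = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
    attention_mask=inputs["attention_mask"],
    return_output_lengths=True,
)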
@NimaYaqmuri
Contributor Author

NimaYaqmuri commented Jan 12, 2024

Hello @ylacombe, @sanchit-gandhi, and @ArthurZucker,

I've implemented your suggestions and made updates to the SpeechT5 Text to Speech class. My focus was on improving how we handle speaker embeddings in two key scenarios:

  1. Matching Batch Size: When there's one embedding per sample.
  2. One-to-Many: When a single embedding is used for all samples in the batch.

Key updates include:

  • Embedding Replication: Introduced logic to replicate a single speaker embedding across multiple samples when necessary.
  • Error Handling: Implemented a ValueError to alert for dimension mismatches in speaker embeddings.
  • Testing: Added comprehensive test cases to ensure robust functionality across both scenarios.

Additionally, I've adapted the handling of speaker embedding dimensions outside the main model classes, aligning with the approach used in the original SpeechT5 implementation by Microsoft. This decision avoids altering the SpeechT5 speech decoder's pre-net forward method, maintaining consistency with the existing model structure.

Please let me know if there are other scenarios or considerations I should account for. Your feedback is greatly appreciated.

Thank you for your guidance,

Nima Yaqmuri

@NimaYaqmuri
Contributor Author

It seems there exists a related issue, Issue #28189, which my PR's speaker embedding updates could potentially address.

@NimaYaqmuri
Contributor Author

Alongside the issue highlighted by @ylacombe in the Hugging Face thread "Audio Course Unit 6: Unable to Train SpeechT5", my PR also aims to resolve a similar problem outlined in "SpeechT5 Text-to-Speech Fine-Tuning Runtime Error", which seems to be related to the solution.

Contributor

@ylacombe ylacombe left a comment


Thanks @NimaYaqmuri for iterating on this! This looks great to me!

The next step is to get a core maintainer's approval; let's ask @amyeroberts for a review!

"he tells us that at this festive season of the year with christmas and rosebeaf looming before us",
]
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)
Contributor


Could be nice to test it with different speaker embeddings (e.g. random + seed)!
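A sketch of that suggestion (input_text and torch_device are assumed from the surrounding test; set_seed is the transformers helper already used in these tests):

import torch
from transformers import set_seed

set_seed(555)
# Random but reproducible embeddings exercise the model more than all-zeros.
speaker_embeddings = torch.randn((len(input_text), 512), device=torch_device)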

Collaborator


+1

Collaborator

@amyeroberts amyeroberts left a comment


Beautiful - thanks for working on this!

Happy to merge once all the speecht5 slow tests have been run and confirmed to pass.

"he tells us that at this festive season of the year with christmas and rosebeaf looming before us",
]
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)
Collaborator


+1

@NimaYaqmuri
Contributor Author

Hi @ylacombe and @amyeroberts,

Thanks a lot for your valuable feedback and approvals!

I've implemented the suggested changes, introducing randomized speaker embeddings with a fixed seed in the speecht5 tests.

Additionally, I've run all the relevant slow tests for speecht5 on my end, and everything works as expected. It seems we're ready for the merge.

I appreciate your support and guidance throughout this process!

Best.

@amyeroberts amyeroberts merged commit 07ae53e into huggingface:main Jan 16, 2024
18 checks passed
@amyeroberts
Collaborator

Thanks again @NimaYaqmuri for this great contribution!

@NimaYaqmuri NimaYaqmuri deleted the fix/speecht5-bug branch January 16, 2024 15:26
wgifford pushed a commit to wgifford/transformers that referenced this pull request Jan 21, 2024
AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request Jan 22, 2024