
Fix/speecht5 bug #28481

Merged
merged 9 commits into huggingface:main from fix/speecht5-bug on Jan 16, 2024

Conversation

NimaYaqmuri
Contributor

What does this PR do?

Fixes a Critical Issue in SpeechT5 Speech Decoder Prenet and Enhances Test Suite

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ylacombe
@sanchit-gandhi
@Spycsh

Key Changes:

  • Critical Bug Fix in Speech Decoder Prenet: I discovered that the repeat operation in the speech decoder prenet's forward method was mistakenly duplicating the speaker_embeddings tensor. This erroneous behavior, likely an oversight in a previous contribution, resulted in incorrect tensor dimensions for concatenation, raising errors and halting the training process (a sketch of the fix follows this list).

  • Refined Testing Approach: Alongside this fix, I have updated the SpeechT5ForTextToSpeechIntegrationTests. These updates include:

    • Adaptability to Variability in Sequence Lengths: Modifications to handle variability due to dropout in the speech decoder pre-net, ensuring test reliability against random variations.
    • Dynamic Dimension Checks: Replacement of hardcoded dimensions with dynamic checks based on the model's configuration and seed settings, ensuring test validity across various scenarios.
    • New and Improved Test Cases: Introduction of new test cases for validation of spectrogram and waveform shapes, addressing potential issues in speech generation and vocoder processing.
    • Correction of Misassumptions in Tests: Adjustment of existing test cases where previous assumptions about output shapes led to inaccuracies. This includes considering varying batch sizes in tests, which were not adequately addressed before, possibly due to an oversight in considering the speaker embeddings' shape (initially 1x512) in batch scenarios.
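For context, a minimal, self-contained sketch of the corrected concatenation is below. The helper name prenet_concat is hypothetical; the normalize call mirrors the real prenet, but treat the details as an illustration rather than the exact merged code.

import torch

def prenet_concat(inputs_embeds: torch.Tensor, speaker_embeddings: torch.Tensor) -> torch.Tensor:
    # inputs_embeds: (batch, seq_len, hidden); speaker_embeddings: (batch, 512)
    speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings)
    speaker_embeddings = speaker_embeddings.unsqueeze(1)  # (batch, 1, 512)
    speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
    # The removed buggy line was:
    #     speaker_embeddings = speaker_embeddings.repeat(inputs_embeds.size(0), 1, 1)
    # It multiplied the batch dimension by batch_size a second time, yielding a
    # (batch**2, seq_len, 512) tensor, so the concatenation below raised an error
    # whenever batch_size > 1 (which is why training, unlike single-sample
    # inference, failed).
    return torch.cat([inputs_embeds, speaker_embeddings], dim=-1)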

- Removed redundant `repeat` operation on speaker_embeddings in the forward method. This line was erroneously duplicating the embeddings, leading to incorrect input size for concatenation and performance issues.
- Maintained original functionality of the method, ensuring the integrity of the speech decoder prenet's forward pass remains intact.
- This change resolves a critical bug affecting the model's performance in handling speaker embeddings.
- Updated SpeechT5ForTextToSpeechIntegrationTests to accommodate the variability in sequence lengths due to dropout in the speech decoder pre-net. This change ensures that our tests are robust against random variations in generated speech, enhancing the reliability of our test suite.
- Removed hardcoded dimensions in test assertions. Replaced with dynamic checks based on model configuration and seed settings, ensuring tests remain valid across different runs and configurations.
- Added new test cases to thoroughly validate the shapes of generated spectrograms and waveforms. These tests leverage seed settings to ensure consistent and predictable behavior in testing, addressing potential issues in speech generation and vocoder processing.
- Fixed existing test cases where incorrect assumptions about output shapes led to potential errors.
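As an illustration, a dynamic shape check can look like the sketch below. This is only a sketch: self, model, vocoder, inputs, and speaker_embeddings are assumed to come from the surrounding test, and the 256x factor matches the default SpeechT5HifiGan upsampling but is an assumption here, not a quote from the PR.

spectrogram = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
)
# The mel-bin count comes from the config instead of a hardcoded 80.
self.assertEqual(spectrogram.shape[-1], model.config.num_mel_bins)
waveform = vocoder(spectrogram)
# The waveform length is tied to the number of spectrogram frames by the
# vocoder's total upsampling factor (256 with the default upsample_rates).
self.assertEqual(waveform.shape[-1], spectrogram.shape[0] * 256)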
@NimaYaqmuri
Contributor Author

To observe the issue addressed by this PR, follow the Hugging Face SpeechT5 fine-tuning tutorial: during training, you will encounter a concatenation error caused by incorrect tensor dimensions.

Contributor

@ylacombe ylacombe left a comment


Hi @NimaYaqmuri, thanks for raising the issue and taking care of the fix!

The proposed solution is great, but it overlooks one configuration! I left comments to address this!

To pass all the checks, you should also make sure the code conforms to transformers' style (make style).

Again, many thanks for this PR!

Also cc @sanchit-gandhi, I believe it will fix this forum discussion

Comment on lines -698 to -700
speaker_embeddings = speaker_embeddings.unsqueeze(1)
speaker_embeddings = speaker_embeddings.expand(-1, inputs_embeds.size(1), -1)
speaker_embeddings = speaker_embeddings.repeat(inputs_embeds.size(0), 1, 1)
Contributor


The way I see it, there are two possible situations:

  1. As many speaker embeddings as samples (i.e. batch size = number of speaker embeddings)
  2. One-to-many (i.e. one speaker embedding for every sample of the batch)

Your proposed solution addresses situation 1, and the previous solution addressed situation 2.

To be complete, we should have code that addresses both situations! (and that throws an error in other cases)
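A minimal sketch of that kind of handling, as a hypothetical standalone helper (not the exact code that ended up merged):

import torch

def broadcast_speaker_embeddings(speaker_embeddings: torch.Tensor, batch_size: int) -> torch.Tensor:
    # Situation 2: one embedding for the whole batch -> repeat it per sample.
    if speaker_embeddings.size(0) == 1:
        return speaker_embeddings.repeat(batch_size, 1)
    # Situation 1: one embedding per sample -> use as-is.
    if speaker_embeddings.size(0) == batch_size:
        return speaker_embeddings
    # Anything else is ambiguous, so fail loudly.
    raise ValueError(
        f"Expected 1 or {batch_size} speaker embeddings, got {speaker_embeddings.size(0)}."
    )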

Comment on lines 1082 to 1094
inputs = processor(
    text=input_text, padding="max_length", max_length=128, return_tensors="pt"
).to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)

speaker_embeddings = torch.zeros((1, 512), device=torch_device)
# Generate spectrograms
set_seed(555)  # Ensure deterministic behavior
spectrograms, spectrogram_lengths = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
    attention_mask=inputs["attention_mask"],
    return_output_lengths=True,
)
Contributor


Following my previous comment, we should also test the one-to-many situation!

@NimaYaqmuri
Contributor Author

Thanks for your advice! I'm going to tweak generate_speech to handle these two cases.

NimaYaqmuri and others added 3 commits January 12, 2024 20:33
- Refined the generate and generate_speech functions in the SpeechT5 class to robustly handle two scenarios for speaker embeddings: matching the batch size (one embedding per sample) and one-to-many (a single embedding for all samples in the batch).
- The update includes logic to repeat the speaker embedding when a single embedding is provided for multiple samples, and a ValueError is raised for any mismatched dimensions.
- Also added corresponding test cases to validate both scenarios, ensuring complete coverage and functionality for diverse speaker embedding situations.
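For readers following along, a usage-level sketch of the one-to-many case after this change (the checkpoint name follows the standard SpeechT5 docs; the exact call is illustrative, not a quote from the PR):

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, set_seed

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text=["first sentence", "second sentence"], padding=True, return_tensors="pt")
set_seed(555)
# A single (1, 512) embedding is now broadcast across the batch of two samples;
# previously this raised a dimension-mismatch error during concatenation.
speaker_embeddings = torch.randn((1, 512))
spectrograms, lengths = model.generate_speech(
    input_ids=inputs["input_ids"],
    speaker_embeddings=speaker_embeddings,
    attention_mask=inputs["attention_mask"],
    return_output_lengths=True,
)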
@NimaYaqmuri
Contributor Author

NimaYaqmuri commented Jan 12, 2024

Hello @ylacombe, @sanchit-gandhi, and @ArthurZucker,

I've implemented your suggestions and made updates to the SpeechT5 Text to Speech class. My focus was on improving how we handle speaker embeddings in two key scenarios:

  1. Matching Batch Size: When there's one embedding per sample.
  2. One-to-Many: When a single embedding is used for all samples in the batch.

Key updates include:

  • Embedding Replication: Introduced logic to replicate a single speaker embedding across multiple samples when necessary.
  • Error Handling: Implemented a ValueError to alert for dimension mismatches in speaker embeddings.
  • Testing: Added comprehensive test cases to ensure robust functionality across both scenarios.

Additionally, I've adapted the handling of speaker embedding dimensions outside the main model classes, aligning with the approach used in the original SpeechT5 implementation by Microsoft. This decision avoids altering the SpeechT5 speech decoder's pre-net forward method, maintaining consistency with the existing model structure.

Please let me know if there are other scenarios or considerations I should account for. Your feedback is greatly appreciated.

Thank you for your guidance,

Nima Yaqmuri

@NimaYaqmuri
Contributor Author

It seems there exists a related issue, Issue #28189, which my PR's speaker embedding updates could potentially address.

@NimaYaqmuri
Contributor Author

Alongside the issue highlighted by @ylacombe in the Hugging Face thread "Audio Course Unit 6: Unable to Train SpeechT5", my PR also aims to resolve a similar problem outlined in "SpeechT5 Text-to-Speech Fine-Tuning Runtime Error", which seems to be related to the solution.

Contributor

@ylacombe ylacombe left a comment


Thanks @NimaYaqmuri for iterating on this! This looks great to me!

The next step is to get a core maintainer's approval; let's ask @amyeroberts for a review!

"he tells us that at this festive season of the year with christmas and rosebeaf looming before us",
]
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)
Contributor


Could be nice to test it with different speaker embeddings (e.g. random + seed)!
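A sketch of that suggestion (input_text and torch_device are assumed from the surrounding test; set_seed is the transformers helper already used in these tests):

import torch
from transformers import set_seed

set_seed(555)
# Random but reproducible embeddings exercise the model more than all-zeros.
speaker_embeddings = torch.randn((len(input_text), 512), device=torch_device)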

Collaborator


+1

Collaborator

@amyeroberts amyeroberts left a comment


Beautiful - thanks for working on this!

Happy to merge once all the speecht5 slow tests have been run and confirmed to pass.

"he tells us that at this festive season of the year with christmas and rosebeaf looming before us",
]
inputs = processor(text=input_text, padding="max_length", max_length=128, return_tensors="pt").to(torch_device)
speaker_embeddings = torch.zeros((len(input_text), 512), device=torch_device)
Collaborator


+1

@NimaYaqmuri
Contributor Author

Hi @ylacombe and @amyeroberts,

Thanks a lot for your valuable feedback and approvals!

I've implemented the suggested changes, introducing randomized speaker embeddings with a fixed seed in the speecht5 tests.

Additionally, I've run all the relevant slow tests for speecht5 on my end, and everything works as expected. It seems we're ready for the merge.

I appreciate your support and guidance throughout this process!

Best.

@amyeroberts amyeroberts merged commit 07ae53e into huggingface:main Jan 16, 2024
18 checks passed
@amyeroberts
Collaborator

Thanks again @NimaYaqmuri for this great contribution!

@NimaYaqmuri NimaYaqmuri deleted the fix/speecht5-bug branch January 16, 2024 15:26
wgifford pushed a commit to wgifford/transformers that referenced this pull request Jan 21, 2024
AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request Jan 22, 2024