-
Notifications
You must be signed in to change notification settings - Fork 31.3k
Add dithering to the Speech2TextFeatureExtractor API.
#34638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7706bae to
668bf55
Compare
|
Hello, is there somebody to look into this for a review ? |
|
cc @ylacombe @eustlb @Vaibhavs10 regarding this proposed feature contribution |
Vaibhavs10
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be supportive of this generally. For more context @KarelVesely84 - How are you using this in training with Kaldi and for which models? Are you using the example training/ fine-tuning script?
|
Hello, |
- in kaldi : https://github.com/kaldi-asr/kaldi/blob/4a8b7f673275597fef8a15b160124bd0985b59bd/src/feat/feature-window.cc#L145 - with dithering without a seed, the features become non-deterministic due to small Gaussian noise added to the audio (i.e. 2 runs lead to little different outputs)
668bf55 to
696c984
Compare
ylacombe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @KarelVesely84, thanks for adding this, it looks great to me!
You could maybe add a small test to the feature extraction tests of speech2text, just in case, to ensure that we get the expected results with a set seed.
src/transformers/audio_utils.py
Outdated
| Add dithering (add small Gaussian noise to each frame). | ||
| E.g. use 4 to add dithering, 0.0 means no dithering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit)
| Add dithering (add small Gaussian noise to each frame). | |
| E.g. use 4 to add dithering, 0.0 means no dithering. | |
| Adds dithering. In other words, adds a small Gaussian noise to each frame. | |
| E.g. use 4.0 to add dithering with a normal distribution centered around 0.0 with standard deviation 4.0. 0.0 means no dithering. |
src/transformers/audio_utils.py
Outdated
| onesided (`bool`, *optional*, defaults to `True`): | ||
| If True, returns a one-sided spectrogram for real input signals. | ||
| dither (`float`): | ||
| Add dithering (add small Gaussian noise to each frame). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion as above here and for the rest of the PR!
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Hello, Where should be the test of feature extractor located in the code ? Best, |
- add dithering also for WhisperFeatureExtractor - not adding to Wav2Vec2FeatureExtractor (no FBANK computation)
|
Hello, It is a dummy example, so training is without dithering and inference is with dithering. |
|
I see the feature extraction tests, but how sholud the test actually look like ? |
|
Hello @ylacombe , the feedback was integrated into the PR. Is there something more to do ? |
ylacombe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I see the feature extraction tests, but how sholud the test actually look like ?
T1: Compute features with / without dithering and make sure the outputs are different, and not "too different" ?
Exactly! Let's do this before merging
src/transformers/audio_utils.py
Outdated
| If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1` | ||
| frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins. | ||
| dither (`float`): | ||
| Adds dithering. In other words, adds a small Gaussian noise to each frame). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Adds dithering. In other words, adds a small Gaussian noise to each frame). | |
| Adds dithering. In other words, adds a small Gaussian noise to each frame. |
src/transformers/audio_utils.py
Outdated
| onesided (`bool`, *optional*, defaults to `True`): | ||
| If True, returns a one-sided spectrogram for real input signals. | ||
| dither (`float`): | ||
| Adds dithering. In other words, adds a small Gaussian noise to each frame). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Adds dithering. In other words, adds a small Gaussian noise to each frame). | |
| Adds dithering. In other words, adds a small Gaussian noise to each frame. |
| padding_value (`float`, *optional*, defaults to 0.0): | ||
| The value that is used to fill the padding vectors. | ||
| dither (`float`, *optional*, defaults to 0.0): | ||
| Adds dithering. In other words, adds a small Gaussian noise to each frame). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Adds dithering. In other words, adds a small Gaussian noise to each frame). | |
| Adds dithering. In other words, adds a small Gaussian noise to each frame. |
| padding_value (`float`, *optional*, defaults to 0.0): | ||
| Padding value used to pad the audio. Should correspond to silences. | ||
| dither (`float`, *optional*, defaults to 0.0): | ||
| Adds dithering. In other words, adds a small Gaussian noise to each frame). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Adds dithering. In other words, adds a small Gaussian noise to each frame). | |
| Adds dithering. In other words, adds a small Gaussian noise to each frame. |
src/transformers/audio_utils.py
Outdated
| onesided (`bool`, *optional*, defaults to `True`): | ||
| If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1` | ||
| frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins. | ||
| dither (`float`): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also make sure that the docstrings is right here and in the rest of the docstrings
| dither (`float`): | |
| dither (`float`, *optional*, defaults to 0.0): |
|
I had to change Tests called with : |
|
Ok, it seems to be ready. |
ylacombe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for iterating
@ArthurZucker, could you review when you have time? Thanks!
src/transformers/audio_utils.py
Outdated
| buffer[:frame_length] = waveform[timestep : timestep + frame_length] | ||
|
|
||
| if dither != 0.0: | ||
| buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docstrings, waveform is supposed to be a 1D signal. In that case, it should be simpler to do it as proposed?
| buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape) | |
| buffer[:frame_length] += dither * np.random.randn(frame_length) |
Or maybe len(buffer) ?
| @require_torch | ||
| @require_torchaudio | ||
| # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap | ||
| class ClapFeatureExtractionTester(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this part, out of curiosity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some copy consistency gh-workflow test was failing, suggesting to run a script which modified it this way
it was the "check_repository_consistency" that was failing...
| self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3)) | ||
|
|
||
| def test_dither(self): | ||
| # Tests that features with and without little dithering are similar, but not the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's set the seed here, to ensure reproducibility.
| self.assertTrue(np.abs(diff).mean() <= 1e-3) | ||
| self.assertTrue(np.abs(diff).max() <= 1e-2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
src/transformers/audio_utils.py
Outdated
| The padding strategy when `center` is `True`. | ||
| onesided (`bool`, *optional*, defaults to `True`): | ||
| If True, returns a one-sided spectrogram for real input signals. | ||
| dither (`float`): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| dither (`float`): | |
| dither (`float`, *optional*, defaults to 0.0): |
|
Hello, is there something other to be done ? |
|
cc @eustlb @Vaibhavs10, sorry for the ping but can one of you pick this up and finalize the review? |
Vaibhavs10
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yoach already reviewed it from Audio side, I think we're only looking for a core-maintainer review + merge now: #34638 (review)
|
Got it - pinging @ArthurZucker @LysandreJik for core maintainer review! |
|
Reviewing in a bit! |
ArthurZucker
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, would be nice to just say why people should use this in the doc!
| Adds dithering. In other words, adds a small Gaussian noise to each frame. | ||
| E.g. use 4.0 to add dithering with a normal distribution centered | ||
| around 0.0 with standard deviation 4.0, 0.0 means no dithering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add the comment about "this can help for hard audio in ASR" 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, added the explanatory comments, thanks!
|
could you resolve the last conflicts and we should be good to merge! |
ok, done |
|
Thanks for the PR! 🤗 |
I encountered this problem with Whisper hallucinating on hard zeros. Could you point me in a direction where it is explained how these lead to OOD data? I thought it was caused by normalization at first, but I can't tell the difference between dithered and non-dithered spectrograms when I visualize them. So how do they manage to trip up models? |
Enable dithering in speech-to-text feature extraction. Dithering exists in the original kaldi features.
The dithering is adding small Gaussian noise to the waveform on input of feature extraction.
This is helpful for audio signals with hard-zero sections due to HW VAD, these hard-zeros
may break the ASR training or inference if they appear in the data.
With dithering without a seed, the features become non-deterministic due to small Gaussian noise added to the audio (i.e. 2 runs lead to little different outputs). When debugging feature extraction code, it is good to set dithering to 0.0 (i.e. default value).