Add dithering to the `Speech2TextFeatureExtractor` API. #34638

KarelVesely84 · 2024-11-07T12:46:46Z

Enable dithering in speech-to-text feature extraction. Dithering is part of the original Kaldi feature extraction.

In Kaldi: https://github.com/kaldi-asr/kaldi/blob/4a8b7f673275597fef8a15b160124bd0985b59bd/src/feat/feature-window.cc#L145

Dithering adds small Gaussian noise to the waveform at the input of feature extraction.
This helps with audio signals that contain hard-zero sections produced by a hardware VAD; such hard zeros
may break ASR training or inference if they appear in the data.

With dithering and no fixed seed, the features become non-deterministic because of the small Gaussian noise added to the audio (i.e. two runs produce slightly different outputs). When debugging feature extraction code, it is a good idea to set dithering to 0.0 (the default value).
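
The behaviour described above can be illustrated with a minimal NumPy sketch. This is not the code from this PR: `add_dither` is a hypothetical helper, and the log-energy computation is only an assumed, simplified stand-in for a real FBANK pipeline, used to show one plausible way in which hard zeros can cause trouble (the log of zero energy is not finite).

```python
import numpy as np


def add_dither(waveform, dither=0.0, rng=None):
    """Return a float64 copy of `waveform` with Gaussian noise of std `dither` added.

    Hypothetical helper for illustration only; dither == 0.0 means no dithering.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    if dither == 0.0:
        return waveform.copy()
    rng = np.random.default_rng() if rng is None else rng
    return waveform + dither * rng.standard_normal(waveform.shape)


# A frame of hard zeros, e.g. a section muted by a VAD.
silence = np.zeros(400)

# Without dithering the frame energy is exactly 0, so its log evaluates to -inf.
with np.errstate(divide="ignore"):
    log_energy_plain = np.log(np.sum(add_dither(silence) ** 2))

# With dithering the energy is strictly positive and the log is finite.
log_energy_dithered = np.log(np.sum(add_dither(silence, dither=1.0) ** 2))

# No seed: two runs give slightly different samples.
unseeded_1 = add_dither(silence, dither=1.0)
unseeded_2 = add_dither(silence, dither=1.0)

# Fixed seed: two runs give identical samples.
seeded_1 = add_dither(silence, dither=1.0, rng=np.random.default_rng(0))
seeded_2 = add_dither(silence, dither=1.0, rng=np.random.default_rng(0))
```

Passing an explicit generator is a design choice of this sketch; it keeps the seeded and unseeded cases side by side without touching NumPy's global random state.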

KarelVesely84 · 2024-11-11T09:06:10Z

Hello, could somebody take a look at this for a review?
Thank you,
K.V.

LysandreJik · 2024-11-18T13:48:25Z

cc @ylacombe @eustlb @Vaibhavs10 regarding this proposed feature contribution

Vaibhavs10

I'd be supportive of this generally. For more context @KarelVesely84 - How are you using this in training with Kaldi and for which models? Are you using the example training/fine-tuning script?

KarelVesely84 · 2024-11-19T15:15:30Z

Hello,
we use it in a custom recipe: HF features, a custom ebranchformer encoder derived from HF, and k2/icefall fine-tuning.
I'll ask a colleague to prepare a demo with the Wav2VecConformer example recipe (enabling dithering for decoding with an existing model). Would that be OK?
K.

ylacombe

Hey @KarelVesely84, thanks for adding this, it looks great to me!

You could maybe add a small test to the feature extraction tests of speech2text, just in case, to ensure that we get the expected results with a set seed.

ylacombe · 2024-11-21T17:18:56Z

src/transformers/audio_utils.py

+            Add dithering (add small Gaussian noise to each frame).
+            E.g. use 4 to add dithering, 0.0 means no dithering.


(nit)

Suggested change

Add dithering (add small Gaussian noise to each frame).

E.g. use 4 to add dithering, 0.0 means no dithering.

Adds dithering. In other words, adds a small Gaussian noise to each frame.

E.g. use 4.0 to add dithering with a normal distribution centered around 0.0 with standard deviation 4.0. 0.0 means no dithering.

ylacombe · 2024-11-21T17:19:15Z

src/transformers/audio_utils.py

        onesided (`bool`, *optional*, defaults to `True`):
            If True, returns a one-sided spectrogram for real input signals.
+        dither (`float`):
+            Add dithering (add small Gaussian noise to each frame).


Same suggestion as above here and for the rest of the PR!

HuggingFaceDocBuilderDev · 2024-11-21T17:47:47Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

KarelVesely84 · 2024-11-22T09:34:31Z

Hello,
I'd like to extend it to other speech FeatureExtractor classes as well.
I'm trying to add it as an argument of the Processor class in a demo.

Where in the code should the feature extractor test be located?

Best,
K.

- add dithering also for WhisperFeatureExtractor
- not adding to Wav2Vec2FeatureExtractor (no FBANK computation)

KarelVesely84 · 2024-11-22T13:59:47Z

Hello,
here is the corresponding example code; it runs inference with existing models:
https://github.com/KarelVesely84/transformers_sandbox/tree/main/dithering

It is a dummy example: the models were trained without dithering and inference is run with dithering.
The degraded results demonstrate that dithering is being applied.
K.

KarelVesely84 · 2024-11-22T15:40:09Z

I see the feature extraction tests, but what should the test actually look like?
T1: Compute features with and without dithering and make sure the outputs are different, but not "too different"?

KarelVesely84 · 2024-11-29T16:09:35Z

Hello @ylacombe, the feedback has been integrated into the PR. Is there anything more to do?
Best regards,
Karel

ylacombe

LGTM!

I see the feature extraction tests, but what should the test actually look like?
T1: Compute features with and without dithering and make sure the outputs are different, but not "too different"?

Exactly! Let's do this before merging

ylacombe · 2024-12-02T12:47:47Z

src/transformers/audio_utils.py

            If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1`
            frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins.
+        dither (`float`):
+            Adds dithering. In other words, adds a small Gaussian noise to each frame).


Suggested change

Adds dithering. In other words, adds a small Gaussian noise to each frame).

Adds dithering. In other words, adds a small Gaussian noise to each frame.

ylacombe · 2024-12-02T12:47:57Z

src/transformers/audio_utils.py

        onesided (`bool`, *optional*, defaults to `True`):
            If True, returns a one-sided spectrogram for real input signals.
+        dither (`float`):
+            Adds dithering. In other words, adds a small Gaussian noise to each frame).


Suggested change

Adds dithering. In other words, adds a small Gaussian noise to each frame).

Adds dithering. In other words, adds a small Gaussian noise to each frame.

ylacombe · 2024-12-02T12:48:06Z

src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py

        padding_value (`float`, *optional*, defaults to 0.0):
            The value that is used to fill the padding vectors.
+        dither (`float`, *optional*, defaults to 0.0):
+            Adds dithering. In other words, adds a small Gaussian noise to each frame).


Suggested change

Adds dithering. In other words, adds a small Gaussian noise to each frame).

Adds dithering. In other words, adds a small Gaussian noise to each frame.

ylacombe · 2024-12-02T12:48:17Z

src/transformers/models/whisper/feature_extraction_whisper.py

        padding_value (`float`, *optional*, defaults to 0.0):
            Padding value used to pad the audio. Should correspond to silences.
+        dither (`float`, *optional*, defaults to 0.0):
+            Adds dithering. In other words, adds a small Gaussian noise to each frame).


Suggested change

Adds dithering. In other words, adds a small Gaussian noise to each frame).

Adds dithering. In other words, adds a small Gaussian noise to each frame.

ylacombe · 2024-12-02T12:51:17Z

src/transformers/audio_utils.py

        onesided (`bool`, *optional*, defaults to `True`):
            If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1`
            frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins.
+        dither (`float`):


You should also make sure that the docstring is right here and in the rest of the docstrings

Suggested change

dither (`float`):

dither (`float`, *optional*, defaults to 0.0):

KarelVesely84 · 2024-12-06T21:20:44Z

I had to change `class ***FeatureExtractionTester(unittest.TestCase):` -> `class ***FeatureExtractionTester:` in the unit tests.
Otherwise the unit tests were crashing, as there is no `test_*()` method in the Tester class.

Tests were run with: `python -m pytest -s */test_*.py` (pytest 8.3.4)

KarelVesely84 · 2024-12-06T21:22:28Z

Ok, it seems to be ready.

ylacombe

LGTM! Thanks for iterating

@ArthurZucker, could you review when you have time? Thanks!

ylacombe · 2024-12-09T10:07:06Z

src/transformers/audio_utils.py

        buffer[:frame_length] = waveform[timestep : timestep + frame_length]

+        if dither != 0.0:
+            buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape)


According to the docstrings, waveform is supposed to be a 1D signal. In that case, it should be simpler to do it as proposed?

Suggested change

buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape)

buffer[:frame_length] += dither * np.random.randn(frame_length)

Or maybe len(buffer) ?
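
A small sketch of why the two forms agree for a 1D signal. It assumes, as an illustration only, a zero-padded buffer of length `fft_length` that is longer than `frame_length`; the names and sizes below are chosen for the example and `dither_frame` is a hypothetical helper, not code from the PR.

```python
import numpy as np

frame_length = 400  # illustrative sizes, not taken from the PR
fft_length = 512
dither = 4.0


def dither_frame(buffer, unpack_shape, seed):
    """Add dither to the first `frame_length` samples of a copy of `buffer`."""
    np.random.seed(seed)  # same seed so both variants draw the same noise
    out = buffer.copy()
    if unpack_shape:
        # Original form: unpack the shape of the slice, here (frame_length,).
        noise = np.random.randn(*out[:frame_length].shape)
    else:
        # Suggested form: pass the frame length directly.
        noise = np.random.randn(frame_length)
    out[:frame_length] += dither * noise
    return out


buffer = np.zeros(fft_length, dtype=np.float64)
with_unpacking = dither_frame(buffer, unpack_shape=True, seed=0)
with_length = dither_frame(buffer, unpack_shape=False, seed=0)

# For a 1D buffer the two variants produce exactly the same values.
same_result = np.array_equal(with_unpacking, with_length)

# The zero-padded tail beyond frame_length is left untouched.
tail_untouched = bool(np.all(with_unpacking[frame_length:] == 0.0))
```

In this sketch `len(buffer)` equals `fft_length`, not `frame_length`, so noise of that length would not match the `buffer[:frame_length]` slice.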

ylacombe · 2024-12-09T10:08:25Z

tests/models/clap/test_feature_extraction_clap.py

 @require_torch
 @require_torchaudio
 # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap
-class ClapFeatureExtractionTester(unittest.TestCase):


Why did you remove this part, out of curiosity?

A copy-consistency GitHub workflow check was failing and suggested running a script, which modified it this way.

It was the "check_repository_consistency" check that was failing...

ylacombe · 2024-12-09T10:09:26Z

tests/models/speech_to_text/test_feature_extraction_speech_to_text.py

            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

+    def test_dither(self):
+        # Tests that features with and without little dithering are similar, but not the same


Let's set the seed here, to ensure reproducibility.

ylacombe · 2024-12-09T10:09:57Z

tests/models/speech_to_text/test_feature_extraction_speech_to_text.py

+        self.assertTrue(np.abs(diff).mean() <= 1e-3)
+        self.assertTrue(np.abs(diff).max() <= 1e-2)
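
A self-contained sketch of the kind of test discussed here: features with and without a little dithering should be similar but not identical, and a fixed seed should make the dithered run reproducible. It does not use the `transformers` feature extractors; `log_frame_energy` is a hypothetical toy feature, and the signal scale, dither value and threshold are illustrative assumptions.

```python
import numpy as np


def log_frame_energy(waveform, frame_length=400, dither=0.0, rng=None):
    """Toy feature: log energy of non-overlapping frames, with optional dithering."""
    rng = np.random.default_rng() if rng is None else rng
    num_frames = len(waveform) // frame_length
    frames = np.asarray(waveform[: num_frames * frame_length], dtype=np.float64)
    frames = frames.reshape(num_frames, frame_length)
    if dither != 0.0:
        # Small Gaussian noise added to every sample of every frame.
        frames = frames + dither * rng.standard_normal(frames.shape)
    return np.log(np.sum(frames**2, axis=1))


# Synthetic "audio" on a roughly 16-bit amplitude scale (assumption for the sketch).
waveform = 1000.0 * np.random.default_rng(0).standard_normal(8000)

clean = log_frame_energy(waveform, dither=0.0)

# Seeding the noise generator makes the dithered features reproducible.
dithered_a = log_frame_energy(waveform, dither=1.0, rng=np.random.default_rng(42))
dithered_b = log_frame_energy(waveform, dither=1.0, rng=np.random.default_rng(42))

diff = dithered_a - clean
reproducible = np.array_equal(dithered_a, dithered_b)
different = np.abs(diff).max() > 0.0
similar = np.abs(diff).max() <= 1e-2
```

The bound works here because the dither is tiny relative to the signal; a real test would need thresholds matched to the actual features and dither value.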


ylacombe · 2024-12-09T10:10:37Z

src/transformers/audio_utils.py

            The padding strategy when `center` is `True`.
        onesided (`bool`, *optional*, defaults to `True`):
            If True, returns a one-sided spectrogram for real input signals.
+        dither (`float`):


Suggested change

dither (`float`):

dither (`float`, *optional*, defaults to 0.0):

KarelVesely84 · 2024-12-19T11:13:05Z

Hello, is there anything else to be done?
Best regards,
K. Vesely

Rocketknight1 · 2024-12-19T14:18:52Z

cc @eustlb @Vaibhavs10, sorry for the ping but can one of you pick this up and finalize the review?

Vaibhavs10

Yoach already reviewed it from Audio side, I think we're only looking for a core-maintainer review + merge now: #34638 (review)

Rocketknight1 · 2024-12-19T14:28:53Z

Got it - pinging @ArthurZucker @LysandreJik for core maintainer review!

ArthurZucker · 2025-01-07T16:28:29Z

Reviewing in a bit!

ArthurZucker

Thanks, would be nice to just say why people should use this in the doc!

ArthurZucker · 2025-01-07T16:30:59Z

src/transformers/audio_utils.py

+            Adds dithering. In other words, adds a small Gaussian noise to each frame.
+            E.g. use 4.0 to add dithering with a normal distribution centered
+            around 0.0 with standard deviation 4.0, 0.0 means no dithering.


let's add the comment about "this can help for hard audio in ASR" 😉

ok, added the explanatory comments, thanks!

ArthurZucker · 2025-02-11T17:08:16Z

could you resolve the last conflicts and we should be good to merge!

KarelVesely84 · 2025-02-13T09:05:37Z

could you resolve the last conflicts and we should be good to merge!

ok, done

ArthurZucker · 2025-02-19T10:50:10Z

Thanks for the PR! 🤗

mizoru · 2025-05-22T17:15:42Z

@KarelVesely84

these hard-zeros may break the ASR training or inference if they appear in the data

I encountered this problem with Whisper hallucinating on hard zeros. Could you point me in a direction where it is explained how these lead to OOD data? I thought it was caused by normalization at first, but I can't tell the difference between dithered and non-dithered spectrograms when I visualize them. So how do they manage to trip up models?

KarelVesely84 force-pushed the add_dither branch 3 times, most recently from 7706bae to 668bf55 Compare November 7, 2024 13:31

Vaibhavs10 reviewed Nov 18, 2024

KarelVesely84 force-pushed the add_dither branch from 668bf55 to 696c984 Compare November 20, 2024 13:10

Vaibhavs10 requested a review from ylacombe November 21, 2024 16:38

ylacombe approved these changes Nov 21, 2024

update the PR

9d9382b

ylacombe approved these changes Dec 2, 2024

ylacombe reviewed Dec 2, 2024

KarelVesely84 added 4 commits December 6, 2024 21:33

add unit-tests for dithering, fix docstrings

b9a39a7

ruff

c2087ea

utils/check_copies.py --fix_and_overwrite

2422fdb

Merge remote-tracking branch 'origin/main' into add_dither

042699d

ylacombe approved these changes Dec 9, 2024

ylacombe requested a review from ArthurZucker December 9, 2024 10:11

KarelVesely84 added 2 commits December 10, 2024 11:31

update code, add seed to unit-test

8658a70

Merge branch 'main' into add_dither

dda4695

Vaibhavs10 reviewed Dec 19, 2024

ArthurZucker approved these changes Jan 7, 2025

KarelVesely84 added 2 commits January 9, 2025 11:38

adding explanation of dithering

c182442

Merge remote-tracking branch 'origin/main' into add_dither

2fd31df

KarelVesely84 requested a review from Rocketknight1 as a code owner January 9, 2025 10:39

Merge remote-tracking branch 'origin/main' into add_dither

7264638

ArthurZucker merged commit 1a81d77 into huggingface:main Feb 19, 2025
18 checks passed

gante mentioned this pull request Feb 19, 2025

[tests] deflake dither test #36284

Merged
