Skip to content

Conversation

@KarelVesely84
Copy link
Contributor

@KarelVesely84 KarelVesely84 commented Nov 7, 2024

Enable dithering in speech-to-text feature extraction. Dithering exists in the original kaldi features.

The dithering is adding small Gaussian noise to the waveform on input of feature extraction.
This is helpful for audio signals with hard-zero sections due to HW VAD, these hard-zeros
may break the ASR training or inference if they appear in the data.

With dithering without a seed, the features become non-deterministic due to small Gaussian noise added to the audio (i.e. 2 runs lead to little different outputs). When debugging feature extraction code, it is good to set dithering to 0.0 (i.e. default value).

@KarelVesely84 KarelVesely84 force-pushed the add_dither branch 3 times, most recently from 7706bae to 668bf55 Compare November 7, 2024 13:31
@KarelVesely84
Copy link
Contributor Author

Hello, is there somebody to look into this for a review ?
Thank you,
K.V.

@LysandreJik
Copy link
Member

cc @ylacombe @eustlb @Vaibhavs10 regarding this proposed feature contribution

Copy link
Contributor

@Vaibhavs10 Vaibhavs10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be supportive of this generally. For more context @KarelVesely84 - How are you using this in training with Kaldi and for which models? Are you using the example training/ fine-tuning script?

@KarelVesely84
Copy link
Contributor Author

Hello,
we use it in a custom recipe: HF features, custom ebranchformer encoder derived from HF, and k2/icefall fine-tuning.
I'll ask a colleauge to prepare a demo with Wav2VecConformer example recipe (enable dithering for decoding with an existing model). Would that be ok ?
K.

- in kaldi : https://github.com/kaldi-asr/kaldi/blob/4a8b7f673275597fef8a15b160124bd0985b59bd/src/feat/feature-window.cc#L145
- with dithering without a seed, the features become non-deterministic due
  to small Gaussian noise added to the audio (i.e. 2 runs lead to little
  different outputs)
Copy link
Contributor

@ylacombe ylacombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @KarelVesely84, thanks for adding this, it looks great to me!

You could maybe add a small test to the feature extraction tests of speech2text, just in case, to ensure that we get the expected results with a set seed.

Comment on lines 465 to 466
Add dithering (add small Gaussian noise to each frame).
E.g. use 4 to add dithering, 0.0 means no dithering.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
Add dithering (add small Gaussian noise to each frame).
E.g. use 4 to add dithering, 0.0 means no dithering.
Adds dithering. In other words, adds a small Gaussian noise to each frame.
E.g. use 4.0 to add dithering with a normal distribution centered around 0.0 with standard deviation 4.0. 0.0 means no dithering.

onesided (`bool`, *optional*, defaults to `True`):
If True, returns a one-sided spectrogram for real input signals.
dither (`float`):
Add dithering (add small Gaussian noise to each frame).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion as above here and for the rest of the PR!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@KarelVesely84
Copy link
Contributor Author

Hello,
I'd like to extend it also to other speech FeatureExtractor classes.
Trying to add it as an arg of Processor class it in a demo.

Where should be the test of feature extractor located in the code ?

Best,
K.

- add dithering also for WhisperFeatureExtractor
- not adding to Wav2Vec2FeatureExtractor (no FBANK computation)
@KarelVesely84
Copy link
Contributor Author

Hello,
here is the correspoding example code, it does inference from existing models:
https://github.com/KarelVesely84/transformers_sandbox/tree/main/dithering

It is a dummy example, so training is without dithering and inference is with dithering.
The worsening results demonstrate that dithering is happening.
K.

@KarelVesely84
Copy link
Contributor Author

I see the feature extraction tests, but how sholud the test actually look like ?
T1: Compute features with / without dithering and make sure the outputs are different, and not "too different" ?

@KarelVesely84
Copy link
Contributor Author

Hello @ylacombe , the feedback was integrated into the PR. Is there something more to do ?
Best regards,
Karel

Copy link
Contributor

@ylacombe ylacombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I see the feature extraction tests, but how sholud the test actually look like ?
T1: Compute features with / without dithering and make sure the outputs are different, and not "too different" ?

Exactly! Let's do this before merging

If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1`
frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins.
dither (`float`):
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Adds dithering. In other words, adds a small Gaussian noise to each frame.

onesided (`bool`, *optional*, defaults to `True`):
If True, returns a one-sided spectrogram for real input signals.
dither (`float`):
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Adds dithering. In other words, adds a small Gaussian noise to each frame.

padding_value (`float`, *optional*, defaults to 0.0):
The value that is used to fill the padding vectors.
dither (`float`, *optional*, defaults to 0.0):
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Adds dithering. In other words, adds a small Gaussian noise to each frame.

padding_value (`float`, *optional*, defaults to 0.0):
Padding value used to pad the audio. Should correspond to silences.
dither (`float`, *optional*, defaults to 0.0):
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Adds dithering. In other words, adds a small Gaussian noise to each frame).
Adds dithering. In other words, adds a small Gaussian noise to each frame.

onesided (`bool`, *optional*, defaults to `True`):
If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1`
frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins.
dither (`float`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also make sure that the docstrings is right here and in the rest of the docstrings

Suggested change
dither (`float`):
dither (`float`, *optional*, defaults to 0.0):

@KarelVesely84
Copy link
Contributor Author

KarelVesely84 commented Dec 6, 2024

I had to change class ***FeatureExtractionTester(unittest.TestCase): -> class ***FeatureExtractionTester: in unit tests.
Otherwise the unit tests were crashing, as there is no test_*() method in the Tester class.

Tests called with : python -m pytest -s */test_*.py (pytest 8.3.4)

@KarelVesely84
Copy link
Contributor Author

Ok, it seems to be ready.

Copy link
Contributor

@ylacombe ylacombe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for iterating

@ArthurZucker, could you review when you have time? Thanks!

buffer[:frame_length] = waveform[timestep : timestep + frame_length]

if dither != 0.0:
buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docstrings, waveform is supposed to be a 1D signal. In that case, it should be simpler to do it as proposed?

Suggested change
buffer[:frame_length] += dither * np.random.randn(*buffer[:frame_length].shape)
buffer[:frame_length] += dither * np.random.randn(frame_length)

Or maybe len(buffer) ?

@require_torch
@require_torchaudio
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap
class ClapFeatureExtractionTester(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this part, out of curiosity?

Copy link
Contributor Author

@KarelVesely84 KarelVesely84 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some copy consistency gh-workflow test was failing, suggesting to run a script which modified it this way

it was the "check_repository_consistency" that was failing...

self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

def test_dither(self):
# Tests that features with and without little dithering are similar, but not the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set the seed here, to ensure reproducibility.

Comment on lines +176 to +177
self.assertTrue(np.abs(diff).mean() <= 1e-3)
self.assertTrue(np.abs(diff).max() <= 1e-2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

The padding strategy when `center` is `True`.
onesided (`bool`, *optional*, defaults to `True`):
If True, returns a one-sided spectrogram for real input signals.
dither (`float`):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dither (`float`):
dither (`float`, *optional*, defaults to 0.0):

@ylacombe ylacombe requested a review from ArthurZucker December 9, 2024 10:11
@KarelVesely84
Copy link
Contributor Author

Hello, is there something other to be done ?
Best regards,
K. Vesely

@Rocketknight1
Copy link
Member

cc @eustlb @Vaibhavs10, sorry for the ping but can one of you pick this up and finalize the review?

Copy link
Contributor

@Vaibhavs10 Vaibhavs10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yoach already reviewed it from Audio side, I think we're only looking for a core-maintainer review + merge now: #34638 (review)

@Rocketknight1
Copy link
Member

Got it - pinging @ArthurZucker @LysandreJik for core maintainer review!

@ArthurZucker
Copy link
Collaborator

Reviewing in a bit!

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, would be nice to just say why people should use this in the doc!

Comment on lines +465 to +467
Adds dithering. In other words, adds a small Gaussian noise to each frame.
E.g. use 4.0 to add dithering with a normal distribution centered
around 0.0 with standard deviation 4.0, 0.0 means no dithering.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the comment about "this can help for hard audio in ASR" 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, added the explanatory comments, thanks!

@ArthurZucker
Copy link
Collaborator

could you resolve the last conflicts and we should be good to merge!

@KarelVesely84
Copy link
Contributor Author

could you resolve the last conflicts and we should be good to merge!

ok, done

@ArthurZucker ArthurZucker merged commit 1a81d77 into huggingface:main Feb 19, 2025
18 checks passed
@ArthurZucker
Copy link
Collaborator

Thanks for the PR! 🤗

@mizoru
Copy link

mizoru commented May 22, 2025

@KarelVesely84

these hard-zeros may break the ASR training or inference if they appear in the data

I encountered this problem with Whisper hallucinating on hard zeros. Could you point me in a direction where it is explained how these lead to OOD data? I thought it was caused by normalization at first, but I can't tell the difference between dithered and non-dithered spectrograms when I visualize them. So how do they manage to trip up models?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants