vad : add streaming detect + explicit state reset #3677
Open
danielbodart wants to merge 1 commit into ggml-org:master from
whisper_vad_detect_speech resets LSTM state on every call, which is correct for batch processing but prevents temporal continuity when calling per-chunk in a streaming loop. Add whisper_vad_detect_speech_no_reset (skips buffer clear) and whisper_vad_reset_state (explicit clear between utterances). Existing whisper_vad_detect_speech is now a thin wrapper, with zero behavior change for current callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- whisper_vad_detect_speech_no_reset(): identical to whisper_vad_detect_speech but does not reset LSTM hidden/cell state, enabling temporal continuity when calling per-chunk in a streaming loop
- whisper_vad_reset_state(): explicit state reset for use between utterances
- whisper_vad_detect_speech becomes a thin wrapper (reset + no_reset), with zero behavior change for existing callers

Motivation
whisper_vad_detect_speech calls ggml_backend_buffer_clear(vctx->buffer, 0) on every invocation, which resets the Silero LSTM hidden/cell states. This is correct for batch processing (the current use case), but it prevents temporal continuity when calling per-chunk in a streaming loop: the LSTM effectively degrades to a feedforward classifier with no memory between chunks.

For streaming applications that call VAD once per chunk (e.g. 512 samples at 16 kHz = 32 ms), the model needs to carry state across calls to make use of its recurrent architecture.
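To make the effect of a per-call reset concrete, here is a minimal sketch using a toy recurrent detector. It is not the Silero model and not whisper.cpp code: the `toy_vad` type, its single-float "hidden state", and the 0.5 smoothing factor are all assumptions chosen for illustration. The only point is that a score which depends on previous chunks collapses to a per-chunk value once the state is cleared before every call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a recurrent VAD (hypothetical, NOT whisper.cpp):
 * the whole "hidden state" is one smoothed energy value. */
typedef struct {
    float h;
} toy_vad;

static void toy_vad_reset(toy_vad *v) {
    v->h = 0.0f;
}

/* Mean squared amplitude of one chunk. */
static float toy_chunk_energy(const float *samples, size_t n) {
    assert(samples != NULL && n > 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i] * samples[i];
    }
    return sum / (float) n;
}

/* Recurrent step: the score blends the previous state with the new chunk,
 * so it depends on what came before. */
static float toy_vad_step(toy_vad *v, const float *samples, size_t n) {
    v->h = 0.5f * v->h + 0.5f * toy_chunk_energy(samples, n);
    return v->h;
}

/* Same step, but the state is cleared first: every chunk is scored in
 * isolation, which mirrors resetting on every call. */
static float toy_vad_step_with_reset(toy_vad *v, const float *samples, size_t n) {
    toy_vad_reset(v);
    return toy_vad_step(v, samples, n);
}
```

Feeding the same constant chunk twice through `toy_vad_step` yields a rising score, while `toy_vad_step_with_reset` returns the same value both times, since nothing is remembered between calls.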
Changes
Two new public API functions following existing naming conventions:
whisper_vad_detect_speech is now reset + no_reset, so existing callers (including whisper_vad_segments_from_samples, test-vad.cpp, examples/speech.cpp) are completely unaffected.
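The layering described above (reset, no-reset detect, and a wrapper that composes the two) can be sketched in isolation. Everything in the block below is a hypothetical stand-in: `demo_vad_ctx`, the `demo_vad_*` functions, and the energy-based scoring are assumptions for illustration and do not reproduce the PR's actual signatures or the Silero model. The chunk size of 512 samples follows the example in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CHUNK_SAMPLES 512 /* 512 samples at 16 kHz = 32 ms */

/* Hypothetical context; stands in for the opaque VAD context. */
typedef struct {
    float state;     /* stand-in for the LSTM hidden/cell state */
    float threshold; /* score above which a chunk counts as speech */
} demo_vad_ctx;

/* Analogue of an explicit state reset, called between utterances. */
static void demo_vad_reset_state(demo_vad_ctx *ctx) {
    ctx->state = 0.0f;
}

/* Analogue of the no-reset detect: state carries over from the last call. */
static bool demo_vad_detect_no_reset(demo_vad_ctx *ctx, const float *samples, size_t n) {
    assert(ctx != NULL && samples != NULL && n > 0);
    float energy = 0.0f;
    for (size_t i = 0; i < n; i++) {
        energy += samples[i] * samples[i];
    }
    energy /= (float) n;
    ctx->state = 0.5f * ctx->state + 0.5f * energy;
    return ctx->state > ctx->threshold;
}

/* Analogue of the batch entry point: a thin wrapper, reset + no_reset. */
static bool demo_vad_detect(demo_vad_ctx *ctx, const float *samples, size_t n) {
    demo_vad_reset_state(ctx);
    return demo_vad_detect_no_reset(ctx, samples, n);
}

/* Streaming use: reset once at the start of an utterance, then feed fixed
 * chunks without resetting. Returns how many chunks were flagged as speech.
 * A trailing partial chunk is ignored in this sketch. */
static size_t demo_vad_stream_utterance(demo_vad_ctx *ctx, const float *samples, size_t n_samples) {
    size_t flagged = 0;
    demo_vad_reset_state(ctx);
    for (size_t off = 0; off + DEMO_CHUNK_SAMPLES <= n_samples; off += DEMO_CHUNK_SAMPLES) {
        if (demo_vad_detect_no_reset(ctx, samples + off, DEMO_CHUNK_SAMPLES)) {
            flagged++;
        }
    }
    return flagged;
}
```

Keeping the reset in a separate function means batch callers still get a clean state on every call through the wrapper, while a streaming caller decides where utterance boundaries fall.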