vad : add streaming detect + explicit state reset #3677

Open
danielbodart wants to merge 1 commit into ggml-org:master from danielbodart:streaming-vad-state-upstream

@danielbodart danielbodart commented Feb 23, 2026

Summary

  • Add whisper_vad_detect_speech_no_reset() — identical to whisper_vad_detect_speech but does not reset LSTM hidden/cell state, enabling temporal continuity when calling per-chunk in a streaming loop
  • Add whisper_vad_reset_state() — explicit state reset for use between utterances
  • Refactor whisper_vad_detect_speech as a thin wrapper (reset + no_reset) — zero behavior change for existing callers

Motivation

whisper_vad_detect_speech calls ggml_backend_buffer_clear(vctx->buffer, 0) on every invocation, which resets the Silero LSTM hidden/cell states. This is correct for batch processing (the current use case), but prevents temporal continuity when calling per-chunk in a streaming loop — the LSTM effectively degrades to a feedforward classifier with no memory between chunks.

For streaming applications that call VAD once per chunk (e.g. 512 samples at 16 kHz = 32 ms), the model needs to carry state across calls to make use of its recurrent architecture.
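
To make the effect concrete, here is a toy illustration rather than the Silero model: a single leaky-integrator value stands in for the LSTM hidden/cell state. The names `toy_vad`, `toy_vad_step`, `run_chunks`, and `chunk_ms`, as well as the 0.5 blending factor, are hypothetical and chosen only for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for recurrent state: one smoothed value instead of LSTM tensors. */
typedef struct {
    float h;
} toy_vad;

/* One "inference" step: blend the previous state with this chunk's evidence. */
static float toy_vad_step(toy_vad *v, float chunk_evidence) {
    v->h = 0.5f * v->h + 0.5f * chunk_evidence;
    return v->h;
}

/* Feed n chunks. If reset_each_call is true, the state is zeroed before every
 * step, which mimics clearing the buffer on each invocation.
 * Returns the output of the last step. */
static float run_chunks(const float *evidence, size_t n, bool reset_each_call) {
    toy_vad v = { 0.0f };
    float out = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (reset_each_call) {
            v.h = 0.0f;
        }
        out = toy_vad_step(&v, evidence[i]);
    }
    return out;
}

/* Duration of one chunk in milliseconds, e.g. 512 samples at 16000 Hz. */
static float chunk_ms(int n_samples, int sample_rate_hz) {
    return 1000.0f * (float) n_samples / (float) sample_rate_hz;
}
```

With four consecutive chunks of evidence 1.0, the per-call reset leaves the toy output at 0.5 on every call, while carrying state lets it climb toward 1.0 (0.5, 0.75, 0.875, 0.9375). The real model is far more complex, but the structural point is the same: with a reset on every call, no output can depend on earlier chunks.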

Changes

Two new public API functions following existing naming conventions:

// Like whisper_vad_detect_speech, but does not reset LSTM state.
// Use for streaming: call whisper_vad_reset_state() between utterances.
WHISPER_API bool whisper_vad_detect_speech_no_reset(
        struct whisper_vad_context * vctx,
        const float * samples,
        int   n_samples);

// Reset LSTM hidden/cell states to zero.
WHISPER_API void whisper_vad_reset_state(struct whisper_vad_context * vctx);
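
A streaming caller might be organized roughly as follows. This is a minimal sketch under stated assumptions, not code from the PR: the driver `stream_utterance`, the callback typedefs, and the `mock_*` helpers are hypothetical, the context is passed as `void *` so the example compiles without `whisper.h`, and the boolean return is assumed to signal that inference succeeded rather than that speech was found.

```c
#include <stdbool.h>
#include <stddef.h>

/* Callback shapes mirroring the two declarations above (context as void *). */
typedef bool (*detect_fn)(void *ctx, const float *samples, int n_samples);
typedef void (*reset_fn)(void *ctx);

/* Hypothetical driver: feed one utterance chunk by chunk. State is cleared
 * once at the start and then carried across chunks. Returns the number of
 * chunks for which detection succeeded; a trailing partial chunk is skipped. */
static int stream_utterance(void *ctx, detect_fn detect_no_reset, reset_fn reset,
                            const float *samples, size_t n_samples, int chunk_size) {
    int ok_chunks = 0;
    if (chunk_size <= 0) {
        return 0;
    }
    reset(ctx); /* new utterance: clear recurrent state exactly once */
    for (size_t off = 0; off + (size_t) chunk_size <= n_samples; off += (size_t) chunk_size) {
        if (detect_no_reset(ctx, samples + off, chunk_size)) {
            ++ok_chunks;
        }
    }
    return ok_chunks;
}

/* Mock context used only to exercise the driver without the real library. */
typedef struct {
    int resets;
    int calls;
} mock_ctx;

static void mock_reset(void *ctx) {
    ((mock_ctx *) ctx)->resets++;
}

static bool mock_detect(void *ctx, const float *samples, int n_samples) {
    (void) samples;
    (void) n_samples;
    ((mock_ctx *) ctx)->calls++;
    return true;
}
```

In a real program the two callbacks would be small adapters that cast `ctx` back to `struct whisper_vad_context *` and forward to `whisper_vad_detect_speech_no_reset` and `whisper_vad_reset_state`. How an application decides that an utterance has ended, and therefore when to reset, is left to the caller.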

whisper_vad_detect_speech is now reset + no_reset — existing callers (including whisper_vad_segments_from_samples, test-vad.cpp, examples/speech.cpp) are completely unaffected.
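
The wrapper composition can be sketched on a miniature stand-in. This is not the whisper.cpp implementation: `mini_vad` and the `mini_*` functions are hypothetical, and a simple counter replaces the LSTM buffer so that the effect of the reset is directly observable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature context: the counter stands in for recurrent state. */
typedef struct {
    int chunks_since_reset;
} mini_vad;

/* Explicit reset, analogous in role to whisper_vad_reset_state. */
static void mini_reset_state(mini_vad *v) {
    v->chunks_since_reset = 0;
}

/* Stateful step, analogous in role to whisper_vad_detect_speech_no_reset:
 * it updates the state but never clears it. */
static bool mini_detect_no_reset(mini_vad *v, const float *samples, int n_samples) {
    if (samples == NULL || n_samples <= 0) {
        return false;
    }
    v->chunks_since_reset += 1;
    return true;
}

/* Thin wrapper: reset, then run the stateful step. From the caller's point of
 * view every call starts from a clean state, as it did before the refactor. */
static bool mini_detect(mini_vad *v, const float *samples, int n_samples) {
    mini_reset_state(v);
    return mini_detect_no_reset(v, samples, n_samples);
}
```

Calling `mini_detect` repeatedly always leaves the counter at 1, whatever happened before, whereas repeated calls to `mini_detect_no_reset` keep accumulating. That is the property that lets existing batch callers stay unchanged while streaming callers opt in to carried state.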

whisper_vad_detect_speech resets LSTM state on every call, which is
correct for batch processing but prevents temporal continuity when
calling per-chunk in a streaming loop.

Add whisper_vad_detect_speech_no_reset (skips buffer clear) and
whisper_vad_reset_state (explicit clear between utterances).
Existing whisper_vad_detect_speech is now a thin wrapper — zero
behavior change for current callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>