vad : add streaming detect + explicit state reset by danielbodart · Pull Request #3677 · ggml-org/whisper.cpp

danielbodart · 2026-02-23T17:15:15Z

Summary

Add whisper_vad_detect_speech_no_reset() — identical to whisper_vad_detect_speech but does not reset LSTM hidden/cell state, enabling temporal continuity when calling per-chunk in a streaming loop
Add whisper_vad_reset_state() — explicit state reset for use between utterances
Refactor whisper_vad_detect_speech as a thin wrapper (reset + no_reset) — zero behavior change for existing callers

Motivation

whisper_vad_detect_speech calls ggml_backend_buffer_clear(vctx->buffer, 0) on every invocation, which resets the Silero LSTM hidden/cell states. This is correct for batch processing (the current use case), but prevents temporal continuity when calling per-chunk in a streaming loop — the LSTM effectively degrades to a feedforward classifier with no memory between chunks.

For streaming applications that call VAD once per chunk (e.g. 512 samples at 16 kHz = 32 ms), the model needs to carry state across calls to make use of its recurrent architecture.
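To make the chunk arithmetic concrete, here is a minimal sketch of a fixed-size chunking loop. The helpers chunk_duration_ms, for_each_chunk, count_chunk and the chunk_fn callback type are hypothetical names invented for illustration and are not part of whisper.cpp; in a real application the callback would be the place to call the VAD once per chunk.

```c
#include <stddef.h>

// Hypothetical per-chunk callback type (not a whisper.cpp type).
typedef void (*chunk_fn)(const float * chunk, int n_samples, void * user);

// Duration of one chunk in milliseconds, e.g. 512 samples at 16000 Hz.
double chunk_duration_ms(int n_samples, int sample_rate) {
    return 1000.0 * n_samples / sample_rate;
}

// Example callback: counts how many chunks were delivered.
void count_chunk(const float * chunk, int n_samples, void * user) {
    (void) chunk;
    (void) n_samples;
    (*(int *) user)++;
}

// Feed the buffer to `fn` in full chunks of `chunk_size` samples.
// Returns the number of chunks delivered; a trailing partial chunk
// is not delivered and is left for the caller to buffer.
int for_each_chunk(const float * samples, int n_total, int chunk_size,
                   chunk_fn fn, void * user) {
    if (samples == NULL || fn == NULL || chunk_size <= 0) {
        return 0;
    }
    int n_chunks = 0;
    for (int off = 0; off + chunk_size <= n_total; off += chunk_size) {
        fn(samples + off, chunk_size, user);
        n_chunks++;
    }
    return n_chunks;
}
```

With 100 ms of audio at 16 kHz (1600 samples) and a chunk size of 512, this delivers three chunks and leaves 64 samples over, so a per-call state reset would discard the recurrent state three times within a tenth of a second.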

Changes

Two new public API functions following existing naming conventions:

// Like whisper_vad_detect_speech, but does not reset LSTM state.
// Use for streaming: call whisper_vad_reset_state() between utterances.
WHISPER_API bool whisper_vad_detect_speech_no_reset(
        struct whisper_vad_context * vctx,
        const float * samples,
        int   n_samples);

// Reset LSTM hidden/cell states to zero.
WHISPER_API void whisper_vad_reset_state(struct whisper_vad_context * vctx);

whisper_vad_detect_speech is now reset + no_reset — existing callers (including whisper_vad_segments_from_samples, test-vad.cpp, examples/speech.cpp) are completely unaffected.
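The reset + no_reset composition can be illustrated with a toy stand-in. This is only a sketch of the pattern, not the whisper.cpp implementation: struct toy_vad and the toy_vad_* functions are hypothetical, and a single float updated by exponential smoothing plays the role of the LSTM hidden/cell state.

```c
#include <stdbool.h>
#include <stddef.h>

// Toy stand-in for a recurrent VAD context (hypothetical, not whisper.cpp).
// `state` plays the role of the LSTM hidden/cell state.
struct toy_vad {
    float state;
};

// Explicit reset, intended for use between utterances.
void toy_vad_reset_state(struct toy_vad * v) {
    v->state = 0.0f;
}

// Streaming variant: updates the state without clearing it first,
// so each call sees what earlier chunks left behind.
bool toy_vad_detect_no_reset(struct toy_vad * v, const float * samples, int n_samples) {
    if (v == NULL || samples == NULL || n_samples <= 0) {
        return false;
    }
    float energy = 0.0f;
    for (int i = 0; i < n_samples; i++) {
        energy += samples[i] * samples[i];
    }
    energy /= (float) n_samples;
    // Blend the previous state with the mean energy of the new chunk.
    v->state = 0.5f * v->state + 0.5f * energy;
    return true;
}

// Batch variant: a thin wrapper that resets and then delegates,
// so every call starts from a zero state.
bool toy_vad_detect(struct toy_vad * v, const float * samples, int n_samples) {
    if (v == NULL) {
        return false;
    }
    toy_vad_reset_state(v);
    return toy_vad_detect_no_reset(v, samples, n_samples);
}
```

Feeding the same chunk twice through toy_vad_detect_no_reset moves the state further on the second call, while toy_vad_detect yields the same state every time. That is the difference a streaming caller is after, and it is also why callers of the wrapper see no change in behavior.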

whisper_vad_detect_speech resets LSTM state on every call, which is correct for batch processing but prevents temporal continuity when calling per-chunk in a streaming loop. Add whisper_vad_detect_speech_no_reset (skips buffer clear) and whisper_vad_reset_state (explicit clear between utterances). Existing whisper_vad_detect_speech is now a thin wrapper, with zero behavior change for current callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

danielbodart mentioned this pull request Feb 24, 2026

Make Silero VAD stateful across calls (carry LSTM state) danielbodart/capsper#1

Closed

danielbodart wants to merge 1 commit into ggml-org:master from danielbodart:streaming-vad-state-upstream
