Make Silero VAD stateful across calls (carry LSTM state) #1

@danielbodart

Problem

whisper.cpp's whisper_vad_detect_speech resets the LSTM hidden/cell states on every call (ggml_backend_buffer_clear(vctx->buffer, 0) at whisper.cpp:5131). This is by design for whisper.cpp's one-shot file-processing use case, but it means our streaming usage (calling it repeatedly with 512-sample chunks) loses all temporal context between calls.

The upstream Silero VAD model is designed to be stateful — LSTM state should carry across 512-sample chunks, just like TEN-VAD carries state across 256-sample hops.

Proposed change

Since we fork whisper.cpp, add a non-breaking way to skip the buffer clear:

  • Option A: Add a bool reset_state parameter or flag to whisper_vad_detect_speech
  • Option B: Add a separate whisper_vad_reset_state() function and remove the auto-reset from detect_speech
  • Option C: Just remove the ggml_backend_buffer_clear line and let callers explicitly reset via a new function when needed

We also need to make VadBackend.reset() for Silero call the new reset function (it is currently a no-op).
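
A rough sketch of the Option B/C shape, to make the intended semantics concrete. This is not whisper.cpp code: toy_vad_context, toy_vad_detect and toy_vad_reset_state are hypothetical names, and a single smoothed float stands in for the real LSTM hidden/cell tensors. The only point is that detect no longer clears state, and reset becomes an explicit call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the VAD context. One float of recurrent
 * state replaces the real LSTM hidden/cell tensors. */
typedef struct {
    float state;
} toy_vad_context;

/* Explicit reset: the caller decides when a new stream starts.
 * (Assumed counterpart of the proposed whisper_vad_reset_state().) */
void toy_vad_reset_state(toy_vad_context *ctx) {
    assert(ctx != NULL);
    ctx->state = 0.0f;
}

/* Processes one chunk WITHOUT clearing state first, so earlier chunks
 * influence the result. Returns a score in [0, 1]. The "model" here is
 * just clamped mean energy blended 50/50 with the previous state. */
float toy_vad_detect(toy_vad_context *ctx, const float *samples, size_t n) {
    assert(ctx != NULL && samples != NULL && n > 0);

    float energy = 0.0f;
    for (size_t i = 0; i < n; i++) {
        energy += samples[i] * samples[i];
    }
    energy /= (float)n;
    if (energy > 1.0f) {
        energy = 1.0f;
    }

    /* Carry state across calls instead of starting from zero. */
    ctx->state = 0.5f * ctx->state + 0.5f * energy;
    return ctx->state;
}
```

With this shape, a silent chunk that follows a loud one still returns a non-zero score because the state carries over, while the same silent chunk right after toy_vad_reset_state returns zero. VadBackend.reset() would map onto the reset call; the per-chunk call site stays as it is.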

Impact

  • No change to our calling code — SileroVad.chunkProbS16 already calls once per chunk
  • Probabilities should improve with temporal context
  • Likely improves Silero's accuracy on our regression tests
  • No upstream issue/PR exists for this (checked Feb 2026)

Priority

Low — TEN-VAD is our default and already stateful. This only affects --vad silero.
