Description
Problem
whisper.cpp's whisper_vad_detect_speech resets the LSTM hidden/cell states on every call (ggml_backend_buffer_clear(vctx->buffer, 0) at whisper.cpp:5131). This is by design for whisper.cpp's one-shot file-processing use case, but it means our streaming usage (calling it repeatedly with 512-sample chunks) loses temporal context between calls.
The upstream Silero VAD model is designed to be stateful — LSTM state should carry across 512-sample chunks, just like TEN-VAD carries state across 256-sample hops.
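To illustrate the effect, here is a toy sketch, not the Silero model: a single leaky-integrator float stands in for the LSTM state, and the names toy_vad and toy_vad_process are hypothetical. It only shows that when state is cleared on every call, identical chunks always score the same regardless of what came before.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a recurrent VAD: one float of state, not an LSTM. */
typedef struct {
    float state;
} toy_vad;

/* Scores one chunk. The score depends on the chunk's mean energy and on
 * whatever state was carried in; reset_state != 0 discards that state first,
 * mimicking a per-call buffer clear. */
float toy_vad_process(toy_vad *v, const float *chunk, size_t n, int reset_state) {
    assert(v != NULL && chunk != NULL && n > 0);
    if (reset_state) {
        v->state = 0.0f;
    }
    float energy = 0.0f;
    for (size_t i = 0; i < n; i++) {
        energy += chunk[i] * chunk[i];
    }
    energy /= (float) n;
    /* Leaky integration: half old state, half new evidence. */
    v->state = 0.5f * v->state + 0.5f * energy;
    return v->state;
}
```

Feeding the same chunk twice with reset_state = 1 gives the same score both times; with reset_state = 0 the second score is higher because the first chunk's evidence is still in the state.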
Proposed change
Since we fork whisper.cpp, add a non-breaking way to skip the buffer clear:
- Option A: Add a bool reset_state parameter or flag to whisper_vad_detect_speech
- Option B: Add a separate whisper_vad_reset_state() function and remove the auto-reset from detect_speech
- Option C: Just remove the ggml_backend_buffer_clear line and let callers explicitly reset via a new function when needed
We also need to make VadBackend.reset() for Silero call the new reset function; it is currently a no-op.
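A rough sketch of what the API shape could look like, using a mock context rather than the real whisper.cpp types. Everything prefixed mock_ is hypothetical, the state update is a placeholder rather than an LSTM step, and memset stands in for the ggml backend buffer clear. The _ex wrapper corresponds to Option A, the standalone reset function to Options B and C.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MOCK_STATE_DIM 4

/* Hypothetical stand-in for the VAD context; the real one keeps its
 * LSTM state in a ggml backend buffer. */
typedef struct {
    float h[MOCK_STATE_DIM]; /* hidden state */
    float c[MOCK_STATE_DIM]; /* cell state */
} mock_vad_context;

/* Options B/C: explicit reset, called by the owner between streams. */
void mock_vad_reset_state(mock_vad_context *ctx) {
    assert(ctx != NULL);
    memset(ctx->h, 0, sizeof ctx->h);
    memset(ctx->c, 0, sizeof ctx->c);
}

/* Detection no longer clears state, so it carries across chunks.
 * The update below is a placeholder, not the Silero computation. */
float mock_vad_detect_speech(mock_vad_context *ctx, const float *samples, int n_samples) {
    assert(ctx != NULL && samples != NULL && n_samples > 0);
    float mean_abs = 0.0f;
    for (int i = 0; i < n_samples; i++) {
        mean_abs += samples[i] < 0.0f ? -samples[i] : samples[i];
    }
    mean_abs /= (float) n_samples;
    for (int k = 0; k < MOCK_STATE_DIM; k++) {
        ctx->c[k] = 0.5f * ctx->c[k] + 0.5f * mean_abs;
        ctx->h[k] = ctx->c[k];
    }
    return ctx->h[0];
}

/* Option A: a flag selects the old per-call reset or the stateful path. */
float mock_vad_detect_speech_ex(mock_vad_context *ctx, const float *samples,
                                int n_samples, bool reset_state) {
    if (reset_state) {
        mock_vad_reset_state(ctx);
    }
    return mock_vad_detect_speech(ctx, samples, n_samples);
}
```

With this shape, one-shot callers can pass reset_state = true and keep today's behavior, while the streaming path passes false and lets VadBackend.reset() call the reset function at stream boundaries.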
Impact
- No change to our calling code: SileroVad.chunkProbS16 already calls once per chunk
- Probabilities should improve with temporal context
- Likely improves Silero's accuracy on our regression tests
- No upstream issue/PR exists for this (checked Feb 2026)
Priority
Low — TEN-VAD is our default and already stateful. This only affects --vad silero.