Open
Labels
bug (Something isn't working)
Description
Bug Description
In 1.3.7, endpointing waits until last_speaking_time + min_endpointing_delay.
- VAD mode: last_speaking_time is set during speech (VAD inference), so by VAD END_OF_SPEECH the delay may already be satisfied → effective delay ≈ max(VAD_min_silence, min_endpointing_delay).
- STT mode: last_speaking_time is set at STT END_OF_SPEECH, so the full min_endpointing_delay is always added after STT EOS → additive.
Docs imply a consistent “after EOS” delay, but the behavior is mode‑dependent, and the EOU metrics in STT mode exclude the STT silence, which the user still perceives as part of the wait.
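To make the difference concrete, here is a minimal timing model of the two modes as described above. It is only a sketch and not livekit-agents code: the `Timeline` class and the helper functions are hypothetical names, and the metric calculation assumes the reported delay is measured from last_speaking_time, which matches the numbers observed in this report.

```python
from dataclasses import dataclass


@dataclass
class Timeline:
    """Hypothetical timing inputs, all in seconds."""
    user_stop: float              # moment the user actually stops speaking
    eos_silence: float            # silence VAD/STT needs before END_OF_SPEECH
    min_endpointing_delay: float  # configured endpointing delay


def commit_time_vad(t: Timeline) -> float:
    # VAD mode: last_speaking_time follows the last speech frame,
    # so it is roughly the moment the user stopped.
    last_speaking_time = t.user_stop
    eos_time = t.user_stop + t.eos_silence
    # The commit cannot happen before EOS, so the two waits overlap.
    return max(eos_time, last_speaking_time + t.min_endpointing_delay)


def commit_time_stt(t: Timeline) -> float:
    # STT mode: last_speaking_time is only set when STT EOS arrives,
    # so the endpointing delay is stacked on top of the STT silence.
    last_speaking_time = t.user_stop + t.eos_silence
    return last_speaking_time + t.min_endpointing_delay


def reported_delay_stt(t: Timeline) -> float:
    # Assumption: the metric is measured from last_speaking_time,
    # which hides the STT silence from the reported value.
    last_speaking_time = t.user_stop + t.eos_silence
    return commit_time_stt(t) - last_speaking_time


t = Timeline(user_stop=0.0, eos_silence=2.0, min_endpointing_delay=2.0)
print("VAD commit:", commit_time_vad(t))            # max() behavior
print("STT commit:", commit_time_stt(t))            # additive behavior
print("STT reported delay:", reported_delay_stt(t))
```

With 2s for both settings, the model commits 2s after the user stops in VAD mode and 4s after in STT mode, while the modeled STT metric reports only 2s.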
Expected Behavior
Either:
- Consistent “after end‑of‑utterance signal” delay in all modes, or
- Docs/metrics clearly state VAD = max(), STT = additive.
Reproduction Steps
1. VAD mode: set VAD min_silence=2s, min_endpointing_delay=2s. Observe commit ≈ 2s after user stop, not 4s.
2. STT mode (with providers like AssemblyAI): set STT EOS silence=2s and min_endpointing_delay=2s. Observe total ≈ 4s (STT silence + endpointing), while EOU metric end_of_utterance_delay shows ~2s.
Operating System
Linux
Models Used
No response
Package Versions
livekit-agents=1.3.7
Session/Room/Call IDs
No response
Proposed Solution
- Update docs to explain mode‑dependent behavior.
- Or set last_speaking_time at EOS in VAD mode for consistent additive behavior.
- Add a metric for “silence‑to‑commit” including STT EOS delay.
Additional Context
No response
Screenshots and Recordings
No response