Fixes #4388: Correct transcription_delay metric calculation in STT turn detec… #4396
+163
−3
Summary
Fixes #4388
This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).
Problem
When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.
User-Reported Behavior:
The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.
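As a rough illustration of the intended meaning, the metric can be written as a plain difference of two timestamps. This is only a sketch: the function and parameter names below are hypothetical, and the timestamps are made-up values in seconds.

```python
def transcription_delay(stopped_speaking_at: float, transcript_received_at: float) -> float:
    """Seconds between the end of user speech and the arrival of the transcript.

    Hypothetical helper for illustration; not part of the library.
    """
    return transcript_received_at - stopped_speaking_at


# Made-up example: the user stops speaking at t=12.0 s and the transcript
# arrives at t=12.5 s, so the metric should report half a second.
example_delay = transcription_delay(12.0, 12.5)
```

A value that is always close to zero would mean both timestamps are being captured at almost the same moment, which is the symptom reported here.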
Root Cause
In audio_recognition.py, transcription_delay is calculated as the difference between _last_final_transcript_time and _last_speaking_time.
The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time().
Event Timeline in STT Mode (Buggy):
1. START_OF_SPEECH: _last_speaking_time = time.time() (correct)
2. FINAL_TRANSCRIPT: _last_final_transcript_time = time.time() (correct)
3. END_OF_SPEECH: _last_speaking_time = time.time() (BUG - overwrites!)
Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, both timestamps become nearly identical, resulting in transcription_delay ≈ 0.
Solution
Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.
Comparison with VAD Mode:
VAD mode does NOT update _last_speaking_time at END_OF_SPEECH; it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.
After Fix:
1. START_OF_SPEECH: _last_speaking_time = time.time() (preserved)
2. FINAL_TRANSCRIPT: _last_final_transcript_time = time.time()
Result:
transcription_delay = last_final_transcript_time - last_speaking_time now correctly represents the actual transcription latency.
Testing
All 15 existing agent session tests pass.
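Separately from that suite, the effect of the change can be illustrated with a toy timeline. The sketch below is an assumption-laden model, not the real code in audio_recognition.py: the class, its handler names, the injected clock, and the timestamps are all hypothetical, and it does not model any clamping or other logic the real implementation may apply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TimelineSketch:
    """Toy stand-in for the timestamp bookkeeping described in this PR.

    Hypothetical class for illustration only.
    """

    clock: Callable[[], float]          # injected so the timeline is deterministic
    overwrite_on_end_of_speech: bool    # True reproduces the pre-fix behavior
    last_speaking_time: float = 0.0
    last_final_transcript_time: float = 0.0

    def on_start_of_speech(self) -> None:
        self.last_speaking_time = self.clock()

    def on_final_transcript(self) -> None:
        self.last_final_transcript_time = self.clock()

    def on_end_of_speech(self) -> None:
        # Pre-fix: the timestamp is overwritten here. Post-fix: it is left alone.
        if self.overwrite_on_end_of_speech:
            self.last_speaking_time = self.clock()

    @property
    def transcription_delay(self) -> float:
        return self.last_final_transcript_time - self.last_speaking_time


def run_timeline(overwrite: bool) -> float:
    """Replay START_OF_SPEECH, FINAL_TRANSCRIPT, END_OF_SPEECH with made-up times."""
    ticks = iter([10.00, 10.80, 10.81])  # seconds; illustrative values only
    sketch = TimelineSketch(
        clock=lambda: next(ticks),
        overwrite_on_end_of_speech=overwrite,
    )
    sketch.on_start_of_speech()
    sketch.on_final_transcript()
    sketch.on_end_of_speech()
    return sketch.transcription_delay


buggy_delay = run_timeline(overwrite=True)   # close to zero
fixed_delay = run_timeline(overwrite=False)  # roughly 0.8 seconds
```

With the overwrite enabled, the two timestamps are captured 10 ms apart in this example, so the computed delay collapses to roughly zero. Without it, the delay reflects the gap between the first and second events.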
Backward Compatibility
No breaking changes. This fix only corrects the metric calculation. The actual agent behavior (speech recognition, turn detection, interruption handling) is completely unchanged.
Expected Impact:
transcription_delay values in their metrics
Edge Cases Handled
_last_speaking_time for each new segment
_last_final_transcript_time correctly
Files Changed
livekit-agents/livekit/agents/voice/audio_recognition.py: removed the self._last_speaking_time = time.time() line from the END_OF_SPEECH handler
Related Issues