@devbyteai
Summary

Fixes #4388

This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).


Problem

When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.

User-Reported Behavior:

"EOU metrics showing ~0.79 transcription_delay when should reflect actual processing time"

The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.


Root Cause

In audio_recognition.py, the transcription_delay is calculated as:

transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)

The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time():

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._last_speaking_time = time.time()  # BUG: Overwrites the value!

Event Timeline in STT Mode (Buggy):

  1. START_OF_SPEECH → _last_speaking_time = time.time() (correct)
  2. FINAL_TRANSCRIPT → _last_final_transcript_time = time.time() (correct)
  3. END_OF_SPEECH → _last_speaking_time = time.time() (BUG - overwrites!)

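Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, the overwritten _last_speaking_time ends up equal to or slightly later than _last_final_transcript_time. The difference is then near zero or negative, and max(..., 0) clamps it, resulting in transcription_delay ≈ 0.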
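To make the arithmetic concrete, here is a minimal sketch of the buggy flow. The timestamps are hypothetical values chosen for illustration, and the helper function is a stand-in that only reproduces the formula quoted above; it is not the code in audio_recognition.py.

```python
def transcription_delay(last_final_transcript_time: float, last_speaking_time: float) -> float:
    # Same formula as quoted above; the helper itself is illustrative only.
    return max(last_final_transcript_time - last_speaking_time, 0)


# Hypothetical timestamps (seconds) for a single turn in STT mode.
final_transcript_at = 12.5
end_of_speech_at = 12.75  # arrives shortly after the final transcript

# Buggy flow: END_OF_SPEECH has overwritten _last_speaking_time.
last_speaking_time = end_of_speech_at

raw_difference = final_transcript_at - last_speaking_time
buggy_delay = transcription_delay(final_transcript_at, last_speaking_time)

print(raw_difference)  # negative, because END_OF_SPEECH came later
print(buggy_delay)     # clamped to 0 by max(..., 0)
```

The negative raw difference is why the metric reads as exactly or nearly zero rather than as a small positive number.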


Solution

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.

Comparison with VAD Mode:
VAD mode does NOT update _last_speaking_time at END_OF_SPEECH - it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.

After Fix:

  1. START_OF_SPEECH → _last_speaking_time = time.time() (preserved)
  2. FINAL_TRANSCRIPT → _last_final_transcript_time = time.time()
  3. END_OF_SPEECH → No overwrite

Result: transcription_delay = max(last_final_transcript_time - last_speaking_time, 0) now reflects the actual transcription latency instead of collapsing to ~0.
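The fixed event flow can be modeled with a small state object. This is a sketch under stated assumptions: the Event enum and TurnTimestamps class are hypothetical stand-ins for stt.SpeechEventType and the recognizer's internal state, and the clock value is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    # Hypothetical stand-in for stt.SpeechEventType.
    START_OF_SPEECH = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


@dataclass
class TurnTimestamps:
    # Hypothetical model of the two timestamps used by the metric.
    last_speaking_time: float = 0.0
    last_final_transcript_time: float = 0.0

    def handle(self, event: Event, now: float) -> None:
        if event is Event.START_OF_SPEECH:
            self.last_speaking_time = now
        elif event is Event.FINAL_TRANSCRIPT:
            self.last_final_transcript_time = now
        elif event is Event.END_OF_SPEECH:
            pass  # After the fix: last_speaking_time is left untouched.

    @property
    def transcription_delay(self) -> float:
        return max(self.last_final_transcript_time - self.last_speaking_time, 0)


state = TurnTimestamps()
for event, now in [
    (Event.START_OF_SPEECH, 10.0),
    (Event.FINAL_TRANSCRIPT, 12.5),
    (Event.END_OF_SPEECH, 12.75),
]:
    state.handle(event, now)

delay = state.transcription_delay
print(delay)  # 2.5
```

With the END_OF_SPEECH branch reduced to a no-op, the value recorded at START_OF_SPEECH survives until the metric is computed.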


Testing

All 15 existing agent session tests pass:

tests/test_agent_session.py::test_events_and_metrics PASSED
tests/test_agent_session.py::test_tool_call PASSED
tests/test_agent_session.py::test_interruption[False-5.5] PASSED
tests/test_agent_session.py::test_interruption[True-5.5] PASSED
tests/test_agent_session.py::test_interruption_options PASSED
tests/test_agent_session.py::test_interruption_by_text_input PASSED
tests/test_agent_session.py::test_interruption_before_speaking[False-3.5] PASSED
tests/test_agent_session.py::test_interruption_before_speaking[True-3.5] PASSED
tests/test_agent_session.py::test_generate_reply PASSED
tests/test_agent_session.py::test_preemptive_generation[True-0.8] PASSED
tests/test_agent_session.py::test_preemptive_generation[False-1.1] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-2.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-2.0] PASSED

======================== 15 passed in 75.96s ========================

Backward Compatibility

No breaking changes. This fix only corrects the metric calculation; the agent's actual behavior (speech recognition, turn detection, interruption handling) is unchanged.

Expected Impact:

  • Users with STT turn detection will now see accurate transcription_delay values in their metrics
  • Dashboards showing this metric will now report correct latency (previously under-reported as ~0)

Edge Cases Handled

  1. No VAD present - Already handled at lines 376-382, falls back to STT timestamps
  2. Multiple speech segments - START_OF_SPEECH updates _last_speaking_time for each new segment
  3. Preflight transcripts - Also update _last_final_transcript_time correctly
  4. VAD mode unchanged - Fix only affects STT turn detection mode
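The multiple-segment case (item 2) can be illustrated by replaying a hypothetical event log. The replay function, the string event names, and the choice to sample the metric at each END_OF_SPEECH are assumptions made for this sketch; the real code may compute the metric at a different point.

```python
def replay(events: list[tuple[str, float]]) -> list[float]:
    """Return one transcription_delay per END_OF_SPEECH in a hypothetical event log."""
    last_speaking_time = 0.0
    last_final_transcript_time = 0.0
    delays = []
    for name, now in events:
        if name == "START_OF_SPEECH":
            last_speaking_time = now  # refreshed for every new segment
        elif name == "FINAL_TRANSCRIPT":
            last_final_transcript_time = now
        elif name == "END_OF_SPEECH":
            # Fixed behavior: read the timestamps, never overwrite them.
            delays.append(max(last_final_transcript_time - last_speaking_time, 0))
    return delays


delays = replay([
    ("START_OF_SPEECH", 1.0), ("FINAL_TRANSCRIPT", 2.5), ("END_OF_SPEECH", 2.75),
    ("START_OF_SPEECH", 5.0), ("FINAL_TRANSCRIPT", 5.5), ("END_OF_SPEECH", 5.75),
])
print(delays)  # [1.5, 0.5]
```

Each segment gets its own delay because START_OF_SPEECH resets the reference timestamp before the next final transcript arrives.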

Files Changed

livekit-agents/livekit/agents/voice/audio_recognition.py

  • Removed the buggy self._last_speaking_time = time.time() line from END_OF_SPEECH handler
  • Added explanatory comment documenting why we don't update the timestamp here

Related Issues

…tion mode

Fixes livekit#4388

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode.
This was causing transcription_delay to always be ~0 since END_OF_SPEECH typically
arrives after FINAL_TRANSCRIPT, making both timestamps nearly identical.

Development

Successfully merging this pull request may close these issues.

Incorrect transcription_delay when using STT turn detection mode
