Streamed Voice Agent Demo - Multiple Performance Issues
Description
The streamed voice agent demo is experiencing several critical issues that affect its usability and functionality:
- High Latency: There is a significant delay (3-4 seconds) before receiving responses.
- Language Switching: The agent randomly switches to Spanish during conversations.
- Over-sensitivity: The agent frequently detects speech and provides incorrect descriptions even when no one is speaking.
- Interruption Issues: The agent cannot be interrupted despite `semantic_vad` apparently being implemented in the code.
Steps to Reproduce
- Launch the streamed voice agent demo
- Attempt to engage in conversation with the agent
- Observe the delay between speaking and receiving a response
- Continue conversation for several exchanges to observe language switching
- Remain silent for periods to observe false speech detection
- Try to interrupt the agent while it's speaking
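To put a number on the delay observed in the steps above, a small timing helper can wrap whatever async stream the demo yields audio from. This is a minimal sketch, not the demo's code: `time_to_first_item` and `fake_agent_stream` are hypothetical names, and the fake stream only stands in for the real audio source.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple, TypeVar

T = TypeVar("T")


async def time_to_first_item(stream: AsyncIterator[T]) -> Tuple[Optional[T], float]:
    """Return the first item from `stream` and the seconds spent waiting for it."""
    start = time.perf_counter()
    async for item in stream:
        # Stop at the first item: this is the "time to first audio" figure.
        return item, time.perf_counter() - start
    # The stream ended without producing anything.
    return None, time.perf_counter() - start


async def fake_agent_stream(delay_s: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for the demo's audio stream (assumption)."""
    await asyncio.sleep(delay_s)
    yield b"\x00\x00"
    yield b"\x00\x00"


if __name__ == "__main__":
    first, latency = asyncio.run(time_to_first_item(fake_agent_stream(0.2)))
    print(f"first chunk after {latency:.2f}s")
```

Starting the timer when the user stops speaking, rather than when the stream object is created, gives a figure that can be compared directly with the 1 second target.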
Expected Behavior
- Responses should begin within 1 second of user input
- The agent should maintain the initially selected language throughout the conversation
- Speech detection should only activate when actual speech is present
- The `semantic_vad` feature should allow interruption of the agent's responses
Actual Behavior
- Responses take 3-4 seconds to begin after user input
- The agent randomly switches to Spanish during English conversations
- The agent frequently reports detecting speech and provides descriptions when no one is speaking
- The agent cannot be interrupted despite the apparent implementation of `semantic_vad`
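To help separate microphone noise from detector behavior in the false speech detection case, it may be useful to log the level of the captured audio while nobody is speaking. The sketch below assumes the demo captures 16-bit mono PCM, which I have not verified; `rms_dbfs` is a hypothetical helper, not part of the demo.

```python
import array
import math


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of 16-bit mono PCM in dBFS (0 dBFS is full scale).

    Samples are read in native byte order, which is little-endian on
    typical desktop hardware (assumption about the capture format).
    """
    samples = array.array("h")
    samples.frombytes(pcm16)  # raises ValueError on an odd byte count
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / (32768.0 ** 2))


if __name__ == "__main__":
    silent_frame = bytes(320)  # 160 zero-valued samples
    print(rms_dbfs(silent_frame))
```

If the "silent" periods show a high level, the over-sensitivity may be partly an input problem (room noise, speaker bleed, gain) rather than purely a detection setting.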
Technical Details
From code inspection, `semantic_vad` appears to be implemented but is not functioning as expected. This suggests a potential issue with how the feature is integrated or configured in the current build.
Additional Notes
These issues significantly impact the user experience and demonstration value of the agent. The latency and language switching problems are particularly disruptive during presentations.
Possible Solutions
- Investigate streaming optimization to reduce latency
- Check language model configuration for potential causes of language switching
- Adjust speech detection sensitivity parameters
- Review `semantic_vad` implementation to ensure proper configuration
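As a starting point for the configuration review, here is a sketch of the settings I would expect to matter, built as plain dictionaries. The field names (`turn_detection`, `eagerness`, `interrupt_response`, `input_audio_transcription`, `language`) follow my understanding of the OpenAI Realtime API session options. Whether the demo forwards them unchanged, and whether `interrupt_response` applies to the session type the demo uses, are assumptions. The builder functions and the model name are placeholders.

```python
import json
from typing import Any, Dict, Optional

ALLOWED_EAGERNESS = {"low", "medium", "high", "auto"}


def build_turn_detection(eagerness: str = "low",
                         interrupt_response: bool = True) -> Dict[str, Any]:
    """Build a semantic_vad turn detection block (hypothetical helper)."""
    if eagerness not in ALLOWED_EAGERNESS:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    return {
        "type": "semantic_vad",
        # Lower eagerness waits longer before deciding the user has finished.
        "eagerness": eagerness,
        # Assumed to control barge-in; may not apply to transcription-only sessions.
        "interrupt_response": interrupt_response,
    }


def build_session_settings(language: Optional[str] = "en") -> Dict[str, Any]:
    """Combine a pinned transcription language with turn detection."""
    transcription: Dict[str, Any] = {"model": "gpt-4o-transcribe"}  # placeholder
    if language:
        # Pinning the input language may reduce drift into other languages.
        transcription["language"] = language
    return {
        "input_audio_transcription": transcription,
        "turn_detection": build_turn_detection(),
    }


if __name__ == "__main__":
    print(json.dumps(build_session_settings(), indent=2))
```

Comparing this against what the demo actually sends would show whether `semantic_vad` is configured but never passed to the service, or passed but not acted on by the playback side.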
Priority
High - These issues prevent effective demonstration of the voice agent's capabilities.