Voice abstraction layer for AgentPlexus supporting TTS, STT, and Voice Agents across multiple providers and transport protocols.

    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                                  OmniVoice                                  │
    ├─────────────────────────────────────────────────────────────────────────────┤
    │                                                                             │
    │   ┌─────────────┐   ┌─────────────┐   ┌─────────────────────────────────┐   │
    │   │     TTS     │   │     STT     │   │           Voice Agent           │   │
    │   │             │   │             │   │                                 │   │
    │   │ Text → Audio│   │ Audio → Text│   │  Real-time bidirectional voice  │   │
    │   └──────┬──────┘   └──────┬──────┘   └────────────────┬────────────────┘   │
    │          │                 │                           │                    │
    │          ▼                 ▼                           ▼                    │
    │   ┌─────────────────────────────────────────────────────────────────────┐   │
    │   │                           Provider Layer                            │   │
    │   ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤   │
    │   │ ElevenLabs  │ Deepgram    │ Google Cloud│ AWS         │ Azure       │   │
    │   │ Cartesia    │ Whisper     │ AssemblyAI  │ Polly       │ Speech      │   │
    │   └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘   │
    │                                                                             │
    │   ┌─────────────────────────────────────────────────────────────────────┐   │
    │   │                           Transport Layer                           │   │
    │   ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤   │
    │   │ WebRTC      │ SIP         │ PSTN        │ WebSocket   │ HTTP        │   │
    │   └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘   │
    │                                                                             │
    │   ┌─────────────────────────────────────────────────────────────────────┐   │
    │   │                       Call System Integration                       │   │
    │   ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤   │
    │   │ Twilio      │ RingCentral │ Zoom        │ LiveKit     │ Daily       │   │
    │   └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘   │
    │                                                                             │
    └─────────────────────────────────────────────────────────────────────────────┘

    omnivoice/
    ├── tts/                    # Text-to-Speech
    │   ├── tts.go              # Interface definitions
    │   ├── elevenlabs/         # ElevenLabs provider
    │   ├── polly/              # AWS Polly provider
    │   ├── google/             # Google Cloud TTS
    │   ├── azure/              # Azure Speech
    │   └── cartesia/           # Cartesia provider
    │
    ├── stt/                    # Speech-to-Text
    │   ├── stt.go              # Interface definitions
    │   ├── whisper/            # OpenAI Whisper
    │   ├── deepgram/           # Deepgram provider
    │   ├── google/             # Google Speech-to-Text
    │   ├── azure/              # Azure Speech
    │   └── assemblyai/         # AssemblyAI provider
    │
    ├── agent/                  # Voice Agent orchestration
    │   ├── agent.go            # Interface definitions
    │   ├── session.go          # Conversation session management
    │   ├── elevenlabs/         # ElevenLabs Agents
    │   ├── vapi/               # Vapi.ai
    │   ├── retell/             # Retell AI
    │   └── custom/             # Custom agent (STT + LLM + TTS)
    │
    ├── transport/              # Audio transport protocols
    │   ├── transport.go        # Interface definitions
    │   ├── webrtc/             # WebRTC transport
    │   ├── websocket/          # WebSocket streaming
    │   ├── sip/                # SIP protocol
    │   └── http/               # HTTP-based (batch)
    │
    ├── callsystem/             # Call system integrations
    │   ├── callsystem.go       # Interface definitions
    │   ├── twilio/             # Twilio ConversationRelay
    │   ├── ringcentral/        # RingCentral Voice API
    │   ├── zoom/               # Zoom SDK integration
    │   ├── livekit/            # LiveKit rooms
    │   └── daily/              # Daily.co
    │
    ├── subtitle/               # Subtitle generation
    │   └── subtitle.go         # SRT/VTT from transcription results
    │
    └── examples/
        ├── simple-tts/         # Basic TTS example
        ├── voice-agent/        # Voice agent with Twilio
        └── multi-provider/     # Provider fallback example

Voice AI agents need a transport layer to receive and send audio. The choice depends on the use case:

Call System Options

| Platform | Protocol | Best For | Complexity |
|---|---|---|---|
| Twilio ConversationRelay | WebRTC/SIP/PSTN | Phone calls, IVR, call centers | Medium - managed infrastructure |
| RingCentral Voice API | WebRTC/SIP | Enterprise PBX, business phones | Medium - native AI receptionist |
| Zoom SDK | Proprietary (via SDK) | Video meetings with voice bots | High - requires native SDK |
| LiveKit | WebRTC | Custom apps, real-time AI | Low - open source WebRTC rooms |
| Daily.co | WebRTC | Embedded video, browser-based | Low - simple API |
| WebSocket (Direct) | WS/WSS | Web apps, custom UIs | Low - direct streaming |

    ┌───────────────────────────────────────────────────────────────────────────────┐
    │                             PSTN/WebRTC Call Flow                             │
    │                                                                               │
    │  ┌─────────┐          ┌──────────────┐           ┌─────────────────────────┐  │
    │  │  User   │◄────────►│    Twilio    │◄─────────►│        OmniVoice        │  │
    │  │ (Phone) │   PSTN   │ Conversation │ WebSocket │                         │  │
    │  │         │          │    Relay     │           │ ┌─────────────────────┐ │  │
    │  └─────────┘          └──────────────┘           │ │     Voice Agent     │ │  │
    │                                                  │ │                     │ │  │
    │                                                  │ │  ┌───────┐          │ │  │
    │  Audio In  ─────────────────────────────────────►│ │  │  STT  │──┐       │ │  │
    │                                                  │ │  └───────┘  │       │ │  │
    │                                                  │ │             ▼       │ │  │
    │                                                  │ │  ┌───────────────┐  │ │  │
    │                                                  │ │  │  LLM / Agent  │  │ │  │
    │                                                  │ │  │ (Eino, etc.)  │  │ │  │
    │                                                  │ │  └───────────────┘  │ │  │
    │                                                  │ │             │       │ │  │
    │                                                  │ │  ┌───────┐  │       │ │  │
    │  Audio Out ◄─────────────────────────────────────│ │  │  TTS  │◄─┘       │ │  │
    │                                                  │ │  └───────┘          │ │  │
    │                                                  │ └─────────────────────┘ │  │
    │                                                  └─────────────────────────┘  │
    │                                                                               │
    └───────────────────────────────────────────────────────────────────────────────┘

    ┌───────────────────────────────────────────────────────────────────────────┐
    │                             Zoom Meeting Flow                             │
    │                                                                           │
    │   ┌───────────────────────────────────────────────────────────────────┐   │
    │   │                           Zoom Meeting                            │   │
    │   │                                                                   │   │
    │   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────────┐   │   │
    │   │  │ User 1  │  │ User 2  │  │ User 3  │  │     Bot Client      │   │   │
    │   │  │ (Human) │  │ (Human) │  │ (Human) │  │     (Zoom SDK)      │   │   │
    │   │  └─────────┘  └─────────┘  └─────────┘  └──────────┬──────────┘   │   │
    │   │                                                    │              │   │
    │   └────────────────────────────────────────────────────┬──────────────┘   │
    │                                                        │                  │
    │                                      Raw Audio Stream  │                  │
    │                                                        ▼                  │
    │   ┌───────────────────────────────────────────────────────────────────┐   │
    │   │                          OmniVoice Agent                          │   │
    │   │                                                                   │   │
    │   │  Option A: Use Recall.ai (recommended)                            │   │
    │   │  ┌─────────────┐                                                  │   │
    │   │  │  Recall.ai  │───► Handles Zoom SDK complexity                  │   │
    │   │  │     Bot     │───► Provides audio stream via WebSocket          │   │
    │   │  └─────────────┘                                                  │   │
    │   │                                                                   │   │
    │   │  Option B: Self-hosted Zoom SDK Bot                               │   │
    │   │  ┌─────────────┐                                                  │   │
    │   │  │ Zoom Linux  │───► Complex: requires native SDK                 │   │
    │   │  │   SDK Bot   │───► One instance per meeting                     │   │
    │   │  └─────────────┘───► Months of engineering                        │   │
    │   │                                                                   │   │
    │   └───────────────────────────────────────────────────────────────────┘   │
    │                                                                           │
    └───────────────────────────────────────────────────────────────────────────┘

| Use Case | Call System | Transport | Notes |
|---|---|---|---|
| IVR / Call Center | Twilio ConversationRelay | PSTN/SIP | Best managed solution |
| Business Phone | RingCentral | WebRTC/SIP | Native AI Receptionist available |
| Custom Web App | LiveKit or Daily | WebRTC | Open source, flexible |
| Zoom Meetings | Recall.ai + Zoom | SDK → WebSocket | Avoid building Zoom bot yourself |
| Browser Widget | Direct WebSocket | WebSocket | ElevenLabs widget or custom |
| Mobile App | LiveKit | WebRTC | Cross-platform support |
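
For natural conversation, total round-trip latency should stay under 500ms:

    User speaks → STT (100-300ms) → LLM (200-500ms) → TTS (100-200ms) → User hears

    Target:     < 500ms total for "instant" feel
    Acceptable: < 1000ms for natural conversation
    Poor:       > 1500ms feels laggy

Run strictly in sequence, the stages above add up to roughly 400-1000ms, so reaching the 500ms target depends on overlapping and shortening them:
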
- Streaming STT: Start processing before the user finishes speaking
- Streaming TTS: Start playing audio before the full response is generated
- Edge inference: Use providers with edge nodes (Deepgram, ElevenLabs)
- Turn detection: Use voice activity detection (VAD) for quick turn-taking

| Provider | Latency | Quality | Voices | Streaming | Price |
|---|---|---|---|---|---|
| ElevenLabs | Low | Excellent | 5000+ | Yes | $$$ |
| Cartesia | Very Low | Good | 100+ | Yes | $$ |
| AWS Polly | Low | Good | 60+ | Yes | $ |
| Google TTS | Low | Good | 200+ | Yes | $ |
| Azure Speech | Low | Excellent | 400+ | Yes | $$ |

| Provider | Latency | Accuracy | Streaming | Languages | Price |
|---|---|---|---|---|---|
| Deepgram | Very Low | Excellent | Yes | 30+ | $$ |
| Whisper (OpenAI) | Medium | Excellent | No* | 50+ | $ |
| Google Speech | Low | Excellent | Yes | 125+ | $$ |
| AssemblyAI | Low | Excellent | Yes | 20+ | $$ |
| Azure Speech | Low | Excellent | Yes | 100+ | $$ |

*Whisper requires self-hosting for streaming (e.g., faster-whisper)

| Provider | Customization | Latency | Telephony | Price |
|---|---|---|---|---|
| ElevenLabs Agents | Medium | Low | Via Twilio | $$$ |
| Vapi | High | Low | Built-in | $$ |
| Retell AI | High | Low | Built-in | $$ |
| Custom (OmniVoice) | Full | Variable | Via integration | Variable |

OmniVoice includes conformance test suites that provider implementations can run to verify that they implement the TTS and STT interfaces correctly and behave consistently.
Provider implementations should import the matching providertest package and run the conformance tests:
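
    // In your provider's conformance_test.go
    package yourprovider // placeholder: use your provider's package name

    import (
    	"os"
    	"testing"

    	"github.com/agentplexus/omnivoice/stt/providertest"
    	// or for TTS:
    	// "github.com/agentplexus/omnivoice/tts/providertest"
    )

    func TestConformance(t *testing.T) {
    	// Assumption: the API key is read from the environment; the variable
    	// name below is only a placeholder.
    	apiKey := os.Getenv("PROVIDER_API_KEY")

    	// New and WithAPIKey are your provider's own constructor and option.
    	p, err := New(WithAPIKey(apiKey))
    	if err != nil {
    		t.Fatal(err)
    	}

    	providertest.RunAll(t, providertest.Config{
    		Provider:      p,
    		TestAudioFile: "/path/to/test.mp3",
    		TestAudioURL:  "https://example.com/test.mp3",
    		// ...
    	})
    }

| Category | Description | API Required |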
|---|---|---|
| Interface | Verify provider implements interface contract (Name, etc.) | No |
| Behavior | Verify edge case handling (empty input, context cancellation) | Sometimes |
| Integration | Verify actual synthesis/transcription works | Yes |

| STT Test | Description |
|---|---|
| Transcribe | Batch transcription from audio bytes |
| TranscribeFile | Batch transcription from local file path |
| TranscribeURL | Batch transcription from remote URL |
| TranscribeStream | Real-time streaming transcription |

| TTS Test | Description |
|---|---|
| Synthesize | Returns valid audio bytes |
| SynthesizeStream | Streams audio chunks |
| SynthesizeFromReader | Handles streaming text input |

See Provider Conformance Testing TRD for detailed design documentation.
- Twilio ConversationRelay
- RingCentral Voice API
- LiveKit Voice AI
- Daily.co
- Recall.ai - Meeting bot infrastructure