Description
Summary
Add support for plugging voice-based gender detection models into the LiveKit Agents pipeline, allowing agents to infer speaker gender from audio input in real time.
Motivation
Voice-based gender detection is essential for languages with grammatical gender (gender inflection/agreement). Many languages require verbs, adjectives, and participles to agree with the speaker's gender:
- Polish: "Zrobiłem" (male) vs "Zrobiłam" (female) - "I did"
- Russian: "Я сказал" (male) vs "Я сказала" (female) - "I said"
- German: "Ich bin gegangen" vs adjective endings based on gender
- French: "Je suis allé" (male) vs "Je suis allée" (female) - "I went"
- Spanish: "Estoy cansado" (male) vs "Estoy cansada" (female) - "I am tired"
- Italian, Portuguese, Hebrew, Arabic, and many others
For voice AI agents operating in these languages, using incorrect gender forms sounds unnatural and can be confusing or even disrespectful to users. Currently, agents have no way to detect the caller's gender from voice to generate grammatically correct responses.
Proposed Solution
Option 1: Built-in Audio Classification Node
Add a new optional audio_classification_node to the pipeline that runs in parallel with STT (a rough streaming sketch follows the proposed signature below):
```python
async def audio_classification_node(
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
) -> Optional[AsyncIterable[GenderClassificationEvent]]:
    # Returns gender classification events
    ...
```
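To make the real-time aspect concrete, here is a minimal sketch of what a body for this node might look like, assuming a simple sliding-window approach. Everything below is hypothetical: GenderClassificationEvent does not exist in livekit-agents today, and classify_window is a stand-in for whatever model performs the inference.

```python
import numpy as np
from dataclasses import dataclass
from typing import AsyncIterable

from livekit import rtc


@dataclass
class GenderClassificationEvent:
    gender: str        # "male", "female", "unknown"
    confidence: float


def classify_window(waveform: np.ndarray, sample_rate: int) -> tuple[str, float]:
    # Placeholder for real inference (Pyannote, SpeechBrain, custom PyTorch model, ...)
    raise NotImplementedError


async def audio_classification_node(  # would live on the Agent subclass, hence `self`
    self, audio: AsyncIterable[rtc.AudioFrame], model_settings
) -> AsyncIterable[GenderClassificationEvent]:
    window: list[np.ndarray] = []
    buffered_samples = 0

    async for frame in audio:
        # Convert 16-bit PCM frames to a float32 mono waveform
        pcm = np.frombuffer(frame.data, dtype=np.int16).astype(np.float32) / 32768.0
        if frame.num_channels > 1:
            pcm = pcm.reshape(-1, frame.num_channels).mean(axis=1)
        window.append(pcm)
        buffered_samples += len(pcm)

        # Re-classify roughly every 2 seconds of audio so results stream in real time
        if buffered_samples >= frame.sample_rate * 2:
            gender, confidence = classify_window(np.concatenate(window), frame.sample_rate)
            yield GenderClassificationEvent(gender=gender, confidence=confidence)
            window, buffered_samples = [], 0
```

A fresh window per classification keeps the sketch short; overlapping windows or running inference in a thread pool would be obvious refinements.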
Option 2: Plugin System for Gender Classifiers
Create a plugin interface similar to STT/TTS plugins:
```python
from livekit.plugins import gender_classifier

class VoiceGenderClassifier(gender_classifier.GenderClassifier):
    async def classify(self, audio: AsyncIterable[rtc.AudioFrame]) -> GenderResult:
        ...
```
Potential model integrations (an implementation sketch follows the list below):
- Pyannote Audio - Speaker diarization with gender inference
- SpeechBrain - Open-source speech models
- Resemblyzer - Speaker embeddings
- Custom TensorFlow/PyTorch models
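For illustration, here is a minimal sketch of what a plugin built on the proposed interface might look like when wrapping a custom PyTorch model. It is only a sketch: GenderResult and the GenderClassifier base class are the proposed API (so TorchGenderClassifier does not subclass anything real yet), the label ordering depends on how the model was trained, and resampling/multi-channel handling are omitted.

```python
import numpy as np
import torch
from dataclasses import dataclass
from typing import AsyncIterable

from livekit import rtc


@dataclass
class GenderResult:
    gender: str        # "male", "female", "unknown"
    confidence: float


LABELS = ["female", "male"]  # order depends on how the model was trained


class TorchGenderClassifier:  # would subclass the proposed gender_classifier.GenderClassifier
    def __init__(self, model: torch.nn.Module) -> None:
        # `model` is any PyTorch module mapping (1, num_samples) float32 audio
        # to per-class logits; Pyannote/SpeechBrain models could be wrapped similarly
        self._model = model.eval()

    async def classify(self, audio: AsyncIterable[rtc.AudioFrame]) -> GenderResult:
        # Same 16-bit PCM to float32 conversion as the Option 1 sketch, collapsed for brevity
        chunks = [
            np.frombuffer(f.data, dtype=np.int16).astype(np.float32) / 32768.0
            async for f in audio
        ]
        if not chunks:
            return GenderResult(gender="unknown", confidence=0.0)

        waveform = torch.from_numpy(np.concatenate(chunks)).unsqueeze(0)
        with torch.inference_mode():
            probs = torch.softmax(self._model(waveform), dim=-1).squeeze(0)
        confidence, index = torch.max(probs, dim=-1)
        return GenderResult(gender=LABELS[int(index)], confidence=float(confidence))
```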
Option 3: Enhanced STT with Gender Metadata
Enhance the STT interface to optionally return gender metadata alongside transcription:
```python
class SpeechEvent:
    text: str
    speaker_gender: Optional[str]   # "male", "female", "unknown"
    gender_confidence: Optional[float]
```
Acceptance Criteria
- Ability to plug in custom gender detection models
- Access to detected gender within agent logic (for prompt construction, TTS voice selection; see the sketch after this list)
- Support for real-time streaming classification
- Documentation and example implementation
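To illustrate the second criterion, one possible shape for the agent-side hook is sketched below. The confidence threshold, prompt text, and voice IDs are all placeholders, and GenderResult is the type from the Option 2 sketch, not an existing livekit-agents type.

```python
from typing import Optional

# GenderResult as defined in the Option 2 sketch above

GENDER_PROMPTS = {
    "male": "The caller is male; use masculine agreement forms where the language requires them.",
    "female": "The caller is female; use feminine agreement forms where the language requires them.",
}

TTS_VOICES = {"male": "voice-id-a", "female": "voice-id-b"}  # placeholder voice IDs


def apply_detected_gender(result: GenderResult, instructions: str) -> tuple[str, Optional[str]]:
    """Return updated LLM instructions and an optional TTS voice override."""
    if result.gender not in GENDER_PROMPTS or result.confidence < 0.8:
        # Below the confidence threshold, stay neutral rather than risk the wrong form
        return instructions, None
    return f"{instructions}\n{GENDER_PROMPTS[result.gender]}", TTS_VOICES[result.gender]
```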
Alternatives Considered
- Ask the user directly: Works but creates friction and feels unnatural in voice conversations
- Use neutral forms where possible: Not always grammatically correct or natural in gendered languages
- Custom STT node override: Currently possible, but requires users to implement their own audio buffering and model integration (a rough sketch of this workaround follows this list)
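For completeness, the last alternative roughly looks like the sketch below today: the agent overrides stt_node, tees the audio into the default STT implementation and a classifier, and stores the result for later prompt construction. This assumes the overridable stt_node hook and Agent.default.stt_node available in recent livekit-agents releases (import paths may differ by version); the classifier is the hypothetical one sketched under Option 2.

```python
import asyncio
from typing import AsyncIterable, Optional

from livekit import rtc
from livekit.agents import Agent, ModelSettings  # import paths may vary by version


class GenderAwareAgent(Agent):
    def __init__(self, *, instructions: str, classifier) -> None:
        super().__init__(instructions=instructions)
        self._classifier = classifier            # e.g. the TorchGenderClassifier sketch above
        self.detected_gender: Optional[str] = None

    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ):
        # Tee the audio: one copy feeds the default STT, the other feeds the classifier
        queue: asyncio.Queue[Optional[rtc.AudioFrame]] = asyncio.Queue()

        async def tee() -> AsyncIterable[rtc.AudioFrame]:
            async for frame in audio:
                await queue.put(frame)
                yield frame
            await queue.put(None)  # signal end of stream to the classifier side

        async def drain() -> AsyncIterable[rtc.AudioFrame]:
            while (frame := await queue.get()) is not None:
                yield frame

        async def run_classifier() -> None:
            result = await self._classifier.classify(drain())
            self.detected_gender = result.gender

        classify_task = asyncio.create_task(run_classifier())
        try:
            # Forward the teed stream into the built-in STT implementation
            async for event in Agent.default.stt_node(self, tee(), model_settings):
                yield event
        finally:
            classify_task.cancel()
```

Because the classifier consumes its own copy of the frames from an unbounded queue, the transcription path is never blocked; the cost of this workaround is exactly the buffering and model plumbing that a first-class plugin interface would absorb.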
Additional Context
This feature would enable LiveKit Agents to properly serve users in languages with grammatical gender, which represent a significant portion of the world's languages and speakers.
Related documentation: