Environment
- livekit-agents: 1.3.11 (also confirmed on 1.3.12 plugin source)
- livekit-plugins-simli: 1.3.11
- livekit-plugins-hedra: 1.3.11
- Python: 3.11
- OS: macOS (LiveKit Cloud deployment)
- LLM: Google Gemini Native Audio (`gemini-2.5-flash-native-audio-preview-12-2025`)
Description
Google Gemini Realtime (Native Audio) produces zero audio output when routed through the Simli avatar plugin's DataStreamAudioOutput. The same Gemini model works correctly with Hedra's DataStreamAudioOutput. The issue is Simli-specific, not a general DataStreamAudioOutput or Gemini problem.
Steps to Reproduce
- Create an `AgentSession` with `google.beta.realtime.RealtimeModel(modalities=["AUDIO"])`.
- Create a `simli.AvatarSession` with a valid `face_id` and `api_key`.
- Call `avatar_session.start(session, room)`, then `session.start(agent, room)`.
- Call `session.generate_reply()` or wait for user audio input.
- Observe: Gemini creates generations but produces zero `model_turn` content: no `inline_data` (audio), no `text`, no `output_transcription`.
Expected Behavior
Gemini should produce audio output that flows through Simli's DataStreamAudioOutput to the Simli avatar worker, identical to how it works with Hedra.
Actual Behavior
Gemini acknowledges requests (generation lifecycle events fire) but produces empty generations:
```jsonc
// Response #1: generation_complete with no model_turn
{
  "has_model_turn": false,
  "turn_complete": null,
  "generation_complete": true
}
// Response #2: turn_complete with no model_turn
{
  "has_model_turn": false,
  "turn_complete": true,
  "generation_complete": null
}
```

The audio routing chain is healthy (`DataStreamAudioOutput._started=True`, `audio_enabled=True`), but zero frames arrive because Gemini produces nothing upstream.
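When reproducing this, it can help to flag empty turns automatically instead of reading raw logs. The sketch below is a hypothetical logging helper, not part of the Gemini plugin; it assumes each server response has already been summarized into a dict with the three keys shown above (`has_model_turn`, `turn_complete`, `generation_complete`) and reports, per completed turn, whether any response in that turn carried a `model_turn`.

```python
from typing import Iterable, List, Mapping, Optional

# Hypothetical summary format, mirroring the fields logged in the issue.
Summary = Mapping[str, Optional[bool]]


def turns_with_output(responses: Iterable[Summary]) -> List[bool]:
    """Return one flag per completed turn: True if any response in it had a model_turn."""
    results: List[bool] = []
    saw_model_turn = False
    for r in responses:
        if r.get("has_model_turn"):
            saw_model_turn = True
        if r.get("turn_complete"):
            results.append(saw_model_turn)
            saw_model_turn = False  # reset for the next turn
    return results


# The two responses from the failing Simli session form one empty turn.
empty_turn = [
    {"has_model_turn": False, "turn_complete": None, "generation_complete": True},
    {"has_model_turn": False, "turn_complete": True, "generation_complete": None},
]
# For comparison, a hypothetical healthy turn where content arrived before turn_complete.
healthy_turn = [
    {"has_model_turn": True, "turn_complete": None, "generation_complete": None},
    {"has_model_turn": False, "turn_complete": True, "generation_complete": True},
]

if __name__ == "__main__":
    print(turns_with_output(empty_turn))    # [False]
    print(turns_with_output(healthy_turn))  # [True]
```

Grouping by `turn_complete` rather than counting responses avoids reporting the two responses above as two separate failures, since they close the same turn.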
Root Cause Analysis
The Simli plugin is the only avatar plugin that omits `wait_remote_track` when creating `DataStreamAudioOutput`:

| Plugin | `wait_remote_track` | Result with Gemini |
|---|---|---|
| Hedra | `rtc.TrackKind.KIND_VIDEO` | Works |
| Simli | `None` (omitted) | Fails |
| Bey | `KIND_VIDEO` | Untested |
| Tavus | `KIND_VIDEO` | Untested |
| Anam | `KIND_VIDEO` | Untested |
| Avatario | `KIND_VIDEO` | Untested |
| LemonSlice | `KIND_VIDEO` | Untested |
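To make the difference in the table concrete, here is a simplified asyncio model of what gating on `wait_remote_track` means for startup. It is not the livekit-agents implementation: `ToyAudioOutput`, `publish_track`, and the string track kinds are hypothetical stand-ins. It assumes that an output given a track kind waits for a matching remote track before reporting itself started, while one given `None` reports started immediately, which is consistent with the Hedra/Simli difference and with the deadlock seen when forcing `KIND_VIDEO` on Simli.

```python
import asyncio
from typing import Optional


class ToyAudioOutput:
    """Hypothetical model of an audio output gated on a remote track kind."""

    def __init__(self, wait_remote_track: Optional[str] = None) -> None:
        self.wait_remote_track = wait_remote_track
        self.started = False
        self._track_published = asyncio.Event()

    def publish_track(self, kind: str) -> None:
        # Stand-in for the avatar worker publishing a track into the room.
        if kind == self.wait_remote_track:
            self._track_published.set()

    async def start(self) -> None:
        if self.wait_remote_track is not None:
            # Hedra-style: block until a matching remote track appears.
            await self._track_published.wait()
        # Simli-style (None): no gate, so the output is "started" right away.
        self.started = True


async def demo() -> dict:
    results = {}

    simli_like = ToyAudioOutput(wait_remote_track=None)
    await simli_like.start()
    results["none"] = simli_like.started

    hedra_like = ToyAudioOutput(wait_remote_track="video")
    task = asyncio.create_task(hedra_like.start())
    await asyncio.sleep(0)  # let start() begin waiting
    hedra_like.publish_track("video")
    await task
    results["video_published"] = hedra_like.started

    # If the worker never publishes video, start() never returns on its own.
    stuck = ToyAudioOutput(wait_remote_track="video")
    try:
        await asyncio.wait_for(stuck.start(), timeout=0.1)
        results["video_missing"] = "started"
    except asyncio.TimeoutError:
        results["video_missing"] = "timed out"
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
    # {'none': True, 'video_published': True, 'video_missing': 'timed out'}
```

The third case stands in for the deadlock: with no video track ever published, `start()` would wait forever, so the demo bounds it with `asyncio.wait_for` only to make the hang observable.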
Attempted fixes (all failed):
- Adding `wait_remote_track=KIND_VIDEO` after `avatar_session.start()`: replacing Simli's `DataStreamAudioOutput` with one that includes `wait_remote_track=KIND_VIDEO` causes the session to deadlock, because Simli's worker does not publish a video track that can be awaited.
- Updating the plugin version: the Simli plugin source is identical across 1.3.11, 1.3.12, and 1.4.0rc2, with no changes to `DataStreamAudioOutput` construction.
- Adding a startup delay: a 3-second `asyncio.sleep()` before `generate_reply()` for Simli sessions had no effect.
Key Observations
- Simli is the only avatar plugin that omits `wait_remote_track` in its `DataStreamAudioOutput` constructor.
- Simli's worker does not publish a video track (adding `wait_remote_track=KIND_VIDEO` deadlocks).
- OpenAI Realtime + Simli works, so the issue is specific to the Gemini + Simli combination.
- Gemini + Hedra works, so the issue is specific to Simli, not to Gemini or `DataStreamAudioOutput` in general.
- This may be related to #3353, "LLM does not respond on second connection in LiveKit-Agent + Simli (Gemini Realtime/OpenAI)" (Simli + LLM second connection failure), which also reports Simli-specific issues.
Minimal Reproduction
```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import google, simli


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.5-flash-native-audio-preview-12-2025",
            modalities=["AUDIO"],
            voice="Puck",
        ),
        resume_false_interruption=False,
    )

    avatar = simli.AvatarSession(
        simli_config=simli.SimliConfig(
            api_key="YOUR_SIMLI_API_KEY",
            face_id="YOUR_FACE_ID",
        ),
    )
    # Routes the session's audio output through Simli's DataStreamAudioOutput.
    await avatar.start(session, ctx.room)

    await session.start(
        agent=Agent(instructions="Greet the user."),
        room=ctx.room,
    )

    # This produces zero audio output with Simli.
    # Replace simli with hedra.AvatarSession and it works.
    await session.generate_reply(instructions="Say hello")


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

Suggested Investigation
The architectural difference (no `wait_remote_track`, no video track publication) suggests that Simli's `DataStreamAudioOutput` reaches `_started=True` earlier than the other plugins' outputs do. This may cause a race condition in which Gemini's realtime session sees the audio output as ready while Simli's downstream worker is not yet prepared to receive audio, causing Gemini to silently produce empty generations. OpenAI's realtime model may be more resilient to this timing issue.
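One way to check this hypothesis would be to timestamp when the audio output reports itself started and when the Simli worker is actually ready, then compare the ordering across Gemini and OpenAI runs. The sketch below is a hypothetical helper: the event names `output_started` and `avatar_ready` are made up for illustration, and where to record them inside the plugins is left open.

```python
import time
from typing import Dict, Optional


class Timeline:
    """Record the first time each (hypothetical) lifecycle event fires."""

    def __init__(self) -> None:
        self.events: Dict[str, float] = {}

    def mark(self, name: str, at: Optional[float] = None) -> None:
        # Use a monotonic clock by default; only the first occurrence is kept.
        self.events.setdefault(name, time.monotonic() if at is None else at)

    def started_before_ready(self) -> Optional[bool]:
        """True if the audio output reported started before the avatar worker was ready.

        Returns None if either event was never recorded.
        """
        started = self.events.get("output_started")
        ready = self.events.get("avatar_ready")
        if started is None or ready is None:
            return None
        return started < ready


if __name__ == "__main__":
    tl = Timeline()
    tl.mark("output_started", at=0.0)  # e.g. when _started flips to True
    tl.mark("avatar_ready", at=1.5)    # e.g. when the worker can accept audio
    print(tl.started_before_ready())   # True
```

If Gemini + Simli runs consistently show `True` while Hedra runs show `False`, that would support the race hypothesis; it would not by itself explain why OpenAI tolerates the same ordering.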