Gemini Realtime produces zero audio output when used with Simli avatar plugin (works with Hedra) #4648

@Adrian-Chy

Environment

  • livekit-agents: 1.3.11 (also confirmed on 1.3.12 plugin source)
  • livekit-plugins-simli: 1.3.11
  • livekit-plugins-hedra: 1.3.11
  • Python: 3.11
  • OS: macOS (LiveKit Cloud deployment)
  • LLM: Google Gemini Native Audio (gemini-2.5-flash-native-audio-preview-12-2025)

Description

Google Gemini Realtime (Native Audio) produces zero audio output when routed through the Simli avatar plugin's DataStreamAudioOutput. The same Gemini model works correctly with Hedra's DataStreamAudioOutput. The issue is Simli-specific, not a general DataStreamAudioOutput or Gemini problem.

Steps to Reproduce

  1. Create an AgentSession with google.beta.realtime.RealtimeModel(modalities=["AUDIO"])
  2. Create a simli.AvatarSession with a valid face_id and api_key
  3. Call avatar_session.start(session, room), then session.start(agent, room)
  4. Call session.generate_reply() or wait for user audio input
  5. Observe: Gemini creates generations but produces zero model_turn content — no inline_data (audio), no text, no output_transcription

Expected Behavior

Gemini should produce audio output that flows through Simli's DataStreamAudioOutput to the Simli avatar worker, identical to how it works with Hedra.

Actual Behavior

Gemini acknowledges requests (generation lifecycle events fire) but produces empty generations:

// Response #1: generation_complete with no model_turn
{
  "has_model_turn": false,
  "turn_complete": null,
  "generation_complete": true
}

// Response #2: turn_complete with no model_turn
{
  "has_model_turn": false,
  "turn_complete": true,
  "generation_complete": null
}

The audio routing chain is healthy — DataStreamAudioOutput._started=True, audio_enabled=True — but zero frames arrive because Gemini produces nothing upstream.
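To flag this condition automatically while debugging, the logged responses can be checked for a turn that completes without any model_turn. The following is a minimal sketch that assumes each response has already been reduced to a dict with the three keys shown in the logs above. The helper name is_empty_turn is hypothetical and not part of livekit-agents or the Gemini SDK.

```python
from typing import Iterable, Mapping, Optional


def is_empty_turn(responses: Iterable[Mapping[str, Optional[bool]]]) -> bool:
    """Return True if the turn completed and no response carried a model_turn.

    Each item is assumed to be a dict shaped like the logged summaries:
    {"has_model_turn": ..., "turn_complete": ..., "generation_complete": ...}.
    """
    saw_turn_complete = False
    for response in responses:
        if response.get("has_model_turn"):
            # Any model_turn content means the generation was not empty.
            return False
        if response.get("turn_complete"):
            saw_turn_complete = True
    return saw_turn_complete


# The two responses observed with Simli, copied from the logs above.
observed = [
    {"has_model_turn": False, "turn_complete": None, "generation_complete": True},
    {"has_model_turn": False, "turn_complete": True, "generation_complete": None},
]
print(is_empty_turn(observed))
```

A turn that is still in progress (no turn_complete yet) is deliberately not reported as empty, so the check can be run incrementally as responses arrive.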

Root Cause Analysis

The Simli plugin is the only avatar plugin that omits wait_remote_track when creating DataStreamAudioOutput:

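| Plugin     | wait_remote_track          | Result with Gemini |
|------------|----------------------------|--------------------|
| Hedra      | rtc.TrackKind.KIND_VIDEO   | Works              |
| Simli      | None (omitted)             | Fails              |
| Bey        | KIND_VIDEO                 | Untested           |
| Tavus      | KIND_VIDEO                 | Untested           |
| Anam       | KIND_VIDEO                 | Untested           |
| Avatario   | KIND_VIDEO                 | Untested           |
| LemonSlice | KIND_VIDEO                 | Untested           |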

Attempted fixes (all failed):

  1. Adding wait_remote_track=KIND_VIDEO after avatar_session.start(): Replacing Simli's DataStreamAudioOutput with one that includes wait_remote_track=KIND_VIDEO causes the session to deadlock — Simli's worker does not publish a video track that can be awaited.

  2. Updating plugin version: Simli plugin source is identical across 1.3.11, 1.3.12, and 1.4.0rc2 — no changes to DataStreamAudioOutput construction.

  3. Adding startup delay: 3-second asyncio.sleep() before generate_reply() for Simli sessions — no effect.
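The deadlock in attempt 1 is what an unbounded wait looks like when the awaited track is never published. The sketch below shows only the general pattern of bounding such a wait with a timeout so the failure is visible instead of hanging. It uses an asyncio.Event as a stand-in for "remote video track published" and does not call any LiveKit API. The function name wait_for_track and both events are hypothetical.

```python
import asyncio


async def wait_for_track(published: asyncio.Event, timeout: float) -> bool:
    """Wait until `published` is set, giving up after `timeout` seconds.

    `published` is a stand-in for the remote-track signal; it is not a
    LiveKit object.
    """
    try:
        await asyncio.wait_for(published.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def main() -> tuple:
    never_published = asyncio.Event()  # worker that never publishes a video track
    already_published = asyncio.Event()  # worker whose track is already available
    already_published.set()
    timed_out_case = await wait_for_track(never_published, timeout=0.05)
    ready_case = await wait_for_track(already_published, timeout=0.05)
    return timed_out_case, ready_case


timed_out_case, ready_case = asyncio.run(main())
print(timed_out_case, ready_case)
```

A bounded wait does not fix the missing audio. It only turns the hang into a reportable timeout, which makes it easier to confirm that no video track is being published.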

Key Observations

  • Simli is the only avatar plugin that omits wait_remote_track in its DataStreamAudioOutput constructor
  • Simli's worker does not publish a video track (adding wait_remote_track=KIND_VIDEO deadlocks)
  • OpenAI Realtime + Simli works, so the failure requires the Gemini + Simli combination
  • Gemini + Hedra works, so neither Gemini nor DataStreamAudioOutput is broken in general
  • This may be related to #3353, "LLM does not respond on second connection in LiveKit-Agent + Simli (Gemini Realtime/OpenAI)", which also reports a Simli-specific failure

Minimal Reproduction

from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import google, simli

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Gemini native-audio realtime model, audio-only output.
    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.5-flash-native-audio-preview-12-2025",
            modalities=["AUDIO"],
            voice="Puck",
        ),
        resume_false_interruption=False,
    )

    avatar = simli.AvatarSession(
        simli_config=simli.SimliConfig(
            api_key="YOUR_SIMLI_API_KEY",
            face_id="YOUR_FACE_ID",
        ),
    )
    # The avatar must be started before the session so it can take over audio output.
    await avatar.start(session, ctx.room)

    await session.start(
        agent=Agent(instructions="Greet the user."),
        room=ctx.room,
    )

    # This produces zero audio output with Simli.
    # Replace simli with hedra.AvatarSession and it works.
    await session.generate_reply(instructions="Say hello")

# Standard worker launcher; assumed here, since the original snippet imported cli without using it.
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Suggested Investigation

The architectural difference (no wait_remote_track, no video track publication) suggests Simli's DataStreamAudioOutput enters _started=True earlier than other plugins. This may cause a race condition where Gemini's realtime session sees the audio output as "ready" but Simli's downstream worker isn't actually prepared to receive audio, causing Gemini to silently produce empty generations. OpenAI's realtime model may be more resilient to this timing issue.
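One way to test this hypothesis is to timestamp the relevant milestones in both a Simli run and a Hedra run and compare the gaps. The helper below is a minimal sketch for that purpose. The Timeline class and the milestone names ("output_started", "generate_reply_called") are hypothetical, and where to call mark() inside the agent or plugin code is left to whoever instruments it.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Timeline:
    """Records the first time each named milestone is reached."""

    marks: Dict[str, float] = field(default_factory=dict)

    def mark(self, name: str) -> None:
        # Keep only the first occurrence so repeated calls do not move the mark.
        self.marks.setdefault(name, time.monotonic())

    def gap(self, earlier: str, later: str) -> Optional[float]:
        """Seconds from `earlier` to `later`, or None if either is missing."""
        if earlier not in self.marks or later not in self.marks:
            return None
        return self.marks[later] - self.marks[earlier]


# Hypothetical milestone names; place the mark() calls wherever those events occur.
timeline = Timeline()
timeline.mark("output_started")
timeline.mark("generate_reply_called")
print(timeline.gap("output_started", "generate_reply_called"))
print(timeline.gap("output_started", "first_audio_frame"))
```

If the race-condition explanation holds, the Simli run should show "output_started" noticeably earlier relative to the worker becoming ready than the Hedra run does. A missing "first_audio_frame" mark corresponds to the empty generations described above.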
