
Issue when processing real time audio from a Twilio media stream #304

Closed as not planned
@Pablo-Merino

Description


Hello! I'm trying to hook up a Twilio media stream to an Agent with the voice pipeline.

My process is more or less the following (a rough sketch of the receive loop follows this list):

  1. I receive the Twilio call over a websocket and start processing the stream events
  2. I transcode the Twilio audio from 8 kHz μ-law to the expected 24 kHz mono PCM
  3. I add the audio to a StreamedAudioInput instance
  4. I listen to the events from the pipeline run
  5. I transcode the audio from OpenAI back into Twilio's format
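
Roughly, the receive side looks like this (a simplified sketch: the FastAPI-style websocket iteration, the import path, and the variable names are illustrative, and mulaw_to_openai_pcm is the helper shown further down):

import base64
import json

from agents.voice import StreamedAudioInput  # import path may vary by SDK version

audio_input = StreamedAudioInput()

async def handle_twilio_stream(websocket):
    # Simplified receive side: decode each Twilio media frame and feed the pipeline input
    async for message in websocket.iter_text():
        event = json.loads(message)
        if event.get("event") == "media":
            # Twilio sends 8 kHz mu-law audio as a base64-encoded payload
            mulaw_bytes = base64.b64decode(event["media"]["payload"])
            await audio_input.add_audio(mulaw_to_openai_pcm(mulaw_bytes))
        elif event.get("event") == "stop":
            break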

Right now I have a "working" system: I can accept a call, the audio gets processed and sent to OpenAI, and I get a response back that I can hear through the phone call.

The issue is that the transcript is garbled. I don't mean the audio itself: if I listen to it, it's clear enough, and I'm able to transcribe it just fine with a standard transcription call. It's the pipeline's transcript that bears no resemblance to what is actually being said.

Here are two examples:

  • Here I'm just saying "test" but the transcription is "Hi."

This is the audio file: https://filebin.net/3do528busqpnegro/span_239e5f3eb17349dfa9fc64ec-input.wav

  • Here I'm saying "probando" (in Spanish, "testing"), and it comes out as two Chinese characters (看看)

This is the audio file: https://filebin.net/3do528busqpnegro/span_9577c4b6fd674cb794081ada-input.wav

I can't figure out what's wrong; I have a feeling it has something to do with how the audio is processed.

This is how I'm setting up the pipeline:

from agents.voice import (  # import paths may vary with the openai-agents version
    SingleAgentVoiceWorkflow,
    STTModelSettings,
    VoicePipeline,
    VoicePipelineConfig,
    OpenAIVoiceModelProvider,
)

pipeline = VoicePipeline(
    workflow=SingleAgentVoiceWorkflow(
        agent,
    ),
    config=VoicePipelineConfig(
        model_provider=OpenAIVoiceModelProvider(
            api_key=OPENAI_API_KEY,
        ),
        workflow_name="Agent",
        stt_settings=STTModelSettings(
            turn_detection={"type": "semantic_vad", "eagerness": "low"},
        ),
    ),
)
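
And this is roughly how I run the pipeline and push the audio back to Twilio (a simplified sketch: the result.stream() / "voice_stream_event_audio" handling follows my reading of the SDK docs, the streamSid plumbing is trimmed down, and openai_audio_to_twilio_mulaw is the helper shown below):

import base64
import json

async def stream_responses(twilio_ws, stream_sid: str):
    # Simplified output side: run the pipeline on the shared input and forward audio events
    result = await pipeline.run(audio_input)

    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            # event.data is the synthesized audio chunk coming back from the TTS model
            mulaw_bytes = openai_audio_to_twilio_mulaw(event.data)
            await twilio_ws.send_text(json.dumps({
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": base64.b64encode(mulaw_bytes).decode("ascii")},
            }))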

This is the audio processing code:

import numpy as np
import audioop
import soxr


def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
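    # Decode 8 kHz mu-law to 16-bit linear PCM (width=2, i.e. int16 samples)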
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)

    audio_24k = soxr.resample(audio_np, 8000, 24000)

    # Convert to float32 range [-1.0, 1.0] as expected by OpenAI
    return (audio_24k / 32768.0).astype(np.float32)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    # Normalize dtype
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")

    # Resample from 24kHz → 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)

    # Convert to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)

    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)
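
For what it's worth, a quick way to sanity-check the first helper in isolation (just a sketch, assuming soundfile is installed to write out a WAV I can listen to):

import numpy as np
import audioop
import soundfile as sf

# Build one second of a 440 Hz tone at 8 kHz and mu-law encode it, like Twilio would send it
t = np.arange(8000) / 8000.0
tone_int16 = (np.sin(2 * np.pi * 440 * t) * 20000).astype(np.int16)
mulaw = audioop.lin2ulaw(tone_int16.tobytes(), 2)

# Run it through the converter and dump the result to check it still sounds right at 24 kHz
pcm_24k = mulaw_to_openai_pcm(mulaw)
sf.write("sanity-check-24k.wav", pcm_24k, 24000, subtype="FLOAT")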

P.S.: I'm adding the processed Twilio audio to the StreamedAudioInput as soon as I receive it; maybe that has something to do with it?


Labels

question (Question about using the SDK), stale
