Description
Hello! I'm trying to hook up a Twilio media stream to an Agent with the voice pipeline.
My process is more or less the following (a simplified sketch of the handler follows the list):
- I receive the Twilio call through the websockets and start processing the events
- I transcode the Twilio audio from 8kHz mu-Law to the expected 24kHz Mono PCM
- I add the audio to the instance of StreamedAudioInput
- I listen to the events from the pipeline run
- I transcode the audio coming back from OpenAI to Twilio's format
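For reference, the handler is shaped roughly like this (a simplified sketch, not my exact code: `handle_twilio_call` is a placeholder name and I'm assuming a FastAPI/Starlette-style WebSocket; the pipeline setup and the two transcoding helpers are shown further down):

```python
import asyncio
import base64
import json

from agents.voice import StreamedAudioInput


async def handle_twilio_call(websocket, pipeline):
    await websocket.accept()
    audio_input = StreamedAudioInput()
    stream_sid = None

    async def twilio_to_pipeline():
        # Read Twilio media-stream messages and feed decoded audio to the pipeline.
        nonlocal stream_sid
        async for message in websocket.iter_text():
            event = json.loads(message)
            if event["event"] == "start":
                stream_sid = event["start"]["streamSid"]
            elif event["event"] == "media":
                # Twilio sends base64-encoded 8kHz mu-law frames
                mulaw = base64.b64decode(event["media"]["payload"])
                await audio_input.add_audio(mulaw_to_openai_pcm(mulaw))
            elif event["event"] == "stop":
                break

    async def pipeline_to_twilio():
        # Stream the agent's audio back to Twilio as mu-law media messages.
        result = await pipeline.run(audio_input)
        async for ev in result.stream():
            if ev.type == "voice_stream_event_audio" and ev.data is not None:
                payload = base64.b64encode(openai_audio_to_twilio_mulaw(ev.data)).decode()
                await websocket.send_text(json.dumps({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": payload},
                }))

    await asyncio.gather(twilio_to_pipeline(), pipeline_to_twilio())
```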
What I have now is a "working" system: I can accept a call, the audio gets processed and sent to OpenAI, and I get a response back that I can hear through the phone call.
The issue is that the audio transcript comes out garbled. I don't mean the audio itself: if I listen to the recording, it's clear enough, and I can transcribe it just fine with a standard transcription call. It's the pipeline's transcript that bears no resemblance to what is actually being said.
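(By "standard transcription call" I mean something along these lines, run against the saved input WAVs; the model and file name here are just examples:)

```python
from openai import OpenAI

client = OpenAI()

# Transcribe one of the saved input files directly, outside the voice pipeline.
with open("input.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcription.text)  # this matches what was actually said
```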
Here are two examples:
- Here I'm just saying "test" but the transcription is "Hi."

This is the audio file: https://filebin.net/3do528busqpnegro/span_239e5f3eb17349dfa9fc64ec-input.wav
- Here I'm saying "probando" (in Spanish, "testing"), and it comes out as two Chinese characters (看看)

This is the audio file: https://filebin.net/3do528busqpnegro/span_9577c4b6fd674cb794081ada-input.wav
I cannot figure out what's wrong; I have a feeling it's related to how the audio is being processed.
This is how I'm setting up the pipeline:
```python
pipeline = VoicePipeline(
    workflow=SingleAgentVoiceWorkflow(
        agent,
    ),
    config=VoicePipelineConfig(
        model_provider=OpenAIVoiceModelProvider(
            api_key=OPENAI_API_KEY,
        ),
        workflow_name="Agent",
        stt_settings=STTModelSettings(
            turn_detection={"type": "semantic_vad", "eagerness": "low"},
        ),
    ),
)
```
This is the audio processing code:
```python
import numpy as np
import audioop
import soxr


def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)
    audio_24k = soxr.resample(audio_np, 8000, 24000)
    # Convert to float32 range [-1.0, 1.0] as expected by OpenAI
    return (audio_24k / 32768.0).astype(np.float32)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    # Normalize dtype
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")
    # Resample from 24kHz → 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)
    # Convert to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)
    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)
```
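For what it's worth, here's a quick round-trip check I can use to sanity-check these two helpers in isolation (illustrative only; the test tone and tolerances are arbitrary, and it assumes the two functions above are in scope):

```python
import audioop

import numpy as np


def roundtrip_sanity_check():
    # 8kHz mu-law -> 24kHz float32 -> 8kHz mu-law round trip on a test tone.
    sr = 8000
    t = np.arange(sr) / sr  # one second of audio
    tone = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
    mulaw = audioop.lin2ulaw(tone.tobytes(), 2)

    pcm_24k = mulaw_to_openai_pcm(mulaw)
    assert pcm_24k.dtype == np.float32
    assert abs(len(pcm_24k) - 3 * len(tone)) < 10  # 8kHz -> 24kHz triples the sample count
    assert np.max(np.abs(pcm_24k)) <= 1.0          # stays in the expected float range

    back = openai_audio_to_twilio_mulaw(pcm_24k)
    assert abs(len(back) - len(mulaw)) < 10        # one mu-law byte per 8kHz sample
```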
PS: I'm adding the processed Twilio audio to the StreamedAudioInput as soon as I receive it; maybe it's got something to do with that?
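In case that's relevant, one variant I could try is buffering a few Twilio frames before pushing them into the pipeline, roughly like this (`on_twilio_media` is just a placeholder name and the ~100ms threshold is arbitrary):

```python
import base64

# Hypothetical variant: accumulate ~100ms of 8kHz mu-law (1 byte per sample)
# before converting and pushing, instead of one ~20ms frame at a time.
mulaw_buffer = bytearray()


async def on_twilio_media(payload_b64: str, audio_input):
    mulaw_buffer.extend(base64.b64decode(payload_b64))
    if len(mulaw_buffer) >= 800:  # 800 bytes == 100ms at 8kHz mu-law
        chunk = bytes(mulaw_buffer)
        mulaw_buffer.clear()
        await audio_input.add_audio(mulaw_to_openai_pcm(chunk))
```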