Description
Hello! I'm trying to hook up a Twilio media stream to an Agent with the voice pipeline.
My process is more or less the following (a simplified sketch of the handler follows the list):
- I receive the Twilio call through the websockets and start processing the events
- I transcode the Twilio audio from 8kHz mu-Law to the expected 24kHz Mono PCM
- I add the audio to the instance of StreamedAudioInput
- I listen to the events from the pipeline run
- I transcode the audio coming back from OpenAI to Twilio's format
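For reference, the handler is shaped roughly like this (a simplified sketch, not my exact code: `handle_twilio_call` is a placeholder name and I'm assuming a FastAPI/Starlette-style WebSocket; the pipeline setup and the two transcoding helpers are shown further down):

```python
import asyncio
import base64
import json

from agents.voice import StreamedAudioInput


async def handle_twilio_call(websocket, pipeline):
    await websocket.accept()
    audio_input = StreamedAudioInput()
    stream_sid = None

    async def twilio_to_pipeline():
        # Read Twilio media-stream messages and feed decoded audio to the pipeline.
        nonlocal stream_sid
        async for message in websocket.iter_text():
            event = json.loads(message)
            if event["event"] == "start":
                stream_sid = event["start"]["streamSid"]
            elif event["event"] == "media":
                # Twilio sends base64-encoded 8kHz mu-law frames
                mulaw = base64.b64decode(event["media"]["payload"])
                await audio_input.add_audio(mulaw_to_openai_pcm(mulaw))
            elif event["event"] == "stop":
                break

    async def pipeline_to_twilio():
        # Stream the agent's audio back to Twilio as mu-law media messages.
        result = await pipeline.run(audio_input)
        async for ev in result.stream():
            if ev.type == "voice_stream_event_audio" and ev.data is not None:
                payload = base64.b64encode(openai_audio_to_twilio_mulaw(ev.data)).decode()
                await websocket.send_text(json.dumps({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": payload},
                }))

    await asyncio.gather(twilio_to_pipeline(), pipeline_to_twilio())
```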
What I have now is a "working" system: I can accept a call, the audio gets processed and sent to OpenAI, and I get a response back that I can hear through the phone call.
The issue is that the audio transcript comes out garbled. I don't mean the audio itself: if I listen to the recording, it's clear enough, and I can transcribe it just fine with a standard transcription call. It's the pipeline's transcript that bears no resemblance to what is actually being said.
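(By "standard transcription call" I mean something along these lines, run against the saved input WAVs; the model and file name here are just examples:)

```python
from openai import OpenAI

client = OpenAI()

# Transcribe one of the saved input files directly, outside the voice pipeline.
with open("input.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcription.text)  # this matches what was actually said
```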
Here are two examples:
- Here I'm just saying "test" but the transcription is "Hi."

This is the audio file: https://filebin.net/3do528busqpnegro/span_239e5f3eb17349dfa9fc64ec-input.wav
- Here I'm saying "probando" (in Spanish, "testing"), and it comes out as two Chinese characters (看看)

This is the audio file: https://filebin.net/3do528busqpnegro/span_9577c4b6fd674cb794081ada-input.wav
I cannot figure out what's wrong; I have a feeling it's related to how the audio is being processed.
This is how I'm setting up the pipeline:
```python
pipeline = VoicePipeline(
    workflow=SingleAgentVoiceWorkflow(
        agent,
    ),
    config=VoicePipelineConfig(
        model_provider=OpenAIVoiceModelProvider(
            api_key=OPENAI_API_KEY,
        ),
        workflow_name="Agent",
        stt_settings=STTModelSettings(
            turn_detection={"type": "semantic_vad", "eagerness": "low"},
        ),
    ),
)
```
This is the audio processing code:
```python
import numpy as np
import audioop
import soxr


def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)
    audio_24k = soxr.resample(audio_np, 8000, 24000)
    # Convert to float32 range [-1.0, 1.0] as expected by OpenAI
    return (audio_24k / 32768.0).astype(np.float32)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    # Normalize dtype
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")
    # Resample from 24kHz → 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)
    # Convert to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)
    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)
```
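For what it's worth, here's a quick round-trip check I can use to sanity-check these two helpers in isolation (illustrative only; the test tone and tolerances are arbitrary, and it assumes the two functions above are in scope):

```python
import audioop

import numpy as np


def roundtrip_sanity_check():
    # 8kHz mu-law -> 24kHz float32 -> 8kHz mu-law round trip on a test tone.
    sr = 8000
    t = np.arange(sr) / sr  # one second of audio
    tone = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
    mulaw = audioop.lin2ulaw(tone.tobytes(), 2)

    pcm_24k = mulaw_to_openai_pcm(mulaw)
    assert pcm_24k.dtype == np.float32
    assert abs(len(pcm_24k) - 3 * len(tone)) < 10  # 8kHz -> 24kHz triples the sample count
    assert np.max(np.abs(pcm_24k)) <= 1.0          # stays in the expected float range

    back = openai_audio_to_twilio_mulaw(pcm_24k)
    assert abs(len(back) - len(mulaw)) < 10        # one mu-law byte per 8kHz sample
```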
PS: I'm adding the processed Twilio audio to the StreamedAudioInput as soon as I receive it; maybe it's got something to do with that?
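In case that's relevant, one variant I could try is buffering a few Twilio frames before pushing them into the pipeline, roughly like this (`on_twilio_media` is just a placeholder name and the ~100ms threshold is arbitrary):

```python
import base64

# Hypothetical variant: accumulate ~100ms of 8kHz mu-law (1 byte per sample)
# before converting and pushing, instead of one ~20ms frame at a time.
mulaw_buffer = bytearray()


async def on_twilio_media(payload_b64: str, audio_input):
    mulaw_buffer.extend(base64.b64decode(payload_b64))
    if len(mulaw_buffer) >= 800:  # 800 bytes == 100ms at 8kHz mu-law
        chunk = bytes(mulaw_buffer)
        mulaw_buffer.clear()
        await audio_input.add_audio(mulaw_to_openai_pcm(chunk))
```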