server : support audio input #13714

Merged: ngxson merged 2 commits into ggml-org:master on May 23, 2025

Conversation

ngxson (Collaborator) commented on May 22, 2025

Continuation of #13623

Pre-quantized models (the 8B model is recommended; it has much better quality than the 1B):

llama-server -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

Try it via the web UI, e.g. summarizing this Fireship video:

(screenshot of the web UI)

OAI-compat API:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "transcribe this audio"
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": ".....(base64 encoded audio data).....",
            "format": "mp3"
          }
        }
      ]
    }
  ]
}
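
For reference, a minimal Python sketch of calling this endpoint with the standard library; the host/port (llama-server defaults to 8080) and the file name are assumptions, not part of this PR:

import base64
import json
import urllib.request

# Read a local audio file and base64-encode it for the "data" field above.
with open("sample.mp3", "rb") as f:  # hypothetical example file
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "transcribe this audio"},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ]
}

# POST to the OAI-compatible chat completions endpoint of a locally running llama-server.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])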

github-actions bot added the python (python script changes) label on May 22, 2025
ddiddi commented on May 22, 2025

LGTM!

ngxson merged commit 9ecf3e6 into ggml-org:master on May 23, 2025. 48 checks passed.
AeneasZhu commented:

Excuse me, my device couldn't launch llama-server when loading ultravox. Here is my command:
./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090

And it terminated and showed:

(screenshots of the error output)

Any solution please?

kth8 commented on May 23, 2025

I compiled the server myself and it started without problems, although when I tried to upload a 35-minute mp3 I got this error:

(screenshot of the error)

so I switched to the following script to have the server process the ~13.2k tokens for this mp3:

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "openai",
# ]
# ///
from base64 import b64encode
from openai import OpenAI

audio_path = "shortstory034_aladdinandthemagiclamp_llf_64kb.mp3"

# Base64-encode the mp3 so it can be sent as an "input_audio" content part.
with open(audio_path, "rb") as file:
    audio_base64 = b64encode(file.read()).decode('utf-8')

# Point the OpenAI client at the local llama-server OAI-compatible endpoint.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="llamacpp")

completion = client.chat.completions.create(
    model="ultravox",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this story."},
                {"type": "input_audio", "input_audio": {"data": audio_base64, "format": "mp3"}}
            ]
        }
    ],
    stream=True
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush=True)
print()

ngxson (Collaborator, Author) commented on May 23, 2025

@AeneasZhu can you try --no-mmproj-offload to see if it works? It will run the audio encoder on the CPU. Also add --verbose to get more detailed logs.
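
For example, the command above with both flags added (same model paths, just a sketch of what to try):

./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090 --no-mmproj-offload --verbose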

@kth8 I doubt the model was trained on audio that long, but we can remove the 10 MB restriction on the frontend if needed.

jvmx commented on May 23, 2025

Awesome! Is there a way to use this to transcribe a live audio stream, or does it have to be a complete, pre-made audio file?

ngxson (Collaborator, Author) commented on May 23, 2025

These models are text-audio-to-text, not ASR, so I don't think they are trained or optimized for streamed, real-time transcription.

kth8 commented on May 23, 2025

@kth8 I doubt the model was trained on audio that long, but we can remove the 10 MB restriction on the frontend if needed.

Oh, what is the practical limit of this model? The model card doesn't mention any best practices. Does it work best if I limit the audio length to ~5 minutes, like in your example?

sinand99 commented:

Excuse me, my device couldn't launch llama-server when loading ultravox. Here is my command: ./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090

And it terminated and showed:

(screenshots of the error output)

Any solution please?

I have exactly the same problem with the same GPU. I created a new issue.

ngxson (Collaborator, Author) commented on May 23, 2025

@kth8 there is no clear limit for the model.

The 10 MB limit is frontend-only; it's there because we don't want users to accidentally upload a multi-gigabyte file, which would crash the web page. But we can remove it anyway.

robinnarsinghranabhat commented:

@ngxson I've been following you for a while, since your smol-vlm release.

What would it take to support Qwen2-Audio Instruct? It seems to have a similar interface to the model above, and the default Hugging Face Python implementation is too slow for my use case.

I wanted to understand llama.cpp and contribute by taking this on. Are there any helpful PRs I could refer to for this? Any guidance is highly appreciated!

Just an open-source lover wanting to help!

Labels: examples, python (python script changes), server
8 participants