server : support audio input #13714

Merged: ngxson merged 2 commits into ggml-org:master on May 23, 2025

Conversation

ngxson (Collaborator) commented on May 22, 2025

Continuation of #13623

Pre-quantized models (the 8B model is recommended; it has much better quality than the 1B):

llama-server -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
llama-server -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

Try it via the web UI, e.g. summarizing this Fireship video:

(screenshot of the web UI)

OAI-compat API:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "transcribe this audio"
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": ".....(base64 encoded audio data).....",
            "format": "mp3"
          }
        }
      ]
    }
  ]
}
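
For reference, a minimal Python sketch of calling this endpoint with the standard library; the host/port (llama-server defaults to 8080) and the file name are assumptions, not part of this PR:

import base64
import json
import urllib.request

# Read a local audio file and base64-encode it for the "data" field above.
with open("sample.mp3", "rb") as f:  # hypothetical example file
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "transcribe this audio"},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ]
}

# POST to the OAI-compatible chat completions endpoint of a locally running llama-server.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])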

github-actions bot added the python (python script changes) label on May 22, 2025
ddiddi commented on May 22, 2025

LGTM!

ngxson merged commit 9ecf3e6 into ggml-org:master on May 23, 2025. 48 checks passed.
AeneasZhu commented:

Excuse me, my device couldn't launch llama-server when loading ultravox. Here is my command:
./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090

And it terminated and showed:

(screenshots of the error output)

Any solution please?

kth8 commented on May 23, 2025

I compiled the server myself and it started without problems, although when I tried to upload a 35-minute mp3 I got this error:

(screenshot of the error)

so I switched to the following script to have the server process the ~13.2k tokens for this mp3:

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "openai",
# ]
# ///
from base64 import b64encode
from openai import OpenAI

audio_path = "shortstory034_aladdinandthemagiclamp_llf_64kb.mp3"

# Base64-encode the mp3 so it can be sent as an "input_audio" content part.
with open(audio_path, "rb") as file:
    audio_base64 = b64encode(file.read()).decode('utf-8')

# Point the OpenAI client at the local llama-server OAI-compatible endpoint.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="llamacpp")

completion = client.chat.completions.create(
    model="ultravox",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this story."},
                {"type": "input_audio", "input_audio": {"data": audio_base64, "format": "mp3"}}
            ]
        }
    ],
    stream=True
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush=True)
print()

ngxson (Collaborator, Author) commented on May 23, 2025

@AeneasZhu can you try --no-mmproj-offload to see if it works? It will run the audio encoder on the CPU. Also add --verbose to get more detailed logs.
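
For example, the command above with both flags added (same model paths, just a sketch of what to try):

./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090 --no-mmproj-offload --verbose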

@kth8 I doubt the model was trained on audio that long, but we can remove the 10 MB restriction on the frontend if needed.

jvmx commented on May 23, 2025

Awesome! Is there a way to use this to transcribe a live audio stream, or does it have to be a complete, pre-made audio file?

ngxson (Collaborator, Author) commented on May 23, 2025

These models are text-audio-to-text, not ASR, so I don't think they are trained or optimized for streamed, real-time transcription.

kth8 commented on May 23, 2025

@kth8 I doubt the model was trained on audio that long, but we can remove the 10 MB restriction on the frontend if needed.

Oh, what is the practical limit of this model? The model card doesn't mention any best practices. Does it work best if I limit the audio length to ~5 minutes, like in your example?

sinand99 commented:

Excuse me, my device couldn't launch llama-server when loading ultravox. Here is my command: ./llama-server -m models/Llama-31-8B.gguf --mmproj models/mmproj/mmproj-ultravox-v0_5-llama-3_1.gguf -ngl 33 --port 8090

And it terminated and showed:

(screenshots of the error output)

Any solution please?

I have exactly the same problem with the same GPU. I created a new issue.

ngxson (Collaborator, Author) commented on May 23, 2025

@kth8 there is no clear limit for the model.

The 10 MB limit is frontend-only; it's there because we don't want users to accidentally upload a multi-gigabyte file, which would crash the web page. But we can remove it anyway.

robinnarsinghranabhat commented:

@ngxson I've been following you for a while, since your smol-vlm release.

What would it take to support Qwen2-Audio Instruct? It seems to have a similar interface to the model above, and the default Hugging Face Python implementation is too slow for my use case.

I wanted to understand llama.cpp and contribute by taking this on. Are there any helpful PRs I could refer to for this? Any guidance is highly appreciated!

Just an open-source lover wanting to help!

Labels: examples, python (python script changes), server
8 participants