
[Feat] Add audio benchmarking support /v1/audio/transcriptions #99


Closed · b8zhong wants to merge 1 commit

Conversation

@b8zhong commented May 13, 2025

Add Audio Transcription Benchmarking

vLLM has supported Whisper since vllm-project/vllm#12909, and TensorRT-LLM has a Whisper example at https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.19/examples/whisper (I haven't personally tried the latter).

This PR adds support for benchmarking ASR models via the /v1/audio/transcriptions endpoint.

Changes:

  • New openai-audio backend
  • ASRDataset class for loading and preparing ASR samples from Hugging Face datasets (e.g., LibriSpeech, Common Voice, AMI), including temporary file management; mostly lifted from vLLM. (A rough sketch of this loader follows the list.)
  • CLI arguments (--audio-dataset-name, etc.) for ASR data configuration.
  • Unfortunately, I had to modify parts of RequestFuncInput, main.py, and Client.py to integrate the audio pipeline.
  • Added librosa, soundfile, and datasets dependencies. These could be moved to an optional [audio] extra if preferred.
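
For reference, here is a minimal sketch of what a loader along these lines might look like. The class name ASRDataset matches the description above, but the field names, method names, and cleanup handling are illustrative assumptions, not the exact code in this diff:

import tempfile
from dataclasses import dataclass, field
from typing import Optional

import soundfile as sf
from datasets import load_dataset

@dataclass
class ASRDataset:
    dataset_name: str                       # e.g. "edinburghcstr/ami"
    config: Optional[str] = None            # e.g. "ihm"
    split: str = "test"
    duration_limit: Optional[float] = None  # skip clips longer than this (seconds)
    max_samples: Optional[int] = None
    tmp_files: list = field(default_factory=list)

    def samples(self):
        """Yield (audio_path, transcript) pairs, writing each clip to a
        temporary WAV file so it can be uploaded as multipart form data."""
        ds = load_dataset(self.dataset_name, self.config, split=self.split, streaming=True)
        yielded = 0
        for row in ds:
            audio = row["audio"]  # HF audio feature: {"array", "sampling_rate", ...}
            duration = len(audio["array"]) / audio["sampling_rate"]
            if self.duration_limit is not None and duration > self.duration_limit:
                continue
            tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
            sf.write(tmp.name, audio["array"], audio["sampling_rate"])
            self.tmp_files.append(tmp.name)  # caller cleans these up after the run
            yield tmp.name, row.get("text", "")
            yielded += 1
            if self.max_samples is not None and yielded >= self.max_samples:
                break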

Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: @vincentzed

Example:

fib benchmark \
    --backend openai-audio \
    --model "openai/whisper-large-v3-turbo" \
    --base-url "http://localhost:8000" \
    --endpoint "/v1/audio/transcriptions" \
    --tokenizer "openai/whisper-large-v3-turbo" \
    \
    --audio-dataset-name "edinburghcstr/ami" \
    --audio-dataset-config "ihm" \
    --audio-dataset-split "test" \
    --audio-language "en" \
    --audio-duration-limit 29.5 \
    --audio-max-samples 500 \
    \
    --num-of-req 500

============ Serving Benchmark Result ============
Successful requests:                     500       
Benchmark duration (s):                  14.59     
Total input tokens:                      3500      
Total generated tokens:                  3050      
Request throughput (req/s):              34.27     
Input token throughput (tok/s):          239.88    
Output token throughput (tok/s):         209.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          8803.45   
Median TTFT (ms):                        9122.14   
P99 TTFT (ms):                           12670.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.01      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.02      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P99 ITL (ms):                            0.00      
==================================================
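
For context, the request behind the openai-audio backend is the OpenAI-compatible transcription API that vLLM serves: a multipart POST of the audio file plus form fields. A minimal standalone sketch, reusing the URL and model from the command above ("sample.wav" is a placeholder; the actual backend's payload handling may differ):

import requests

# Upload one clip to the OpenAI-compatible transcription endpoint.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "openai/whisper-large-v3-turbo", "language": "en"},
    )
resp.raise_for_status()
print(resp.json()["text"])  # the transcription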

@b8zhong (Author) commented May 14, 2025

Maybe cc @benchislett, thanks in advance 👍

@xinli-centml (Contributor) commented:

Thanks! @b8zhong, sorry for the delayed review.

calculate_metrics(output["inputs"], output["outputs"], output["time"], tokenizer, output["stream"])
simplified_inputs = None
if args.backend == "openai-audio":
    simplified_inputs = [(req["prompt"], req["prompt_len"], req["output_len"]) for req in prepared_requests_data]

This if/else looks the same in both branches

"stream": not args.disable_stream,
}

if args.output_file:
filename = args.output_file
if args.num_of_imgs_per_req:
w, h = args.img_ratios_per_req[idx]

Was this code moved, or intentionally removed? If the latter, for what reason?

@xinli-centml (Contributor) commented:

Hi @b8zhong, thanks a lot for the contribution. We don't currently plan to support audio models for benchmarking, so adding this is a bit premature; we will reopen this PR when audio support is added to our inference engine.

@b8zhong (Author) commented Jun 8, 2025

No problem, thanks Xin + Benjamin for reviewing anyway 👍
