
Multi-threaded GPU inferencing failing with whisper-small: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node #21413

Open
david-sitsky opened this issue Jul 19, 2024 · 7 comments
Labels
api:Java (issues related to the Java API) · ep:CUDA (issues related to the CUDA execution provider)

Comments

@david-sitsky

Describe the issue

I created a Whisper ONNX model using https://github.com/microsoft/Olive/blob/main/examples/whisper/README.md, specifically using these commands on a machine with a GPU:

cd Olive/examples/whisper
python3 prepare_whisper_configs.py --model_name openai/whisper-small --multilingual --no_audio_decoder

olive run --config whisper_gpu_fp32.json --setup
olive run --config whisper_gpu_fp32.json

In my application, I break up large audio files into chunks and then execute each chunk against the Whisper model for transcription using threads.

On a CPU machine (with a model generated on a non-GPU machine) this works fine. However, on a GPU machine it fails.

To reproduce

On a g5.4xlarge instance in AWS, using Ubuntu 22.04, CUDA 11.8, Nvidia A10G GPU, the following program reproduces the error:

from multiprocessing.pool import ThreadPool
import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path
import numpy as np
import librosa

# Register the onnxruntime-extensions custom ops and create a single
# InferenceSession on the CUDA execution provider, shared by all threads.
ort_session_options = onnxruntime.SessionOptions()
ort_session_options.register_custom_ops_library(_lib_path())

ort_session = onnxruntime.InferenceSession("whisper_small.onnx", ort_session_options, providers=['CUDAExecutionProvider'])

def infer(i):
    audio_blob, _ = librosa.load("speech.wav")
    inputs = {
        "audio_pcm": np.expand_dims(audio_blob, axis=0),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.expand_dims(np.asarray([50258, 50360], dtype=np.int32), axis=0)
    }

    return ort_session.run(None, inputs)[0]

THREAD_NUMBER = 10
pool = ThreadPool(THREAD_NUMBER)

for i in range(10):
    # map() blocks until all THREAD_NUMBER concurrent runs have finished
    pool.map(infer, range(THREAD_NUMBER))
    print(f"Iteration: {i} done")

pool.close()
pool.join()

An example of an error run:

$ python3 tryme.py 
2024-07-18 23:30:42.174926508 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph_beam-search-test_og_BpeDecoder_2 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.176827822 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 4 Memcpy nodes are added to the graph decoder subgraph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.182986135 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-18 23:30:42.183008286 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-18 23:30:46.346349145 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
2024-07-18 23:30:46.346977146 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Traceback (most recent call last):
  File "/home/ubuntu/ugh/tryme.py", line 33, in <module>
    pool.map(infer, range(THREAD_NUMBER))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/ubuntu/ugh/tryme.py", line 27, in infer
    ort_session.run(None, inputs)[0]
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Exception ignored in: <function Pool.__del__ at 0x7ae33d8b57e0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'

The program works fine when THREAD_NUMBER=1.

The speech.wav file can be downloaded from https://resources.djl.ai/audios/speech.wav.

If it helps, I can try to upload the model I generated somewhere, but it is 1.1 GB in size.

Urgency

This is a blocker for deploying our application, so it is urgent. We are actually using the Java bindings of onnxruntime with https://github.com/deepjavalibrary/djl-serving, which hits this very issue when it receives concurrent requests. I wrote the Python program above so the problem is easier to reproduce, but it produces exactly the same error message.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu-1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

@github-actions github-actions bot added the api:Java and ep:CUDA labels Jul 19, 2024
@david-sitsky
Author

To save time, the whisper_small.onnx file can be downloaded from here: https://drive.google.com/file/d/10yDz-VI-iKsszNgyOjbRqYTfwUHap_a-/view?usp=sharing.

@tianleiwu
Contributor

tianleiwu commented Jul 19, 2024

It is a known issue that some operators are not thread-safe. For example, the Attention and MultiHeadAttention operators used in the Whisper encoder are not thread-safe. You may try setting the environment variable ORT_DISABLE_FUSED_ATTENTION=1. However, that will increase latency since some fused attention kernels are disabled.
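
For reference, a minimal sketch of that first workaround, assuming the variable takes effect when set from Python before the session is created (exporting it in the shell before launching the process is the safer option):

```python
import os

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

# Assumption: the variable is read when the CUDA attention kernels are selected,
# so set it before the InferenceSession is created (or export it in the shell).
os.environ["ORT_DISABLE_FUSED_ATTENTION"] = "1"

opts = onnxruntime.SessionOptions()
opts.register_custom_ops_library(_lib_path())
session = onnxruntime.InferenceSession(
    "whisper_small.onnx", opts, providers=["CUDAExecutionProvider"]
)
```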

Another workaround is to use a different session per thread, which avoids the thread-safety issue but probably won't help performance, since multiple sessions compete for the same GPU resources.
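
A sketch of that session-per-thread workaround (illustrative only; note that each session loads its own copy of the ~1.1 GB model into GPU memory):

```python
import threading

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

_tls = threading.local()

def get_session():
    # Lazily create one InferenceSession per worker thread so no mutable
    # operator state is ever shared between threads.
    if not hasattr(_tls, "session"):
        opts = onnxruntime.SessionOptions()
        opts.register_custom_ops_library(_lib_path())
        _tls.session = onnxruntime.InferenceSession(
            "whisper_small.onnx", opts, providers=["CUDAExecutionProvider"]
        )
    return _tls.session

def infer(inputs):
    return get_session().run(None, inputs)[0]
```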

What is the reason to use multiple threads with the same session? That usually has no performance benefit. Instead, you can try increasing the batch size to see whether it improves throughput.
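
And a sketch of the batching alternative, assuming the Olive-exported model accepts a leading batch dimension on audio_pcm and that the chunks are padded to a common length (both assumptions, not verified here); the other inputs follow the repro script above:

```python
import numpy as np

def infer_batch(audio_chunks, ort_session):
    # Pad each 1-D chunk to the longest one so they can be stacked into a single
    # [batch, samples] tensor and transcribed in one session.run call.
    max_len = max(len(c) for c in audio_chunks)
    batch = np.stack([np.pad(c, (0, max_len - len(c))) for c in audio_chunks])
    inputs = {
        "audio_pcm": batch.astype(np.float32),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.tile(
            np.asarray([[50258, 50360]], dtype=np.int32), (len(audio_chunks), 1)
        ),
    }
    return ort_session.run(None, inputs)[0]
```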

@david-sitsky
Author

Sadly, setting that environment variable did not help. I still see similar errors, but also new ones that look even worse:

2024-07-19 05:32:03.173536756 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=312 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop, UseTF32()); 
Aborted (core dumped)

I definitely want to use the same session so the model is only loaded into GPU memory once. It is true that batching can be used here instead, and I'll look into that.

I was initially looking at threading since the Whisper pre-processing, which reads the audio data and converts it into the appropriate format (padding, then log-mel spectrogram conversion), is CPU-only work that could be parallelised. With a batching approach this pre-processing work will be serialised.
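
For what it's worth, a sketch of that split: decode the chunks in parallel on the CPU, then keep every session.run call on a single thread (make_inputs here is a hypothetical helper that builds the input dict shown in the repro script):

```python
from multiprocessing.pool import ThreadPool

import librosa
import numpy as np

def preprocess(path):
    # CPU-only work: decode and resample one audio chunk.
    audio, _ = librosa.load(path)
    return np.expand_dims(audio, axis=0)

def transcribe_all(chunk_paths, ort_session, make_inputs):
    with ThreadPool(8) as pool:
        prepared = pool.map(preprocess, chunk_paths)  # parallel, CPU only
    # Serialise the GPU work: only this thread ever calls session.run.
    return [ort_session.run(None, make_inputs(audio))[0] for audio in prepared]
```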

The documents I read indicated that the ORT session is thread-safe. Since this is apparently not the case, is there a documented list of the operators that are not?

@tianleiwu
Contributor

@david-sitsky, you may try using multiple threads for the CPU pre-processing to see whether that helps.

The new error indicates that other places in the CUDA provider have thread-safety issues. It may take time to nail down the root cause.

A related older issue: #18806

tianleiwu added a commit that referenced this issue Jul 22, 2024
### Description
- [x] Rewrite FusedMHARunnerFP16v2 to make it thread-safe. 
- [x] Add multi-threading tests

Previously, the kernel parameters `params` were stored as a member of the MHA runner, which means that different threads might change the params at the same time and affect each other.

For example, if batch_size and seq_len were changed by another thread to larger values in setup(...), a buffer overrun might happen in run(...) because a kernel could read/write memory outside the range of the allocated buffers.

In the new implementation, I changed the API and removed the mutable member variables to make it thread-safe. Below is a summary of the change:

Before:
```
class FusedMHARunnerFP16v2::mhaImpl {
   void setup(int seq_len, int batch_size) {
      // change scalar params
   }

   void run(input, output) {
      // change params for input and output pointers
      // launch kernel using params
   }

   Fused_multihead_attention_params_v2 params; // mutable, not thread-safe
}
```

After:
```
class FusedMHARunnerFP16v2::FmhaImpl {
   void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
      // change params
   }

   void run(params, input, output) {
      // change params with input and output pointers
      // launch kernel using params
   }
}
```

### Motivation and Context
#18854
#21413
@david-sitsky
Author

@tianleiwu - any ideas on next steps for the thread-safety issue? I am using djl-serving, so sometimes the server receives multiple unrelated requests from different clients against the same Whisper model, and I see the issue hit pretty easily. I am using OnnxRuntime 1.17.3.

2024-08-06 04:54:13.828882272 [E:onnxruntime:ort-java, cuda_call.cc:116 CudaCall] CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.828924164 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.829005579 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'

@tianleiwu
Contributor

tianleiwu commented Aug 7, 2024

@david-sitsky, the BeamSearch operator is not thread-safe since it has internal state.

One approach for serving is to have a queue for users' requests and batch those requests into a single onnxruntime inference call, which avoids multi-threading. I am not sure whether that is supported by djl-serving.
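
A sketch of that queue-and-batch pattern (illustrative only; build_batch_inputs is a hypothetical helper that pads/stacks the audio and builds the input dict as in the repro script, and djl-serving may already offer dynamic batching of its own):

```python
import queue

request_queue = queue.Queue()

def batching_worker(ort_session, build_batch_inputs, max_batch=8):
    # Single consumer: drain up to max_batch pending requests, run them as one
    # batched inference call, then hand each result back on its reply queue.
    # Assumes the first model output is indexed by batch along axis 0.
    while True:
        batch = [request_queue.get()]
        while len(batch) < max_batch and not request_queue.empty():
            batch.append(request_queue.get_nowait())
        outputs = ort_session.run(None, build_batch_inputs([r["audio"] for r in batch]))[0]
        for req, out in zip(batch, outputs):
            req["reply"].put(out)

def submit(audio):
    # Called by request-handling threads; blocks until this item's result is ready.
    reply = queue.Queue(maxsize=1)
    request_queue.put({"audio": audio, "reply": reply})
    return reply.get()
```

The worker would run in exactly one daemon thread, so only that thread ever touches the session.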

I could take a look at a design change to make it thread-safe. That might be targeted for the 1.20 release (3+ months away).

@david-sitsky
Author

Many thanks @tianleiwu - that would be great!
