
Multi-threaded GPU inferencing failing with whisper-small: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node #21413

Open
david-sitsky opened this issue Jul 19, 2024 · 7 comments
Labels
api:Java (issues related to the Java API) · ep:CUDA (issues related to the CUDA execution provider)

Comments

@david-sitsky

Describe the issue

I created a Whisper ONNX model using https://github.com/microsoft/Olive/blob/main/examples/whisper/README.md, specifically using these commands on a machine with a GPU:

cd Olive/examples/whisper
python3 prepare_whisper_configs.py --model_name openai/whisper-small --multilingual --no_audio_decoder

olive run --config whisper_gpu_fp32.json --setup
olive run --config whisper_gpu_fp32.json

In my application, I break up large audio files into chunks and then execute each chunk against the Whisper model for transcription using threads.

On a CPU machine (with a model generated on a non-GPU machine) this works fine. However, on a GPU machine it fails.

To reproduce

On a g5.4xlarge instance in AWS, using Ubuntu 22.04, CUDA 11.8, Nvidia A10G GPU, the following program reproduces the error:

from multiprocessing.pool import ThreadPool
import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path
import numpy as np
import librosa

# Register the onnxruntime-extensions custom ops and create a single
# InferenceSession on the CUDA execution provider, shared by all threads.
ort_session_options = onnxruntime.SessionOptions()
ort_session_options.register_custom_ops_library(_lib_path())

ort_session = onnxruntime.InferenceSession("whisper_small.onnx", ort_session_options, providers=['CUDAExecutionProvider'])

def infer(i):
    audio_blob, _ = librosa.load("speech.wav")
    inputs = {
        "audio_pcm": np.expand_dims(audio_blob, axis=0),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.expand_dims(np.asarray([50258, 50360], dtype=np.int32), axis=0)
    }

    return ort_session.run(None, inputs)[0]

THREAD_NUMBER = 10
pool = ThreadPool(THREAD_NUMBER)

for i in range(10):
    # map() blocks until all THREAD_NUMBER concurrent runs have finished
    pool.map(infer, range(THREAD_NUMBER))
    print(f"Iteration: {i} done")

pool.close()
pool.join()

An example of an error run:

$ python3 tryme.py 
2024-07-18 23:30:42.174926508 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph_beam-search-test_og_BpeDecoder_2 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.176827822 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 4 Memcpy nodes are added to the graph decoder subgraph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.182986135 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-18 23:30:42.183008286 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-18 23:30:46.346349145 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
2024-07-18 23:30:46.346977146 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Traceback (most recent call last):
  File "/home/ubuntu/ugh/tryme.py", line 33, in <module>
    pool.map(infer, range(THREAD_NUMBER))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/ubuntu/ugh/tryme.py", line 27, in infer
    ort_session.run(None, inputs)[0]
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Exception ignored in: <function Pool.__del__ at 0x7ae33d8b57e0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'

The program works fine when THREAD_NUMBER=1.

The speech.wav file can be downloaded from https://resources.djl.ai/audios/speech.wav.

If it helps, I can try to upload the model I generated somewhere, but it is 1.1 GB in size.

Urgency

This is a blocker for deploying our application, so it is urgent. We are actually using the Java bindings of onnxruntime with https://github.com/deepjavalibrary/djl-serving, which hits this very issue when it receives concurrent requests. I wrote the Python program above so the problem is easier to reproduce, but it produces exactly the same error message.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu-1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

@github-actions github-actions bot added the api:Java and ep:CUDA labels Jul 19, 2024
@david-sitsky
Author

To save time, the whisper_small.onnx file can be downloaded from here: https://drive.google.com/file/d/10yDz-VI-iKsszNgyOjbRqYTfwUHap_a-/view?usp=sharing.

@tianleiwu
Contributor

tianleiwu commented Jul 19, 2024

It is a known issue that some operators are not thread-safe. For example, the Attention and MultiHeadAttention operators used in the Whisper encoder are not thread-safe. You may try setting the environment variable ORT_DISABLE_FUSED_ATTENTION=1. However, that will increase latency since some fused attention kernels are disabled.
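
For reference, a minimal sketch of that first workaround, assuming the variable takes effect when set from Python before the session is created (exporting it in the shell before launching the process is the safer option):

```python
import os

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

# Assumption: the variable is read when the CUDA attention kernels are selected,
# so set it before the InferenceSession is created (or export it in the shell).
os.environ["ORT_DISABLE_FUSED_ATTENTION"] = "1"

opts = onnxruntime.SessionOptions()
opts.register_custom_ops_library(_lib_path())
session = onnxruntime.InferenceSession(
    "whisper_small.onnx", opts, providers=["CUDAExecutionProvider"]
)
```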

Another workaround is to use a different session per thread, which avoids the thread-safety issue but probably won't help performance, since multiple sessions compete for the same GPU resources.
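
A sketch of that session-per-thread workaround (illustrative only; note that each session loads its own copy of the ~1.1 GB model into GPU memory):

```python
import threading

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

_tls = threading.local()

def get_session():
    # Lazily create one InferenceSession per worker thread so no mutable
    # operator state is ever shared between threads.
    if not hasattr(_tls, "session"):
        opts = onnxruntime.SessionOptions()
        opts.register_custom_ops_library(_lib_path())
        _tls.session = onnxruntime.InferenceSession(
            "whisper_small.onnx", opts, providers=["CUDAExecutionProvider"]
        )
    return _tls.session

def infer(inputs):
    return get_session().run(None, inputs)[0]
```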

What is the reason to use multiple threads with the same session? That usually has no performance benefit. Instead, you can try increasing the batch size to see whether it improves throughput.
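
And a sketch of the batching alternative, assuming the Olive-exported model accepts a leading batch dimension on audio_pcm and that the chunks are padded to a common length (both assumptions, not verified here); the other inputs follow the repro script above:

```python
import numpy as np

def infer_batch(audio_chunks, ort_session):
    # Pad each 1-D chunk to the longest one so they can be stacked into a single
    # [batch, samples] tensor and transcribed in one session.run call.
    max_len = max(len(c) for c in audio_chunks)
    batch = np.stack([np.pad(c, (0, max_len - len(c))) for c in audio_chunks])
    inputs = {
        "audio_pcm": batch.astype(np.float32),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.tile(
            np.asarray([[50258, 50360]], dtype=np.int32), (len(audio_chunks), 1)
        ),
    }
    return ort_session.run(None, inputs)[0]
```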

@david-sitsky
Author

Sadly, setting that environment variable did not help. I still see similar errors, but also new ones that look even worse:

2024-07-19 05:32:03.173536756 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=312 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop, UseTF32()); 
Aborted (core dumped)

I definitely want to use the same session so the model is only loaded into GPU memory once. It is true that batching can be used here instead, and I'll look into that.

I was initially looking at threading since the Whisper pre-processing, which reads the audio data and converts it into the appropriate format (padding, then log-mel spectrogram conversion), is CPU-only work that could be parallelised. With a batching approach this pre-processing work will be serialised.
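
For what it's worth, a sketch of that split: decode the chunks in parallel on the CPU, then keep every session.run call on a single thread (make_inputs here is a hypothetical helper that builds the input dict shown in the repro script):

```python
from multiprocessing.pool import ThreadPool

import librosa
import numpy as np

def preprocess(path):
    # CPU-only work: decode and resample one audio chunk.
    audio, _ = librosa.load(path)
    return np.expand_dims(audio, axis=0)

def transcribe_all(chunk_paths, ort_session, make_inputs):
    with ThreadPool(8) as pool:
        prepared = pool.map(preprocess, chunk_paths)  # parallel, CPU only
    # Serialise the GPU work: only this thread ever calls session.run.
    return [ort_session.run(None, make_inputs(audio))[0] for audio in prepared]
```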

The documents I read indicated that the ORT session is thread-safe. Since this is apparently not the case, is there a documented list of the operators that are not?

@tianleiwu
Contributor

@david-sitsky, you may try using multiple threads for the CPU pre-processing to see whether that helps.

The new error indicates that other places in the CUDA provider have thread-safety issues. It may take time to nail down the root cause.

A related older issue: #18806

tianleiwu added a commit that referenced this issue Jul 22, 2024
### Description
- [x] Rewrite FusedMHARunnerFP16v2 to make it thread-safe. 
- [x] Add multi-threading tests

Previously, the kernel parameters `params` were stored as a member of the MHA runner, which means that different threads might change the params at the same time and affect each other.

For example, if batch_size and seq_len were changed by another thread to larger values in setup(...), a buffer overrun might happen in run(...) because a kernel could read/write memory outside the range of the allocated buffers.

In the new implementation, I changed the API and removed the mutable member variables to make it thread-safe. Below is a summary of the change:

Before:
```
class FusedMHARunnerFP16v2::mhaImpl {
   void setup(int seq_len, int batch_size) {
      // change scalar params
   }

   void run(input, output) {
      // change params for input and output pointers
      // launch kernel using params
   }

   Fused_multihead_attention_params_v2 params; // mutable, not thread-safe
}
```

After:
```
class FusedMHARunnerFP16v2::FmhaImpl {
   void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
      // change params
   }

   void run(params, input, output) {
      // change params with input and output pointers
      // launch kernel using params
   }
}
```

### Motivation and Context
#18854
#21413
@david-sitsky
Author

@tianleiwu - any ideas on next steps for the thread-safety issue? I am using djl-serving, so sometimes the server receives multiple unrelated requests from different clients against the same Whisper model, and I see the issue hit pretty easily. I am using OnnxRuntime 1.17.3.

2024-08-06 04:54:13.828882272 [E:onnxruntime:ort-java, cuda_call.cc:116 CudaCall] CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.828924164 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.829005579 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'

@tianleiwu
Contributor

tianleiwu commented Aug 7, 2024

@david-sitsky, the BeamSearch operator is not thread-safe since it has internal state.

One approach for serving is to have a queue for users' requests and batch those requests into a single onnxruntime inference call, which avoids multi-threading. I am not sure whether that is supported by djl-serving.
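
A sketch of that queue-and-batch pattern (illustrative only; build_batch_inputs is a hypothetical helper that pads/stacks the audio and builds the input dict as in the repro script, and djl-serving may already offer dynamic batching of its own):

```python
import queue

request_queue = queue.Queue()

def batching_worker(ort_session, build_batch_inputs, max_batch=8):
    # Single consumer: drain up to max_batch pending requests, run them as one
    # batched inference call, then hand each result back on its reply queue.
    # Assumes the first model output is indexed by batch along axis 0.
    while True:
        batch = [request_queue.get()]
        while len(batch) < max_batch and not request_queue.empty():
            batch.append(request_queue.get_nowait())
        outputs = ort_session.run(None, build_batch_inputs([r["audio"] for r in batch]))[0]
        for req, out in zip(batch, outputs):
            req["reply"].put(out)

def submit(audio):
    # Called by request-handling threads; blocks until this item's result is ready.
    reply = queue.Queue(maxsize=1)
    request_queue.put({"audio": audio, "reply": reply})
    return reply.get()
```

The worker would run in exactly one daemon thread, so only that thread ever touches the session.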

I could take a look at a design change to make it thread-safe. That might be targeted for the 1.20 release (3+ months away).

@david-sitsky
Author

Many thanks @tianleiwu - that would be great!
