
Non-zero status code returned while running BatchNormalization node #7095

Open

Description

Hello,

I have a simple ONNX model (89 KB) and I run inference on it with a large batch size.

OS: CentOS 7
GPU: NVIDIA GTX 1080
CUDA: 11.0
cuDNN: 8.1.1
onnxruntime-gpu version: 1.7.0

Inference with onnxruntime-gpu runs smoothly up to a batch size of 65535, but I start getting the following error when the batch size exceeds 65535.
(NOTE: CPU inference of my model with onnxruntime works fine even with batch sizes > 1 million.)
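
For reference, here is a minimal sketch of how my test script drives the session. The model path, input-name lookup, and input shape below are placeholders, not the exact values from my model:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path and shape; my real model is ~89 KB and contains a
# BatchNormalization node, but its exact architecture is not shown here.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

batch_size = 65536  # batch sizes <= 65535 work; anything larger fails on GPU
input_name = sess.get_inputs()[0].name
x = np.random.rand(batch_size, 8, 4, 4).astype(np.float32)

sess.run([], {input_name: x})  # fetch all outputs; raises the CUDNN error below
```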

2021-03-22 16:59:54.200488858 [E:onnxruntime:Default, cuda_call.cc:119 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=blipp73.sdp.research.bell-labs.com ; expr=cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2021-03-22 16:59:54.200560113 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running BatchNormalization node. Name:'batch_normalization' Status Message: CUDNN error executing cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_)
Traceback (most recent call last):
File "onnxruntime_test.1.4.0.py", line 118, in
sys.exit(main())
File "onnxruntime_test.1.4.0.py", line 102, in main
sess.run([], feeds) # fetch all outputs
File "/home/tfs/venv_ORTGPU_test/lib64/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
return self.sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'batch_normalization' Status Message: CUDNN error executing cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_)

Upon investigation, I came across this MXNet discussion:
apache/mxnet#4997 (comment)

However, I could not verify the batch-size limit set in the cuDNN library... I am reporting this issue so that there is a record. If you can investigate and post the cuDNN definition of the maximum batch size allowed for cudnnBatchNormalizationForwardInference, that would be great.
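
In the meantime, my workaround is to split the input into chunks of at most 65535 along the batch axis and concatenate the outputs. A sketch (run_in_chunks is a hypothetical helper, and the 65535 limit is just what I observed empirically, not a value confirmed from the cuDNN documentation):

```python
import numpy as np

OBSERVED_LIMIT = 65535  # empirical GPU limit from the runs above

def run_in_chunks(sess, input_name, x, limit=OBSERVED_LIMIT):
    """Run `sess` over `x` in slices of at most `limit` rows along the
    batch axis and concatenate the first output. Hypothetical helper."""
    pieces = []
    for start in range(0, x.shape[0], limit):
        chunk = x[start:start + limit]
        pieces.append(sess.run([], {input_name: chunk})[0])
    return np.concatenate(pieces, axis=0)
```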

Thank you,
Buvana


Metadata

Labels: ep:CUDA (issues related to the CUDA execution provider)