Description
Hello,
I have a simple ONNX model (89 KB) and I run inference with large batch sizes.
OS: CentOS 7
GPU: NVIDIA GeForce GTX 1080
CUDA: 11.0
cuDNN: 8.1.1
onnxruntime-gpu version: 1.7.0
Inference with onnxruntime-gpu runs smoothly up to a batch size of 65535, but as soon as the batch size exceeds 65535 I get the error below.
(NOTE: CPU inference of the same model with onnxruntime works fine, even with batch sizes > 1 million.)
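For reference, a minimal sketch of how I trigger the failure; the model path "model.onnx" and the feature dimension (32) are placeholders for my actual model:

import numpy as np
import onnxruntime as ort

# onnxruntime-gpu 1.7.0 picks the CUDA execution provider by default
sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

batch = 65536  # 65535 works; 65536 raises CUDNN_STATUS_NOT_SUPPORTED
x = np.random.rand(batch, 32).astype(np.float32)

sess.run(None, {input_name: x})  # fails on GPU, succeeds on CPU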
2021-03-22 16:59:54.200488858 [E:onnxruntime:Default, cuda_call.cc:119 CudaCall] CUDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=blipp73.sdp.research.bell-labs.com ; expr=cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_);
2021-03-22 16:59:54.200560113 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running BatchNormalization node. Name:'batch_normalization' Status Message: CUDNN error executing cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_)
Traceback (most recent call last):
File "onnxruntime_test.1.4.0.py", line 118, in
sys.exit(main())
File "onnxruntime_test.1.4.0.py", line 102, in main
sess.run([], feeds) # fetch all outputs
File "/home/tfs/venv_ORTGPU_test/lib64/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
return self.sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running BatchNormalization node. Name:'batch_normalization' Status Message: CUDNN error executing cudnnBatchNormalizationForwardInference( CudnnHandle(), cudnn_batch_norm_mode_, &alpha, &beta, data_desc, x_data, data_desc, y_data, bn_tensor_desc, scale_data, b_data, mean_data, var_data, epsilon_)
Upon investigation, I came across this MXNet discussion:
apache/mxnet#4997 (comment)
However, I could not verify the batch-size limit actually set in the cuDNN library. I am reporting this issue so that there is a record of it. If you could investigate and post the cuDNN-defined maximum batch size allowed for cudnnBatchNormalizationForwardInference, that would be great.
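In the meantime, a possible workaround (an untested sketch, assuming the model's first axis is the batch dimension and it has a single output) would be to split the input into chunks no larger than 65535 rows and concatenate the results:

import numpy as np

CHUNK = 65535  # largest batch size that worked on the GPU in my tests

def run_in_chunks(sess, input_name, x):
    """Run sess on x in slices of at most CHUNK along the batch axis."""
    parts = [sess.run(None, {input_name: x[i:i + CHUNK]})[0]
             for i in range(0, x.shape[0], CHUNK)]
    return np.concatenate(parts, axis=0)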
Thank you,
Buvana