Add CUDA architecture 120 support for RTX 5090 GPUs #26226
Closed
Conversation
Co-authored-by: snnn <856316+snnn@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu]" to "Add CUDA architecture 120 support for RTX 5090 GPUs" on Oct 3, 2025
snnn reviewed on Oct 3, 2025
```diff
 docker run -e SYSTEM_COLLECTIONURI --rm --volume \
   $BUILD_SOURCESDIRECTORY:/onnxruntime_src --volume $BUILD_BINARIESDIRECTORY:/build -e NIGHTLY_BUILD onnxruntimecuda${CUDA_VERSION_MAJOR}build \
-  /bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
+  /bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual;120a-real' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
```
Contributor
We didn't have 120a there. Why did it work before the 1.23.0 release?
Even if the package becomes too big, can you publish a version as a release here that includes all the CUDA archs?
Contributor
120a-real needs CUDA 12.8 or above in the CI pipeline, so this PR will not work as-is.
The 1.22 release used 90-virtual; 1.23 uses 90a-virtual. That could be the root cause. We shall make a change to use 90-virtual.
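One way to confirm which architectures a shipped wheel actually embeds is to dump the SASS/PTX lists from the CUDA provider library. A sketch (the library path assumes a Linux onnxruntime-gpu install; adjust for your environment):

```bash
# Sketch: list the SASS/PTX architectures embedded in the shipped CUDA provider.
LIB="$(python3 -c 'import onnxruntime, os; print(os.path.dirname(onnxruntime.__file__))')/capi/libonnxruntime_providers_cuda.so"
cuobjdump --list-elf "$LIB" | grep -Eo 'sm_[0-9]+a?'      | sort -u  # native SASS, e.g. sm_90a
cuobjdump --list-ptx "$LIB" | grep -Eo 'compute_[0-9]+a?' | sort -u  # PTX; compute_90a cannot be JIT-compiled for Blackwell
```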
tianleiwu added a commit that referenced this pull request on Oct 7, 2025
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:

```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This occurs because the RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), while `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible variant of Hopper, which makes it incompatible with future GPU generations such as Blackwell.

This change reverts `90a-virtual` back to `90-virtual`, as used in 1.22, restoring compatibility with Blackwell GPUs.

FPA_INTB_GEMM is disabled by default; it needs some extra work to be compatible with the 90-virtual (no 90a-real) use case.

Related: #26002 #26226 #26181
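To confirm the compute capability a given card reports, something like the following works (a sketch, assuming a driver recent enough to support the `compute_cap` query field):

```bash
# Sketch: report the GPU's CUDA compute capability (RTX 5090 should show 12.0).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```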
apsonawane pushed a commit that referenced this pull request on Oct 17, 2025
apsonawane pushed a commit that referenced this pull request on Oct 20, 2025
fs-eire pushed a commit that referenced this pull request on Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request on Nov 2, 2025
Problem
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:
This occurs because the RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), which was not included in the packaged builds due to PyPI size constraints. The current onnxruntime-gpu packages only include architectures 52, 61, 75, 86, 89, and 90a-virtual.
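For reference, a minimal way to surface this failure mode is to run any CUDA-launched op on the affected GPU. The following is a sketch, assuming `onnx` and `onnxruntime-gpu` are installed and that the GPU's SM is covered by neither embedded SASS nor JIT-compilable PTX:

```bash
python3 - <<'PY'
import numpy as np
import onnx
from onnx import helper, TensorProto
import onnxruntime as ort

# Build a trivial one-node model so the CUDA EP has a kernel to launch.
node = helper.make_node("Relu", ["x"], ["y"])
graph = helper.make_graph(
    [node], "repro",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [4])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [4])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])

sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CUDAExecutionProvider"])
# On an unsupported SM, the first run fails with cudaErrorNoKernelImageForDevice.
print(sess.run(None, {"x": np.ones(4, dtype=np.float32)}))
PY
```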
Solution
This PR adds CUDA compute architecture 120 with accelerated features (`120a-real`) to all packaging build configurations. The format `120a-real` is used because:

- `120` = CUDA compute capability 12.0 (RTX 5090)
- `a` suffix = enables accelerated features (WGMMA, TMA, setmaxnreg) for SM >= 90, as defined in `cmake/external/cuda_configuration.cmake`
- `-real` suffix = compiles for specific hardware (vs. `-virtual` for PTX)
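Roughly, these entries correspond to nvcc `-gencode` pairs along the following lines (a sketch of the usual CMake-to-nvcc translation, not the exact flags these pipelines emit; `kernel.cu` is a hypothetical source file):

```bash
# Sketch of the CMAKE_CUDA_ARCHITECTURES -> nvcc translation (illustrative only):
#   90-real     ->  -gencode arch=compute_90,code=sm_90        # SASS for SM 9.0
#   90-virtual  ->  -gencode arch=compute_90,code=compute_90   # PTX, JIT-compilable on newer GPUs
#   90a-real    ->  -gencode arch=compute_90a,code=sm_90a      # Hopper-only, not forward-compatible
#   120a-real   ->  -gencode arch=compute_120a,code=sm_120a    # Blackwell; requires CUDA >= 12.8
nvcc -gencode arch=compute_120a,code=sm_120a -c kernel.cu -o kernel.o
```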
Changes

7 files modified, with minimal changes to `CMAKE_CUDA_ARCHITECTURES` definitions in packaging pipelines:

- `tools/ci_build/github/azure-pipelines/stages/py-gpu-packaging-stage.yml` - Windows Python wheels
- `tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml` - Custom NuGet packages
- `tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml` - Windows CUDA/TensorRT NuGet packages
- `tools/ci_build/github/linux/build_linux_python_package.sh` - Linux Python wheels
- `tools/ci_build/github/linux/build_cuda_c_api_package.sh` - Linux CUDA C API packages
- `tools/ci_build/github/linux/build_nodejs_package.sh` - Linux Node.js packages
- `tools/ci_build/github/linux/build_tensorrt_c_api_package.sh` - Linux TensorRT C API packages

CI/test pipelines targeting specific hardware were intentionally left unchanged, as they are not used for distribution.
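For anyone who needs Blackwell support before a release containing this change ships, a from-source build along these lines should work. This is a sketch based on the build flags shown in the diff above; the CUDA paths and the CUDA 12.8 requirement are assumptions:

```bash
# Sketch: build onnxruntime-gpu from source with a Blackwell-capable arch list.
# sm_120a requires CUDA 12.8+; cuda/cudnn paths below are assumptions.
python3 tools/ci_build/build.py \
  --build_dir build --config Release --parallel \
  --build_shared_lib --build_wheel --use_cuda \
  --cuda_home /usr/local/cuda-12.8 --cudnn_home /usr/local/cuda-12.8 \
  --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=90-virtual;120a-real'
```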
Impact
After these changes are built and released, RTX 5090 users will be able to run ONNX Runtime GPU workloads without the "no kernel image is available" error.
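Once a wheel built with the new arch list is installed, a quick sanity check is to confirm the CUDA execution provider is exposed:

```bash
# Sketch: verify that the installed wheel exposes the CUDA execution provider.
python3 - <<'PY'
import onnxruntime as ort
print(ort.__version__)
print(ort.get_available_providers())  # expect 'CUDAExecutionProvider' in the list
PY
```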
Fixes #10028 (referenced issue in ComfyUI)
Co-authored-by: @snnn
Original prompt
This section details the original issue you should resolve
<issue_title>no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu]</issue_title>
<issue_description>Hello.
At first I created a thread on ComfyUI: https://github.com/comfyanonymous/ComfyUI/issues/10028
There I found out that other people have the same issue, and that it's related to onnxruntime.
Error:

```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

So I have installed:
Some people said that it's probably a problem between onnxruntime and the RTX 50 GPU series.
Comfy support also replied to me with the following:

# ComfyUI Error Report
Error Details
Stack Trace

```
File "C:\comfy\ComfyUI\execution.py", line 496, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\execution.py", line 315, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\execution.py", line 289, in _async_map_node_over_list
    await process_inputs(input_dict, i)
File "C:\comfy\ComfyUI\execution.py", line 277, in process_inputs
    result = f(**inputs)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 91, in estimate_pose
    out = common_annotator_call(func, image, include_hand=detect_hand, include_face=detect_face, include_body=detect_body, image_and_json=True, resolution=resolution, xinsr_stick_scaling=scale_stick_for_xinsr_cn)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\utils.py", line 85, in common_annotator_call
    np_result = model(np_image, output_type="np", detect_resolution=detect_resolution, **kwargs)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 87, in func
    pose_img, openpose_dict = model(image, **kwargs)
    ~~~~~^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\__init__.py", line 266, in __call__
    poses = self.detect_poses(input_image)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\__init__.py", line 255, in detect_poses
    keypoints_info = self.dw_pose_estimation(oriImg.copy())
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\wholebody.py", line 93, in __call__
    det_result = inference_onnx_yolox(self.det, oriImg, detect_classes=[0], dtype=np.float32)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\dw_onnx\cv_ox_det.py", line 104, in inference_detector
    output = session.run(None, {input_name: input})
File "c:\comfy\.venv\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 275, in run
    return self._sess.run(output_names, input_feed, run_options)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
System Information
Devices
Log...