
[rocm7.0_internal_testing] Prevent static initialization of at::cuda::warp_size() #2293


Merged
merged 3 commits into rocm7.0_internal_testing from rocm7.0_IT_fix_warpsize on Jun 25, 2025

Conversation


@ethanwee1 ethanwee1 commented Jun 25, 2025

Fixes SWDEV-540240, SWDEV-540309, SWDEV-539989

Error

#24 437.7   what():  HIP error: no ROCm-capable device is detected
#24 437.7 HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
#24 437.7 For debugging consider passing AMD_SERIALIZE_KERNEL=3
#24 437.7 Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
#24 437.7 Exception raised from c10_hip_check_implementation at /pytorch/c10/hip/HIPException.cpp:44 (most recent call first):
#24 437.7 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7f272de18738 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
#24 437.7 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x55 (0x7f272ddb42ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
...
#24 437.7 frame #7: at::cuda::getCurrentDeviceProperties() + 0x9 (0x7f270b5874e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
#24 437.7 frame #8: at::cuda::warp_size() + 0x9 (0x7f270b587509 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
#24 437.7 frame #9: <unknown function> + 0x81ac8b (0x7f2709c27c8b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)

Explanation

80cca70 introduced a static global variable whose initializer calls at::cuda::warp_size(), which requires a visible GPU to query device properties. CPU-only build systems have no GPU, so loading the library fails with the HIP error above.
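
For illustration only, a minimal sketch of the failing pattern; the name kMaxBlockThreads and the 4x multiplier are assumptions, not the identifier actually introduced by 80cca70:

```cpp
#include <ATen/cuda/CUDAContext.h>

namespace {
// Namespace-scope static: its initializer runs when libtorch_hip.so is loaded.
// at::cuda::warp_size() calls getCurrentDeviceProperties(), which throws
// "no ROCm-capable device is detected" on a CPU-only build machine.
static const int kMaxBlockThreads = 4 * at::cuda::warp_size();
} // namespace
```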

Solution

Convert the static variable into a static function, so the warp size is queried lazily on first call instead of during static initialization.
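
A minimal sketch of that change, again with a hypothetical name; the value is computed on the first call rather than at library load:

```cpp
#include <ATen/cuda/CUDAContext.h>

namespace {
// Nothing runs at library load; the device query happens lazily, the first
// time a caller (already running on a machine with a GPU) needs the value.
static int max_block_threads() {
  static const int value = 4 * at::cuda::warp_size();
  return value;
}
} // namespace
```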

Validation

http://rocm-ci.amd.com/job/pyt_whl_docker_mainline/1461/artifact/build_artifacts.txt/*view*/

Ran microbenchmark to confirm basic functionality:

root@ubb4-rack-22:/var/lib/jenkins/pytorch-micro-benchmarking# python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.10158218145370483
Throughput [img/sec] : 630.0317544289736

@rocm-repo-management-api

rocm-repo-management-api bot commented Jun 25, 2025

Jenkins build for 9991022d48d5480423fc3dc1d3b0fb93cdaa638a commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-repo-management-api

Jenkins build for 9991022d48d5480423fc3dc1d3b0fb93cdaa638a commit is in progress
Links: Blue Ocean view / Build artifacts

Collaborator

@jeffdaily jeffdaily left a comment

I like it. Please submit upstream PR as well.

@jeffdaily
Collaborator

jeffdaily commented Jun 25, 2025

> I like it. Please submit upstream PR as well.

Actually, can you upstream this in combination with 80cca70?

@jeffdaily jeffdaily marked this pull request as ready for review June 25, 2025 20:43
@jithunnair-amd jithunnair-amd changed the title [Rocm7.0_internal_testing] Fix warpsize for 80cca7006d94df97ee932fd5903ed20c08c2eb34 [rocm7.0_internal_testing] Fix warpsize for 80cca7006d94df97ee932fd5903ed20c08c2eb34 Jun 25, 2025
@jithunnair-amd jithunnair-amd changed the title [rocm7.0_internal_testing] Fix warpsize for 80cca7006d94df97ee932fd5903ed20c08c2eb34 [rocm7.0_internal_testing] Prevent static initialization of at::cuda::warp_size() Jun 25, 2025
@jithunnair-amd jithunnair-amd merged commit 944be5a into rocm7.0_internal_testing Jun 25, 2025
0 of 3 checks passed
@jithunnair-amd jithunnair-amd deleted the rocm7.0_IT_fix_warpsize branch June 25, 2025 22:10