[SYCL][HIP] Unresolved Assert/.* test failures #7634

Open
againull opened this issue Dec 5, 2022 · 17 comments
Labels: bug (Something isn't working), cuda (CUDA back-end), hip (Issues related to execution on HIP backend)

Comments

@againull commented Dec 5, 2022

LIT testing on the HIP backend is failing with unresolved tests.

Unresolved Tests (5):
SYCL :: Assert/assert_in_kernels.cpp
SYCL :: Assert/assert_in_multiple_tus.cpp
SYCL :: Assert/assert_in_multiple_tus_one_ndebug.cpp
SYCL :: Assert/assert_in_one_kernel.cpp
SYCL :: Assert/assert_in_simultaneously_multiple_tus_one_ndebug.cpp

Example:
https://github.com/intel/llvm/actions/runs/3616100865/jobs/6093960264
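
For context, these Assert/ tests exercise device-side assert. A minimal sketch of the pattern (hypothetical and condensed; the real code lives in assert_in_kernels.hpp) looks like this:

```cpp
// Hypothetical reconstruction of the pattern these tests exercise; not the
// actual test source. A device-side assert fires for some work-items and is
// expected to abort the host program with the "Assertion ... failed"
// diagnostics seen in the logs below.
#include <sycl/sycl.hpp>
#include <cassert>

int main() {
  sycl::queue Q;
  // Elements 0 and 2 violate the assertion, loosely matching the two
  // per-work-item diagnostics reported later in this thread.
  int Data[4] = {1, 0, 1, 0};
  {
    sycl::buffer<int, 1> Buf(Data, sycl::range<1>(4));
    Q.submit([&](sycl::handler &CGH) {
      sycl::accessor Acc(Buf, CGH, sycl::read_only);
      CGH.parallel_for(sycl::range<1>(4), [=](sycl::id<1> wiID) {
        // On a healthy backend this aborts the program and prints a
        // diagnostic for each failing work-item.
        assert(Acc[wiID] == 0 && "from assert statement");
      });
    });
    Q.wait();
  }
  return 0;
}
```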

Error:


UNRESOLVED: SYCL :: Assert/assert_in_kernels.cpp (1031 of 1031)
******************** TEST 'SYCL :: Assert/assert_in_kernels.cpp' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
File "/__w/llvm/llvm/lit/lit/worker.py", line 76, in _execute_test_handle_errors
result = test.config.test_format.execute(test, lit_config)
File "/__w/llvm/llvm/lit/lit/formats/shtest.py", line 27, in execute
return lit.TestRunner.executeShTest(test, litConfig,
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 2005, in executeShTest
return _runShTest(test, litConfig, useExternalSh, script, tmpBase)
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 1966, in _runShTest
output = """Script:\n--\n%s\n--\nExit Code: %d\n""" % (
TypeError: %d format: a number is required, not NoneType

@pvchupin commented Dec 7, 2022

I couldn't reproduce this problem on the same machine outside of CI, even with the same Docker image.
Runs are successful, with the following in assert_in_kernels.cpp.tmp.gpu.txt:

SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.                              
SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.                              
:0:rocdevice.cpp            :2672: 248509772177 us: 156615: [tid:0x7fa71f6f7700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016            

I logged into the machine while CI was running the test and found that all these tests hang for a few minutes with only the first two lines in assert_in_kernels.cpp.tmp.gpu.txt:

SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.                              
SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.                              

We run CPU, then GPU, then ACC, and the ACC log doesn't exist, so I assume execution got stuck on the GPU at that point.

Also, there is the following in dmesg:

[580980.987697] static-buffer-d[1046286]: segfault at 39 ip 00007f4b0fc7a871 sp 00007ffefa681d40 error 4 in libamdhip64.so.5.4.50400[7f4b0fbc1000+37e000]
[580980.987703] Code: 48 85 f6 0f 84 a6 00 00 00 4c 8d 2d 99 1c 8f 01 48 63 93 b0 00 00 00 4c 8d 25 bb 86 00 00 49 8b 45 00 48 8b 04 d0 48 8b 40 68 <48> 8b 40 18 48 8b 38 48 8b 07 48 8b 80 e8 00 00 00 4c 39 e0 0f 85

It seems that under some conditions test execution doesn't return normally and is killed from the outside.

@bader commented Dec 7, 2022

These tests seem to exceed the timeout limit:

Slowest Tests:

600.02s: SYCL :: Assert/assert_in_kernels.cpp
600.02s: SYCL :: Assert/assert_in_multiple_tus_one_ndebug.cpp
600.02s: SYCL :: Assert/assert_in_multiple_tus.cpp
600.02s: SYCL :: Assert/assert_in_one_kernel.cpp
600.02s: SYCL :: Assert/assert_in_simultaneously_multiple_tus_one_ndebug.cpp

600 seconds is the limit for a single test: https://github.com/intel/llvm-test-suite/blob/intel/SYCL/lit.cfg.py#L430-L435.

@bader commented Dec 7, 2022

> Also, there is the following in dmesg:
>
> [580980.987697] static-buffer-d[1046286]: segfault at 39 ip 00007f4b0fc7a871 sp 00007ffefa681d40 error 4 in libamdhip64.so.5.4.50400[7f4b0fbc1000+37e000]
> [580980.987703] Code: 48 85 f6 0f 84 a6 00 00 00 4c 8d 2d 99 1c 8f 01 48 63 93 b0 00 00 00 4c 8d 25 bb 86 00 00 49 8b 45 00 48 8b 04 d0 48 8b 40 68 <48> 8b 40 18 48 8b 38 48 8b 07 48 8b 80 e8 00 00 00 4c 39 e0 0f 85
>
> It seems that under some conditions test execution doesn't return normally and is killed from the outside.

I suggest we temporarily disable the assert tests on the HIP backend to minimize the impact on the CI system while we investigate the root cause.
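
For reference, disabling a lit test for one backend is typically just a marker in the test header. A hedged sketch of what that looks like (the exact mechanism used in the actual change may differ):

```cpp
// Sketch only: lit skips this test on targets that report the "hip" feature.
// The real RUN lines in these tests are longer; this is the common pattern.
// UNSUPPORTED: hip
// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out
```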

@pvchupin commented Dec 7, 2022

Yes, they've been disabled in intel/llvm-test-suite#1441.

@bader commented Sep 23, 2023

I've added the cuda label because Assert/assert_in_multiple_tus_one_ndebug.cpp failed in a nightly run: https://github.com/intel/llvm/actions/runs/6281231172/job/17059577474.

@JackAKirk commented Sep 28, 2023

There are a few possible causes I've identified so far (for HIP). I also haven't been able to reproduce the specific issue, though.

Does anyone know which ROCm version the CI was using when these test failures were reported, and which version it is currently using? I find that there is an issue (with an admittedly different error output) with a corresponding HIP assert test for ROCm 4, but ROCm 5 versions work. By the way, there was another unrelated HIP driver issue that meant compiling at -O0 failed; it has been fixed in ROCm 5.7.0.

Also, I see that it is using Ubuntu 22.04; you need ROCm 5.3.0 or later to be compatible with 22.04: ROCm/ROCm#1730.

I can see from the CI that it is using an AMD Radeon RX 6700 XT (gfx1031). gfx1031 is not officially supported by ROCm on Linux. I think I remember that in the past the CI used a gfx1030, which is officially supported. I don't know whether switching devices could have led to some CI issues, but I think it makes sense to use an officially supported AMD device on the CI. It seems that lots of unofficially supported AMD GPUs work, at least to some degree, but using one for testing seems like a bad idea. Does someone know when the gfx1031 started being used on the CI?

Thanks

@JackAKirk:

Also, depending on which subversion of 22.04 you use, you may need a later ROCm version. For example, I noticed this:

"

New in version 5.7.0:

Ubuntu 22.04.3 support was added.

"

from https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus

@steffenlarsen:

@aelovikov-intel - Do you know the answer to @JackAKirk's question above?

@aelovikov-intel:

For our amdgpu-2 runner (I assume the same should be for the others, but haven't verified):

# ls -d /opt/rocm-*
/opt/rocm-4.5.1

@JackAKirk commented Oct 3, 2023

> For our amdgpu-2 runner (I assume the same should be for the others, but haven't verified):
>
> # ls -d /opt/rocm-*
> /opt/rocm-4.5.1

OK, that version of ROCm isn't supported on any version of Ubuntu 22.04 (the CI is using Ubuntu 22.04). I suggest upgrading the CI to ROCm 5.7. Also, if possible, an officially supported ROCm GPU should be used, from this list: https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus.

@jinz2014 commented Oct 4, 2023

I tried to build and run the test (assert_in_kernels.cpp) on an MI100 GPU with ROCm 5.7 and CentOS 8.

The output message:

./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
:0:rocdevice.cpp :2692: 6204498005977 us: [pid:2254935 tid:0x153e8b359700] Callback: Queue 0x153a70a00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)

@JackAKirk:

> I tried to build and run the test (assert_in_kernels.cpp) on an MI100 GPU with ROCm 5.7 and CentOS 8.
>
> The output message:
>
> ./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
> ./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
> :0:rocdevice.cpp :2692: 6204498005977 us: [pid:2254935 tid:0x153e8b359700] Callback: Queue 0x153a70a00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
> Aborted (core dumped)

I think this is the expected behavior, right? At least the assert part.
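
For reference, the tests verify that output with FileCheck patterns along roughly these lines (a hypothetical sketch, not the actual CHECK lines from the test sources):

```cpp
// Hypothetical sketch of the style of FileCheck expectations used by these
// lit tests; the exact lines live in assert_in_kernels.cpp and friends.
// CHECK: {{.*}}assert_in_kernels.hpp:25{{.*}}kernelFunc2{{.*}}
// CHECK-SAME: Assertion `Buf[wiID] == 0 && "from assert statement"` failed.
```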

@JackAKirk commented Nov 24, 2023

@aelovikov-intel

We think that the hangs could be due to missing PCIe atomics on the CI bus.
PCIe atomics are stated to be required for ROCm:

https://docs.amd.com/en/docs-5.6.0/release/gpu_os_support.html#cpu-support

We think that assert is one place where PCIe atomics are used.
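
To illustrate the suspicion (a hedged sketch, not the actual DPC++ runtime code): device-side assert plausibly signals the host through a system-scope atomic in host-visible memory, and on a discrete GPU that atomic travels over PCIe. If the bus drops it, the host waits forever:

```cpp
// Hypothetical sketch of device-to-host signalling via system-scope atomics;
// not the actual assert implementation. Compile with -std=c++20 for
// std::atomic_ref on the host side.
#include <sycl/sycl.hpp>
#include <atomic>

int main() {
  sycl::queue Q;
  // Host USM that the device updates across the bus.
  int *Flag = sycl::malloc_host<int>(1, Q);
  *Flag = 0;

  Q.single_task([=] {
    // System-scope atomic store: from a discrete GPU to host memory, this
    // relies on PCIe atomics.
    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::system,
                     sycl::access::address_space::global_space>
        F(*Flag);
    F.store(1);
  });

  // The host polls for the signal. If the PCIe atomic never lands, this
  // spins until the 600 s per-test limit kills the test from outside,
  // consistent with the hangs observed in CI.
  while (std::atomic_ref<int>(*Flag).load(std::memory_order_relaxed) == 0) {
  }

  sycl::free(Flag, Q);
  return 0;
}
```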

Would it be possible to send us the output of

lspci -vv

on the CI (with sudo) to confirm this?

Thanks

@aelovikov-intel commented Nov 27, 2023

On the amdgpu-3 runner natively (not inside the Docker image): lspci.txt, lspci_root.txt

@JackAKirk:

> On the amdgpu-3 runner natively (not inside the Docker image): lspci.txt, lspci_root.txt

Thanks, that's very interesting. The "Internal" PCIe is marked negative for all atomics (however, other PCIe hardware is marked positive), which we guessed would be the relevant hardware for the assert, but we are not 100% sure about this. However, I've realized that amdgpu-3 is the runner that passed all the assert runs I made. It also has a CPU that I had expected to fully support the relevant PCIe atomics.
amdgpu-4 is the one that seems to be leading to the assert timeouts. Would it be possible for you to post an lspci_root.txt for amdgpu-4 as well? It would be very useful to compare the output of the two runners.

Am I right that the only two runners used in the AMD CI are amdgpu-3 and amdgpu-4?

Many thanks for your help with this.

@aelovikov-intel:

lspci_root_amdgpu-4.txt

@JackAKirk:

> lspci_root_amdgpu-4.txt

It seems to be saying that both runners do not support atomics. There is little difference in their output.
Could you also post the output of

lspci -t -vv

for both runners?

Thanks

npmiller referenced this issue Jan 31, 2024:
A few tests in the driver area require amdgpu or nvptx targets to be
built in order to properly run. Add these requirements to the tests.