LIT tests are hanging on CUDA sporadically #1919

Closed
@vladimirlaz

Description

The hang is seen about once in every 20 runs. After a hang, the NVIDIA card goes into a faulty state.

For example, nvidia-smi output after a hang:

{noformat}
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:01:00.0 Off |                  N/A |
|ERR!   48C    P0   ERR! / 280W |    288MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     31576      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests   123MiB |
|    0     31586      C   -                                             19MiB |
|    0     31593      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests    12MiB |
+-----------------------------------------------------------------------------+
{noformat}

Other faulty processes seen over the last week:

{noformat}
|    0     28178      C   ...s/buffer/Output/reinterpret.cpp.tmp.out   123MiB |
|    0     21227      C   ...c_tests/Output/device_event.cpp.tmp.run   123MiB |
|    0      5074      C   ...c_tests/Output/device_event.cpp.tmp.run   123MiB |
|    0      9151      C   ...sts/Output/access_to_subset.cpp.tmp.out   123MiB |
|    0     30549      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests   123MiB |
{noformat}
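
To catch this state automatically between test runs, something like the following could be used. It is only a minimal sketch, not part of the test suite: it assumes the faulty state always shows up as ERR! fields in the nvidia-smi table, as in the output above, and the names gpu_looks_faulty, leftover_processes and SAMPLE are hypothetical. It works on the text of the report only, so it can be tried without a GPU.

```python
import re

# One process row of the nvidia-smi table shown in this report:
# | <gpu> <pid> <type> <process name> <N>MiB |
PROCESS_ROW = re.compile(
    r"^\|\s*(\d+)\s+(\d+)\s+([CG+]+)\s+(\S+)\s+(\d+)MiB\s*\|$"
)

# Excerpt of the output quoted in this issue (whitespace is not significant).
SAMPLE = """\
| 0 TITAN RTX On | 00000000:01:00.0 Off | N/A |
|ERR! 48C P0 ERR! / 280W | 288MiB / 24190MiB | 0% Default |
| 0 31576 C ...ls/sycl/unittests/pi/cuda/./PiCudaTests 123MiB |
| 0 31586 C - 19MiB |
"""


def gpu_looks_faulty(smi_text: str) -> bool:
    """Assumption: a faulty card reports ERR! in at least one table field."""
    return "ERR!" in smi_text


def leftover_processes(smi_text: str):
    """Return (pid, process name, memory in MiB) for every process row."""
    rows = []
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            rows.append(
                (int(match.group(2)), match.group(4), int(match.group(5)))
            )
    return rows


if __name__ == "__main__":
    if gpu_looks_faulty(SAMPLE):
        for pid, name, mib in leftover_processes(SAMPLE):
            print(f"leftover process {pid} ({name}), {mib}MiB")
```

On a real machine the text would come from running nvidia-smi (for example via subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout) after each test run, so the first test that leaves the card in the ERR! state can be identified.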


Labels: cuda (CUDA back-end)
