Description
The problem occurs about once in 20 runs: after a test hangs, the NVIDIA card goes into a faulty state.
For example:
- http://ci.llvm.intel.com:8010/#/builders/37/builds/1505 - LIT timeout
- http://ci.llvm.intel.com:8010/#/builders/37/builds/1506 - faulty NVIDIA card
The next test job then shows the following nvidia-smi status:
{noformat}
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN RTX           On   | 00000000:01:00.0 Off |                  N/A |
|ERR!   48C    P0   ERR! / 280W |    288MiB / 24190MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     31576      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests   123MiB |
|    0     31586      C   -                                             19MiB |
|    0     31593      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests    12MiB |
+-----------------------------------------------------------------------------+
{noformat}
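The faulty state shown above can be recognized from the nvidia-smi text alone: the fan and power fields read ERR!, and processes from the previous job are still listed. A minimal sketch of such a check follows. The function names are hypothetical, and the row pattern assumes the 418.87.00 table layout quoted in this report with process names that contain no spaces; it is not part of the CI scripts.

```python
import re

# Hypothetical pattern for one row of the "Processes" table, based only on
# the layout quoted in this report: | GPU  PID  Type  Process name  NNNMiB |
PROCESS_ROW = re.compile(
    r"^\|\s+(\d+)\s+(\d+)\s+([CG+]+)\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def gpu_reports_error(smi_output: str) -> bool:
    """Return True if any field of the nvidia-smi table reads ERR!."""
    return "ERR!" in smi_output


def leftover_processes(smi_output: str):
    """Return (gpu, pid, name, mib) tuples for each process row found."""
    rows = []
    for line in smi_output.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            gpu, pid, _type, name, mib = match.groups()
            rows.append((int(gpu), int(pid), name, int(mib)))
    return rows
```

On a CI worker the input would be the captured text output of nvidia-smi taken before a job starts; a job could then be skipped or the worker flagged when `gpu_reports_error` returns True or `leftover_processes` is non-empty.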
Other processes left on the faulty card over the last week:
|    0     28178      C   ...s/buffer/Output/reinterpret.cpp.tmp.out   123MiB |
|    0     21227      C   ...c_tests/Output/device_event.cpp.tmp.run   123MiB |
|    0      5074      C   ...c_tests/Output/device_event.cpp.tmp.run   123MiB |
|    0      9151      C   ...sts/Output/access_to_subset.cpp.tmp.out   123MiB |
|    0     30549      C   ...ls/sycl/unittests/pi/cuda/./PiCudaTests   123MiB |