Describe the bug
PR #3844 allows for easily pinning thrust independently of the CTK, but when I pinned it to 1.12 (the latest thrust release), the PR ran into the following segfault:
cuml/test/test_svm.py::test_svm_skl_cmp_decision_function[params0] Fatal Python error: Aborted
Current thread 0x00007f4948292740 (most recent call first):
File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/test/test_svm.py", line 304 in test_svm_skl_cmp_decision_function
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/home/galahad/miniconda3/envs/ns0520/bin/pytest", line 11 in <module>
[1] 72354 abort (core dumped) pytest cuml/test/test_svm.py::test_svm_skl_cmp_decision_function -v
I traced the error to the call of SVC.fit in cuml/python/cuml/test/test_svm.py, line 304 (commit 05124c4).
A quick run of that pytest under cuda-gdb produced the following error:
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python, configfile: pytest.ini
plugins: benchmark-3.4.1, hypothesis-6.13.0, xdist-2.2.1, cov-2.12.0, timeout-1.4.2, anyio-3.1.0, asyncio-0.12.0, forked-1.3.0
collected 2 items
cuml/test/test_svm.py [New Thread 0x7fff38634700 (LWP 74042)]
[New Thread 0x7fff40e35700 (LWP 74043)]
warning: Cuda API error detected: cudaMemsetAsync returned (0x1)
warning: Cuda API error detected: cudaGetLastError returned (0x1)
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555559a87900
Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 40, block (0,0,0), thread (192,0,0), device 0, sm 0, warp 5, lane 0]
0x0000555559a87910 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long)<<<(6,1,1),(256,1,1)>>> ()
Steps/Code to reproduce bug
The easiest way is to use PR #3844 (or wait until it is merged), change the thrust version in cpp/cmake/thirdparty/get_thrust to 1.12, and build libcuml++ with it.
Expected behavior
Pytests pass with no issues. This could also be a blocker for upgrading thrust.
Environment details (please complete the following information):
- Environment location: CI and bare metal
- Linux Distro/Architecture: Ubuntu 20.04 and CentOS 7
- GPU Model/Driver: 3080 and V11
- CUDA: 11.2 using thrust 1.12
- Method of cuDF & cuML install: from source
cc @tfeher who might be able to diagnose or triage this much faster than myself