
[BUG] Thrust 1.12 causes segfault in SVC pytest #3885

Closed
@dantegd

Description

Describe the bug
PR #3844 allows Thrust to be pinned easily, independent of the CTK, but when I pinned it to 1.12 (the latest Thrust release), the PR ran into the following segfault:

cuml/test/test_svm.py::test_svm_skl_cmp_decision_function[params0] Fatal Python error: Aborted

Current thread 0x00007f4948292740 (most recent call first):
  File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/internals/api_decorators.py", line 409 in inner_with_setters
  File "/home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python/cuml/test/test_svm.py", line 304 in test_svm_skl_cmp_decision_function
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/home/galahad/miniconda3/envs/ns0520/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/galahad/miniconda3/envs/ns0520/bin/pytest", line 11 in <module>
[1]    72354 abort (core dumped)  pytest cuml/test/test_svm.py::test_svm_skl_cmp_decision_function -v

I traced the error to the call of SVC.fit, i.e. `cuSVC.fit(X_train, y_train)`, so creating a small reproducer shouldn't be hard.
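A minimal standalone reproducer might look like the sketch below. This is a hypothetical distillation of `test_svm_skl_cmp_decision_function`, not the actual test code: the dataset sizes and SVC parameters are illustrative, and it assumes a cuML build whose libcuml++ was compiled against Thrust 1.12.

```python
# Hypothetical minimal reproducer for the SVC.fit segfault.
# Assumes libcuml++ built with Thrust 1.12 (see PR #3844); parameter
# values are illustrative, not copied from test_svm.py.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a small binary classification problem in float32,
# mirroring the kind of input the pytest feeds to cuML's SVC.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = X.astype(np.float32)
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

try:
    from cuml.svm import SVC  # requires a CUDA-capable GPU and cuML

    cuSVC = SVC(kernel="rbf", C=1.0)
    cuSVC.fit(X_train, y_train)  # segfaults here with Thrust 1.12
    print("fit OK, decision_function shape:",
          cuSVC.decision_function(X_test).shape)
except ImportError:
    # No GPU / cuML in this environment; just show the data layout.
    print("cuML not available; train shapes:", X_train.shape, y_train.shape)
```

If the crash reproduces, running the script under cuda-gdb should surface the same `Warp Illegal Address` exception as the full pytest.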
A quick run of that pytest with cuda-gdb threw the following error:

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/galahad/RAPIDS/0.20/cuml/fea-rapids-cmake/python, configfile: pytest.ini
plugins: benchmark-3.4.1, hypothesis-6.13.0, xdist-2.2.1, cov-2.12.0, timeout-1.4.2, anyio-3.1.0, asyncio-0.12.0, forked-1.3.0
collected 2 items                                                                                                                                                                                                 

cuml/test/test_svm.py [New Thread 0x7fff38634700 (LWP 74042)]
[New Thread 0x7fff40e35700 (LWP 74043)]
warning: Cuda API error detected: cudaMemsetAsync returned (0x1)

warning: Cuda API error detected: cudaGetLastError returned (0x1)


CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555559a87900

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 40, block (0,0,0), thread (192,0,0), device 0, sm 0, warp 5, lane 0]
0x0000555559a87910 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::cuda_cub::__transform::unary_transform_f<thrust::permutation_iterator<thrust::device_ptr<bool>, thrust::device_ptr<int> >, thrust::device_ptr<bool>, thrust::cuda_cub::__transform::no_stencil_tag, thrust::identity<bool>, thrust::cuda_cub::__transform::always_true_predicate>, long)<<<(6,1,1),(256,1,1)>>> ()

Steps/Code to reproduce bug
The easiest way is to use PR #3844 (or wait until it is merged), then change the Thrust version in cpp/cmake/thirdparty/get_thrust to 1.12 and build libcuml++ with it.

Expected behavior
Pytests pass with no issues. Note that this could be a blocker for upgrading Thrust.

Environment details (please complete the following information):

  • Environment location: CI and bare metal
  • Linux Distro/Architecture: Ubuntu 20.04 and CentOS 7
  • GPU Model/Driver: 3080 and V11
  • CUDA: 11.2 using thrust 1.12
  • Method of cuDF & cuML install: from source

cc @tfeher, who might be able to diagnose or triage this much faster than I can

Labels

? - Needs Triage (Need team to review and classify), bug (Something isn't working)
