[BUG] SIGABRT in CUML RF, out of bounds memory usage #4046

Closed
@pseudotensor

Description

Describe the bug
SIGABRT, apparently caused by an out-of-bounds memory access on the GPU.

Steps/Code to reproduce bug

Unknown, but the parameters were as listed below, run on the Kaggle Paribas data with various frequency-encoding features added to reach a shape of (91457, 331).

Parameters:

 OrderedDict([('output_type', 'numpy'), ('random_state', 840607124), ('verbose', False), ('n_estimators', 200), ('n_bins', 128), ('split_criterion', 1), ('max_depth', 18), ('max_leaves', 1024), ('max_features', 'auto'), ('min_samples_leaf', 1), ('min_samples_split', 10), ('min_impurity_decrease', 0.0)])

For a binary classification problem.
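
Since the original dataset is not attached, here is a minimal reproducer sketch using synthetic stand-in data of the same shape and the exact parameters above. The crash may depend on the real data distribution, so this sketch is not guaranteed to trigger it:

```python
# Hypothetical reproducer: synthetic stand-in data of the same shape as the
# real (91457, 331) frequency-encoded dataset, which is not attached here.
import numpy as np
from cuml.ensemble import RandomForestClassifier

rng = np.random.default_rng(840607124)
X = rng.standard_normal((91457, 331), dtype=np.float32)
y = rng.integers(0, 2, size=91457).astype(np.int32)  # binary target

clf = RandomForestClassifier(
    output_type='numpy',
    random_state=840607124,
    verbose=False,
    n_estimators=200,
    n_bins=128,
    split_criterion=1,          # 1 == entropy
    max_depth=18,
    max_leaves=1024,
    max_features='auto',
    min_samples_leaf=1,
    min_samples_split=10,
    min_impurity_decrease=0.0,
)

# The SIGABRT only appeared after roughly 200 fits, so loop the fit call.
for _ in range(200):
    clf.fit(X, y)
```

If this (or the real pipeline) reproduces the crash, running it under CUDA's compute-sanitizer, e.g. `compute-sanitizer python repro.py`, should help localize the out-of-range access reported in dmesg.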

There were no messages in the console at all, even though the fit was run in debug mode with verbose=4. All I got was the SIGABRT, plus the following in dmesg:

[Sun Jul 11 21:15:41 2021] NVRM: GPU at PCI:0000:01:00: GPU-0bb167f8-b3cd-8df7-9644-d5f95716e554
[Sun Jul 11 21:15:41 2021] NVRM: GPU Board Serial Number: 
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 0): Out Of Range Address
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics SM Global Exception on (GPC 3, TPC 3, SM 0): Multiple Warp Errors
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=2041, Graphics Exception: ESR 0x51df30=0xc13000e 0x51df34=0x24 0x51df28=0x4c1eb72 0x51df2c=0x174
[Sun Jul 11 21:15:41 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=6304, Ch 00000088
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Out Of Range Address
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics SM Global Exception on (GPC 4, TPC 2, SM 1): Multiple Warp Errors
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 13, pid=6304, Graphics Exception: ESR 0x5257b0=0xc12000e 0x5257b4=0x24 0x5257a8=0x4c1eb72 0x5257ac=0x174
[Sun Jul 11 21:15:54 2021] NVRM: Xid (PCI:0000:01:00): 43, pid=8874, Ch 00000088
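
For context, Xid 13 is NVIDIA's "Graphics Engine Exception" event, typically raised when a kernel performs an out-of-range address access; this is consistent with the suspected out-of-bounds bug.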

Expected behavior

The fit should not crash; cuML RF should be more stable under repeated use.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Ubuntu 18.04 LTS
  • GPU Model/Driver: RTX2080 460.80
  • CUDA: 11.2.2
  • Method of cuDF & cuML install: conda (nightly 21.08, as of 7 days ago)

conda_list.txt.zip

Additional context

If I hit it again I will try to produce a reproducer, but I expect routine testing on NVIDIA's side will surface it. I had only been using cuML RF for a day and already hit this after (maybe) 200 fits on small data.
