
[TRACKER] Improve performance of the new RF backend #3527

Open
@teju85

Description


The new RF backend can be considerably slower depending on `max_depth` and `n_bins`. Initial profiling shows that the `computeSplit` kernels are by far the biggest bottleneck.

low-hanging fruit

The new backend exposes the following parameters, which should be tuned based on the current depth and the number of samples in the node being split:

  1. `n_blks_for_cols` - the number of columns processed simultaneously in a single `computeSplit` kernel call. This is a trade-off between memory usage and runtime.
  2. `n_blks_for_rows` - determines the `gridDim.x` of the `computeSplit` kernels.

Sadly, neither of these parameters is currently tuned; both are hard-coded to fixed values. We need to tune them to achieve optimal performance.

near-term tasks

  1. Today we compute the histogram CDFs (in both the classification and regression `computeSplit` kernels) using shared-memory atomics (here and here). Building CDFs this way incurs heavy atomic contention and bank conflicts. One way to improve performance would be to compute PDFs instead and, while computing the metrics, run a prefix scan to obtain the CDFs.
  2. Currently, the temporary workspace is allocated anew for every tree built in the new RF backend. We should move this logic out of the `decisiontree` folder into `randomforest` and reuse the workspace across the trees being built in the same CUDA stream.

Metadata

Assignees

No one assigned

    Labels

    CUDA / C++ (CUDA issue), Perf (Related to runtime performance of the underlying code), inactive-30d
