New RF backend can be considerably slower depending on `max_depth` and `n_bins`
Initial profiling shows that the `computeSplit` kernels are by far the biggest bottleneck.
### Low-hanging fruit
The new backend exposes the following parameters, which should be tuned according to the depth and the number of samples available in the node currently being split:

- `n_blks_for_cols` - the number of columns processed simultaneously in a single `computeSplit` kernel call. This is a trade-off between memory usage and runtime.
- `n_blks_for_rows` - determines the `gridDim.x` of the `computeSplit` kernels.
Unfortunately, neither of these parameters is currently tuned; both are simply hard-coded to fixed values. We need to tune them to achieve optimal performance (a possible shape for such a heuristic is sketched below).
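As a rough illustration of what depth-aware tuning could look like, here is a minimal sketch. The struct, function name, and thresholds below are hypothetical and not part of the current backend; real values would have to come from benchmarking across depths, node sizes, and GPUs.

```cpp
#include <algorithm>

// Hypothetical launch-parameter heuristic for the computeSplit kernels.
// Everything here (names, thresholds) is illustrative; the backend currently
// hard-codes n_blks_for_cols and n_blks_for_rows instead.
struct SplitKernelBlocks {
  int n_blks_for_cols;  // columns processed per computeSplit kernel call
  int n_blks_for_rows;  // gridDim.x of the computeSplit kernels
};

inline SplitKernelBlocks pickSplitKernelBlocks(int n_rows_in_node,
                                               int n_sampled_cols,
                                               int depth)
{
  SplitKernelBlocks b;
  // Shallow nodes hold many rows: favor more row-blocks and keep the column
  // footprint (and thus memory usage) small. Deeper, smaller nodes can afford
  // to process more columns per call.
  b.n_blks_for_cols = (depth < 4) ? 1 : std::min(8, n_sampled_cols);
  // Roughly one block per 4K rows, clamped to keep gridDim.x reasonable.
  b.n_blks_for_rows = std::max(1, std::min(128, n_rows_in_node / 4096));
  return b;
}
```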
### Near/short-term tasks
- Today we compute the histogram CDFs (in both the classification and regression `computeSplit` kernels) using shared-memory atomics (here and here). Computing CDFs this way for the purpose of computing the split metrics incurs a lot of atomic bank conflicts. One way to improve performance would be to compute PDFs instead and, while computing the metrics, run a prefix scan to obtain the CDFs (see the first sketch after this list).
- Currently, the temporary workspace is allocated for every tree built in the new RF backend. We should move this logic out of the `decisiontree` folder and into `randomforest`, and reuse the workspace across the different trees being built in the same CUDA stream (see the second sketch after this list).
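A minimal sketch of the PDF-then-prefix-scan idea follows. The kernel name, bin layout, and writing the CDF out to global memory are assumptions made for illustration; the real `computeSplit` kernels would consume the CDF in place while evaluating split metrics. The scan here uses CUB's `BlockScan`, but any block-wide scan would do.

```cuda
#include <cub/cub.cuh>

// Illustrative only. Launch with dynamic shared memory of n_bins * sizeof(int)
// and n_bins <= TPB.
template <int TPB>
__global__ void pdfThenCdfSketch(const int* bin_ids, int n_samples, int n_bins,
                                 int* block_cdfs)
{
  extern __shared__ int s_pdf[];  // one counter per bin

  // Phase 1: build the PDF with shared-memory atomics. Each sample touches
  // exactly one bin, so the atomics are spread across bins instead of piling
  // onto running CDF counters.
  for (int b = threadIdx.x; b < n_bins; b += TPB) s_pdf[b] = 0;
  __syncthreads();
  for (int i = blockIdx.x * TPB + threadIdx.x; i < n_samples;
       i += gridDim.x * TPB) {
    atomicAdd(&s_pdf[bin_ids[i]], 1);
  }
  __syncthreads();

  // Phase 2: convert the PDF into a CDF with one block-wide inclusive scan,
  // done once right before the split metric is evaluated.
  typedef cub::BlockScan<int, TPB> BlockScan;
  __shared__ typename BlockScan::TempStorage scan_storage;
  int count = (threadIdx.x < n_bins) ? s_pdf[threadIdx.x] : 0;
  int cdf   = 0;
  BlockScan(scan_storage).InclusiveSum(count, cdf);
  if (threadIdx.x < n_bins) block_cdfs[blockIdx.x * n_bins + threadIdx.x] = cdf;
}
```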
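For the workspace reuse, a rough sketch of the direction, assuming the per-tree workspace size can be bounded up front; `TreeBuildWorkspace` and `buildTree` are hypothetical names, not existing cuML APIs:

```cpp
#include <cstddef>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

// Hypothetical sketch: allocate the scratch buffer once at the randomforest
// level and hand it to each per-tree build, instead of letting decisiontree
// allocate (and free) it for every tree.
struct TreeBuildWorkspace {
  rmm::device_uvector<char> buf;
  TreeBuildWorkspace(std::size_t max_bytes, rmm::cuda_stream_view stream)
    : buf(max_bytes, stream)
  {
  }
};

void buildForest(int n_trees, std::size_t per_tree_workspace_bytes,
                 rmm::cuda_stream_view stream)
{
  // One allocation per stream, reused by every tree built on that stream.
  TreeBuildWorkspace ws(per_tree_workspace_bytes, stream);
  for (int t = 0; t < n_trees; ++t) {
    // buildTree() stands in for the existing per-tree entry point, now taking
    // the preallocated workspace instead of allocating its own:
    // buildTree(t, ws.buf.data(), ws.buf.size(), stream);
  }
}
```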