[FEA] cuIO Statistics calculation code is redundant #6920
Description
cuIO has common code for statistics calculation shared between the Parquet writer and the ORC writer. It uses custom logic to perform reductions across chunks of rows, where the chunks are defined by the unit for which the statistics are generated, e.g. pages in the case of Parquet and stripes in the case of ORC.
This can be refactored to use `cub::DeviceSegmentedReduce` with a custom iterator that creates a `statistics_val` from each column element and a custom reduce operator that reduces two `statistics_val`s.
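A minimal sketch of what this could look like, assuming a single non-nullable `int32_t` column with precomputed segment offsets (page or stripe boundaries); `stats_val`, `make_stats`, `merge_stats`, and `segmented_minmax` are hypothetical names, not the existing cuIO types:

```cpp
#include <cub/cub.cuh>
#include <thrust/iterator/transform_iterator.h>

#include <cstdint>
#include <limits>

struct stats_val {  // per-segment statistics being accumulated
  int32_t min;
  int32_t max;
};

struct make_stats {  // builds a stats_val from a single column element
  __host__ __device__ stats_val operator()(int32_t v) const { return {v, v}; }
};

struct merge_stats {  // reduces two stats_vals into one
  __host__ __device__ stats_val operator()(stats_val a, stats_val b) const
  {
    return {a.min < b.min ? a.min : b.min, a.max > b.max ? a.max : b.max};
  }
};

void segmented_minmax(int32_t const* d_col,
                      int const* d_offsets,  // num_segments + 1 segment boundaries
                      stats_val* d_out,      // one result per segment
                      int num_segments,
                      cudaStream_t stream)
{
  auto in = thrust::make_transform_iterator(d_col, make_stats{});
  stats_val const identity{std::numeric_limits<int32_t>::max(),
                           std::numeric_limits<int32_t>::lowest()};

  // First call sizes the temporary storage, second call runs the reduction.
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(nullptr, temp_bytes, in, d_out, num_segments,
                                     d_offsets, d_offsets + 1, merge_stats{}, identity, stream);
  void* d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out, num_segments,
                                     d_offsets, d_offsets + 1, merge_stats{}, identity, stream);
  cudaFree(d_temp);
}
```

In the real implementation the temporary buffer would come from RMM and nulls would be handled via the column's validity mask; both are left out here to keep the sketch short.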
We should also consider performing the reduction in the input columns' cudf types rather than in specially mapped output types. Once the reduction is complete, we can convert the type while encoding if the format calls for it. This would allow us to replace the switch cases written for these dtypes (e.g. here, here, and here) with cudf's type dispatcher, as sketched below.
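A rough sketch of the dispatcher side, assuming a hypothetical `segmented_stats` helper wrapping the cub call above; `dispatch_stats_fn` and `compute_column_stats` are illustrative names, not existing cuIO code:

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <type_traits>

struct dispatch_stats_fn {
  // Numeric types: run the segmented reduction in the column's own type T.
  template <typename T, std::enable_if_t<cudf::is_numeric<T>()>* = nullptr>
  void operator()(cudf::column_view const& col, cudaStream_t stream) const
  {
    // segmented_stats<T>(col.data<T>(), ..., stream);  // cub-based path from above
  }

  // Everything else: reject (or add overloads for strings, timestamps, etc.).
  template <typename T, std::enable_if_t<not cudf::is_numeric<T>()>* = nullptr>
  void operator()(cudf::column_view const&, cudaStream_t) const
  {
    CUDF_FAIL("Unsupported type for statistics calculation");
  }
};

void compute_column_stats(cudf::column_view const& col, cudaStream_t stream)
{
  // One dispatcher call replaces the hand-written switch over dtypes.
  cudf::type_dispatcher(col.type(), dispatch_stats_fn{}, col, stream);
}
```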
This refactor will have the following advantages:
- By using cub's optimized kernels, we cut the line count considerably and de-duplicate functionality.
- We can use cudf's standard `DeviceMin`, `DeviceMax`, and `DeviceSum` operators, which define not only the min/max/sum operations but, more importantly, the corresponding identity values for all current and future cudf types (see the sketch after this list).
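For example, the min reduction could plug cudf's `DeviceMin` and its identity straight into the same cub call, for whatever type `T` the column holds (a sketch, assuming `DeviceMin` from `cudf/detail/utilities/device_operators.cuh`; `segmented_min` is a hypothetical helper):

```cpp
#include <cudf/detail/utilities/device_operators.cuh>

#include <cub/cub.cuh>

template <typename T>
void segmented_min(T const* d_col, int const* d_offsets, T* d_out,
                   int num_segments, cudaStream_t stream)
{
  // The operator supplies both the binary op and the identity for T, so no
  // per-dtype switch is needed to pick the initial value.
  auto const init = cudf::DeviceMin::identity<T>();

  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(nullptr, temp_bytes, d_col, d_out, num_segments,
                                     d_offsets, d_offsets + 1, cudf::DeviceMin{}, init, stream);
  void* d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, d_col, d_out, num_segments,
                                     d_offsets, d_offsets + 1, cudf::DeviceMin{}, init, stream);
  cudaFree(d_temp);
}
```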
Concerns:
The current kernel `gpuMergeColumnStatistics` is launched only once for the entire table, but with `cub::DeviceSegmentedReduce` we'd have one async launch per column. This can be an issue when the table has a large number of columns.
Profiling for feasibility
In some preliminary profiling, I found that the cub kernel performs faster than the existing approach in the case of a single 1 GB column.
To predict the effect of launching one kernel per column, I tried launching 64 cub kernels totaling 1 GB of data. The total time loses to the single-column scenario but still beats the current approach (3.1 ms vs 5.7 ms).