[FEA] cuIO Statistics calculation code is redundant #6920
Description
cuIO has common code for statistics calculation shared between the Parquet writer and the ORC writer. It uses custom logic to perform reductions across chunks of rows, where the chunks are defined by the unit for which the statistics are generated, e.g. pages in the case of Parquet and stripes in the case of ORC.
This can be refactored to use `cub::DeviceSegmentedReduce` with a custom iterator that creates a `statistics_val` from each column element and a custom reduce operator that reduces two `statistics_val`s.
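A minimal sketch of what this could look like, assuming a single non-nullable `int32_t` column with precomputed segment offsets (page or stripe boundaries); `stats_val`, `make_stats`, `merge_stats`, and `segmented_minmax` are hypothetical names, not the existing cuIO types:

```cpp
#include <cub/cub.cuh>
#include <thrust/iterator/transform_iterator.h>

#include <cstdint>
#include <limits>

struct stats_val {  // per-segment statistics being accumulated
  int32_t min;
  int32_t max;
};

struct make_stats {  // builds a stats_val from a single column element
  __host__ __device__ stats_val operator()(int32_t v) const { return {v, v}; }
};

struct merge_stats {  // reduces two stats_vals into one
  __host__ __device__ stats_val operator()(stats_val a, stats_val b) const
  {
    return {a.min < b.min ? a.min : b.min, a.max > b.max ? a.max : b.max};
  }
};

void segmented_minmax(int32_t const* d_col,
                      int const* d_offsets,  // num_segments + 1 segment boundaries
                      stats_val* d_out,      // one result per segment
                      int num_segments,
                      cudaStream_t stream)
{
  auto in = thrust::make_transform_iterator(d_col, make_stats{});
  stats_val const identity{std::numeric_limits<int32_t>::max(),
                           std::numeric_limits<int32_t>::lowest()};

  // First call sizes the temporary storage, second call runs the reduction.
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(nullptr, temp_bytes, in, d_out, num_segments,
                                     d_offsets, d_offsets + 1, merge_stats{}, identity, stream);
  void* d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, in, d_out, num_segments,
                                     d_offsets, d_offsets + 1, merge_stats{}, identity, stream);
  cudaFree(d_temp);
}
```

In the real implementation the temporary buffer would come from RMM and nulls would be handled via the column's validity mask; both are left out here to keep the sketch short.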
We should also consider performing the reduction in the input columns' cudf types rather than in specially mapped output types. Once the reduction is complete, we can convert the type while encoding if the format calls for it. This would allow us to replace the switch cases written for these dtypes (e.g. here, here, and here) with cudf's type dispatcher, as sketched below.
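A rough sketch of the dispatcher side, assuming a hypothetical `segmented_stats` helper wrapping the cub call above; `dispatch_stats_fn` and `compute_column_stats` are illustrative names, not existing cuIO code:

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <type_traits>

struct dispatch_stats_fn {
  // Numeric types: run the segmented reduction in the column's own type T.
  template <typename T, std::enable_if_t<cudf::is_numeric<T>()>* = nullptr>
  void operator()(cudf::column_view const& col, cudaStream_t stream) const
  {
    // segmented_stats<T>(col.data<T>(), ..., stream);  // cub-based path from above
  }

  // Everything else: reject (or add overloads for strings, timestamps, etc.).
  template <typename T, std::enable_if_t<not cudf::is_numeric<T>()>* = nullptr>
  void operator()(cudf::column_view const&, cudaStream_t) const
  {
    CUDF_FAIL("Unsupported type for statistics calculation");
  }
};

void compute_column_stats(cudf::column_view const& col, cudaStream_t stream)
{
  // One dispatcher call replaces the hand-written switch over dtypes.
  cudf::type_dispatcher(col.type(), dispatch_stats_fn{}, col, stream);
}
```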
This refactor will have the following advantages:
- By using cub's optimized kernels, we cut the line count considerably and de-duplicate functionality.
- We can use cudf's standard `DeviceMin`, `DeviceMax`, and `DeviceSum` operators, which define not only the min/max/sum operations but, more importantly, the corresponding identity values for all current and future cudf types (see the sketch after this list).
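For example, the min reduction could plug cudf's `DeviceMin` and its identity straight into the same cub call, for whatever type `T` the column holds (a sketch, assuming `DeviceMin` from `cudf/detail/utilities/device_operators.cuh`; `segmented_min` is a hypothetical helper):

```cpp
#include <cudf/detail/utilities/device_operators.cuh>

#include <cub/cub.cuh>

template <typename T>
void segmented_min(T const* d_col, int const* d_offsets, T* d_out,
                   int num_segments, cudaStream_t stream)
{
  // The operator supplies both the binary op and the identity for T, so no
  // per-dtype switch is needed to pick the initial value.
  auto const init = cudf::DeviceMin::identity<T>();

  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Reduce(nullptr, temp_bytes, d_col, d_out, num_segments,
                                     d_offsets, d_offsets + 1, cudf::DeviceMin{}, init, stream);
  void* d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Reduce(d_temp, temp_bytes, d_col, d_out, num_segments,
                                     d_offsets, d_offsets + 1, cudf::DeviceMin{}, init, stream);
  cudaFree(d_temp);
}
```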
Concerns:
The current kernel `gpuMergeColumnStatistics` is launched only once for the entire table, but with `cub::DeviceSegmentedReduce` we'd have one async launch per column. This can be an issue when the table has a large number of columns.
Profiling for feasibility
In some preliminary profiling, I found that the cub kernel performs faster than the existing approach in the case of a single 1 GB column.
To predict the effect of launching one kernel per column, I tried launching 64 cub kernels totaling 1 GB of data. The total time loses to the single-column scenario but still beats the current approach (3.1 ms vs 5.7 ms).