Skip to content

[FEA] cuIO Statistics calculation code is redundant #6920

Closed
@devavret

Description

cuIO has common code for statistics calculation between Parquet writer and ORC writer. This uses custom logic to perform reductions across chunks of rows. These chunks of rows are defined by the unit for which the statistics is generated. e.g. pages in case of parquet and stripes in case of ORC.

This can be refactored to use cub::DeviceSegmentedReduce with a custom iterator that that creates a statistics_val from each column element and a custom reduce operator that reduces between two statistics_vals.

We should also think about using input columns' cudf types rather than specially mapped output types to perform the reduction in. Once the reduction is complete, if the format calls for it, we can convert the type while encoding. This will allow us to replace switch cases made for these dtypes (e.g. here here here) with cudf's type dispatcher.

This will have following advantages:

  1. By using cub's optimized kernels we reduce line count by a lot, thus de-duplicating functionality.
  2. We can use cudf's standard DeviceMin, DeviceMax, and DeviceSum operators that define the min/max/sum operators but more importantly, the respective identity for all current and future cudf types.

Concerns:

The current kernel gpuMergeColumnStatistics is launched only once for the entire table but with cub::DeviceSegmentedReduce, we'd have one async launch per column. This can be an issue when the table has a high number of columns.

Profiling for feasibility

As per some preliminary profiling, I found that the cub kernel performs faster in case of a single 1GB column as compared to the existing approach.
comparison
To predict the effect of launching multiple kernels for columns, I tried launching 64 cub kernels totaling 1GB data. The total resulting time loses to single column scenario but still performs a bit better than the current approach. (3.1 ms vs 5.7 ms)
Screenshot 2020-12-05 at 1 46 50 AM

Metadata

Labels

cuIOcuIO issuelibcudfAffects libcudf (C++/CUDA) code.proposalChange current process or code

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions