[FEA] distributed quantile group by aggregations and reductions #13885

Closed

Description

Is your feature request related to a problem? Please describe.

To be clear, I am not sure whether this is something that cuDF wants. If not, that is okay; we can probably implement something ourselves, but it will take a non-trivial amount of work to do for group-by.

#4706 appears to be very similar, so there could be a lot of overlap between the two, except that the Dask documentation says its implementation is "row wise approximate" and offers tdigest as a method in addition to dask. So I think they really are doing an approximate percentile, and cuDF already has some support for t-digest. So maybe it is not the same.

Describe the solution you'd like
I would like two new aggregations and a post-processing kernel. This is very similar to how TDIGEST works, except that instead of computing approximate percentiles it would calculate 100% accurate percentiles.

We would have a COLLECT_FOR_QUANTILE aggregation that behaves a lot like the TDIGEST aggregation does. It would collect the information into some intermediate data structure.

Then we would want a MERGE_FOR_QUANTILE aggregation that behaves a lot like the MERGE_TDIGEST aggregation does. It would take as input the output of COLLECT_FOR_QUANTILE and merge one or more of them together.

Finally, we would want a quantile API, similar to percentile_approx, that would take the output of the aggregation along with a set of percentiles and return the desired values. Ideally we would have LINEAR interpolation, at least by default, because that is what Spark does.
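Purely as a sketch of the shape this could take, here is what the new pieces might look like, modeled on the existing make_tdigest_aggregation / make_merge_tdigest_aggregation / percentile_approx trio. None of these declarations exist in libcudf today; the names (make_collect_for_quantile_aggregation, make_merge_for_quantile_aggregation, exact_quantile) are placeholders, and only cudf::interpolation::LINEAR is an existing symbol.

```cpp
// Hypothetical declarations only -- none of these symbols exist in libcudf
// today. They mirror the tdigest aggregation trio described above.

#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <vector>

namespace cudf {

// Collects each group's values into a lossless intermediate structure
// (analogous to the TDIGEST aggregation, but exact).
template <typename Base = groupby_aggregation>
std::unique_ptr<Base> make_collect_for_quantile_aggregation();

// Merges one or more intermediates produced by COLLECT_FOR_QUANTILE
// (analogous to MERGE_TDIGEST).
template <typename Base = groupby_aggregation>
std::unique_ptr<Base> make_merge_for_quantile_aggregation();

// Post-processing step (analogous to percentile_approx): given the merged
// intermediate and the requested quantiles in [0, 1], returns exact values,
// defaulting to LINEAR interpolation to match Spark.
std::unique_ptr<column> exact_quantile(
  column_view const& quantile_state,
  std::vector<double> const& quantiles,
  interpolation interp = interpolation::LINEAR);

}  // namespace cudf
```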

For reference, here is what Spark does: it builds a map from each value to a count, filtering out nulls before inserting them into the map. That map is the internal data structure used by Spark. It would be nice to try to do something similar if possible.
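To make the exact computation concrete, here is a minimal, self-contained sketch (plain C++, no cudf) of how an exact quantile with LINEAR interpolation falls out of a Spark-style value-to-count map. The function name and the use of std::map are illustrative only; on the GPU the same data would more likely be a sorted list of (value, count) pairs.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <map>

// Exact quantile (q in [0, 1]) with LINEAR interpolation over a
// value -> count map, the Spark-style intermediate described above.
// Assumes nulls were filtered out before insertion into the map.
double exact_quantile(std::map<double, int64_t> const& counts, double q)
{
  if (counts.empty()) { return std::numeric_limits<double>::quiet_NaN(); }

  // Total number of (non-null) rows represented by the map.
  int64_t n = 0;
  for (auto const& kv : counts) { n += kv.second; }

  // Fractional rank of q among n sorted values: rank = q * (n - 1).
  double const rank = q * static_cast<double>(n - 1);
  auto const lo     = static_cast<int64_t>(std::floor(rank));
  double const frac = rank - static_cast<double>(lo);

  // Walk the sorted (value, count) pairs to find the values at sorted
  // positions lo and lo + 1, then interpolate between them.
  int64_t seen = 0;
  double v_lo = 0.0, v_hi = 0.0;
  bool found_lo = false;
  for (auto const& [value, count] : counts) {
    if (!found_lo && seen + count > lo) {
      v_lo     = value;
      found_lo = true;
    }
    if (seen + count > lo + 1) {
      v_hi = value;
      break;
    }
    seen += count;
    v_hi = value;  // covers the case where lo + 1 falls past the last pair
  }
  return v_lo + frac * (v_hi - v_lo);
}
```

For example, with the map {1.0: 3, 5.0: 1} the median (q = 0.5) has rank 0.5 * 3 = 1.5, so it interpolates between sorted positions 1 and 2, both of which are 1.0, giving 1.0.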

We could, in theory, implement this in terms of COLLECT_LIST, MERGE_LISTS, and a new quantile/percentile API that would return the result based on the desired percentiles. But if there are a lot of duplicate values, which we suspect is common, then the amount of data we would end up shuffling would be much larger than what Spark shuffles, and that would slow us down. I don't mind playing games with how the data is encoded; for example, if all of the counts are 1, we could drop the counts from the data. But to make that happen, the intermediate data returned by COLLECT_FOR_QUANTILE and MERGE_FOR_QUANTILE would have to be generic enough to support both encodings (possibly a LIST or something like that).
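To illustrate the encoding game mentioned above, here is a host-side sketch (not a proposal for the actual column layout) of how the intermediate could stay generic while dropping the counts when they are all 1; the struct and field names are made up for the example.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative host-side encoding of one group's intermediate state:
// sorted distinct values plus parallel counts. If every count is 1, the
// counts vector is left empty so it costs nothing to shuffle; a decoder
// treats an empty counts vector as "all ones".
struct quantile_state {
  std::vector<double> values;   // sorted, distinct, nulls already removed
  std::vector<int64_t> counts;  // empty => every value occurs exactly once
};

quantile_state encode(std::vector<std::pair<double, int64_t>> const& sorted_pairs)
{
  quantile_state state;
  bool all_ones = true;
  for (auto const& [value, count] : sorted_pairs) {
    state.values.push_back(value);
    state.counts.push_back(count);
    all_ones = all_ones && (count == 1);
  }
  if (all_ones) { state.counts.clear(); }  // drop the redundant counts
  return state;
}
```

In column terms, the same idea could be a STRUCT holding a values LIST and a counts LIST per group, where an empty counts list means every count is 1.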


Labels

0 - Backlog (In queue waiting for assignment)
Spark (Functionality that helps Spark RAPIDS)
feature request (New feature or request)
libcudf (Affects libcudf (C++/CUDA) code)
