[FEA] distributed quantile group by aggregations and reductions #13885

Closed

Description

Is your feature request related to a problem? Please describe.

To be clear, I am not sure whether this is something that cuDF wants. If not, that is okay; we can probably implement something ourselves, but it will take a non-trivial amount of work to do for group-by.

#4706 appears to be very similar, so there could be a lot of overlap between the two, except that the Dask documentation says its implementation is "row wise approximate" and offers tdigest as a method in addition to dask. So I think they really are doing an approximate percentile, and cuDF already has some support for t-digest. So maybe it is not the same.

Describe the solution you'd like
I would like two new aggregations and a post-processing kernel. This is very similar to how TDIGEST works, except that instead of computing approximate percentiles it would calculate 100% accurate percentiles.

We would have a COLLECT_FOR_QUANTILE aggregation that behaves a lot like the TDIGEST aggregation does. It would collect the information into some intermediate data structure.

Then we would want a MERGE_FOR_QUANTILE aggregation that behaves a lot like the MERGE_TDIGEST aggregation does. It would take as input the output of COLLECT_FOR_QUANTILE and merge one or more of them together.

Finally, we would want a quantile API, similar to percentile_approx, that would take the output of the aggregation along with a set of percentiles and return the desired values. Ideally we would have LINEAR interpolation, at least by default, because that is what Spark does.
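Purely as a sketch of the shape this could take, here is what the new pieces might look like, modeled on the existing make_tdigest_aggregation / make_merge_tdigest_aggregation / percentile_approx trio. None of these declarations exist in libcudf today; the names (make_collect_for_quantile_aggregation, make_merge_for_quantile_aggregation, exact_quantile) are placeholders, and only cudf::interpolation::LINEAR is an existing symbol.

```cpp
// Hypothetical declarations only -- none of these symbols exist in libcudf
// today. They mirror the tdigest aggregation trio described above.

#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <vector>

namespace cudf {

// Collects each group's values into a lossless intermediate structure
// (analogous to the TDIGEST aggregation, but exact).
template <typename Base = groupby_aggregation>
std::unique_ptr<Base> make_collect_for_quantile_aggregation();

// Merges one or more intermediates produced by COLLECT_FOR_QUANTILE
// (analogous to MERGE_TDIGEST).
template <typename Base = groupby_aggregation>
std::unique_ptr<Base> make_merge_for_quantile_aggregation();

// Post-processing step (analogous to percentile_approx): given the merged
// intermediate and the requested quantiles in [0, 1], returns exact values,
// defaulting to LINEAR interpolation to match Spark.
std::unique_ptr<column> exact_quantile(
  column_view const& quantile_state,
  std::vector<double> const& quantiles,
  interpolation interp = interpolation::LINEAR);

}  // namespace cudf
```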

For reference, here is what Spark does: it builds a map from each value to a count, filtering out nulls before inserting them into the map. That map is the internal data structure used by Spark. It would be nice to try to do something similar if possible.
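To make the exact computation concrete, here is a minimal, self-contained sketch (plain C++, no cudf) of how an exact quantile with LINEAR interpolation falls out of a Spark-style value-to-count map. The function name and the use of std::map are illustrative only; on the GPU the same data would more likely be a sorted list of (value, count) pairs.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <map>

// Exact quantile (q in [0, 1]) with LINEAR interpolation over a
// value -> count map, the Spark-style intermediate described above.
// Assumes nulls were filtered out before insertion into the map.
double exact_quantile(std::map<double, int64_t> const& counts, double q)
{
  if (counts.empty()) { return std::numeric_limits<double>::quiet_NaN(); }

  // Total number of (non-null) rows represented by the map.
  int64_t n = 0;
  for (auto const& kv : counts) { n += kv.second; }

  // Fractional rank of q among n sorted values: rank = q * (n - 1).
  double const rank = q * static_cast<double>(n - 1);
  auto const lo     = static_cast<int64_t>(std::floor(rank));
  double const frac = rank - static_cast<double>(lo);

  // Walk the sorted (value, count) pairs to find the values at sorted
  // positions lo and lo + 1, then interpolate between them.
  int64_t seen = 0;
  double v_lo = 0.0, v_hi = 0.0;
  bool found_lo = false;
  for (auto const& [value, count] : counts) {
    if (!found_lo && seen + count > lo) {
      v_lo     = value;
      found_lo = true;
    }
    if (seen + count > lo + 1) {
      v_hi = value;
      break;
    }
    seen += count;
    v_hi = value;  // covers the case where lo + 1 falls past the last pair
  }
  return v_lo + frac * (v_hi - v_lo);
}
```

For example, with the map {1.0: 3, 5.0: 1} the median (q = 0.5) has rank 0.5 * 3 = 1.5, so it interpolates between sorted positions 1 and 2, both of which are 1.0, giving 1.0.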

We could, in theory, implement this in terms of COLLECT_LIST, MERGE_LISTS, and a new quantile/percentile API that would return the result based on the desired percentiles. But if there are a lot of duplicate values, which we suspect is common, then the amount of data we would end up shuffling would be much larger than what Spark shuffles, and that would slow us down. I don't mind playing games with how the data is encoded; for example, if all of the counts are 1, we could drop the counts from the data. But to make that happen, the intermediate data returned by COLLECT_FOR_QUANTILE and MERGE_FOR_QUANTILE would have to be generic enough to support both encodings (possibly a LIST or something like that).
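To illustrate the encoding game mentioned above, here is a host-side sketch (not a proposal for the actual column layout) of how the intermediate could stay generic while dropping the counts when they are all 1; the struct and field names are made up for the example.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative host-side encoding of one group's intermediate state:
// sorted distinct values plus parallel counts. If every count is 1, the
// counts vector is left empty so it costs nothing to shuffle; a decoder
// treats an empty counts vector as "all ones".
struct quantile_state {
  std::vector<double> values;   // sorted, distinct, nulls already removed
  std::vector<int64_t> counts;  // empty => every value occurs exactly once
};

quantile_state encode(std::vector<std::pair<double, int64_t>> const& sorted_pairs)
{
  quantile_state state;
  bool all_ones = true;
  for (auto const& [value, count] : sorted_pairs) {
    state.values.push_back(value);
    state.counts.push_back(count);
    all_ones = all_ones && (count == 1);
  }
  if (all_ones) { state.counts.clear(); }  // drop the redundant counts
  return state;
}
```

In column terms, the same idea could be a STRUCT holding a values LIST and a counts LIST per group, where an empty counts list means every count is 1.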


Labels

0 - Backlog (In queue waiting for assignment)
Spark (Functionality that helps Spark RAPIDS)
feature request (New feature or request)
libcudf (Affects libcudf (C++/CUDA) code)
