Description
Opened on Aug 15, 2023
Is your feature request related to a problem? Please describe.
To be clear, I am not sure whether this is something that CUDF wants to support. If not, that is okay; we can probably implement something ourselves, but it will take a non-trivial amount of work to do it for group-by.
#4706 appears to be very similar, so there could be a lot of overlap between the two, except that the dask documentation says it is "row wise approximate" and lists tdigest as a method in addition to dask. So I think they really are doing approximate percentiles, and CUDF already has some support for tdigest, so maybe it is not the same.
Describe the solution you'd like
I would like two new aggregations and a post-processing kernel. This is very similar to how TDIGEST works, except that instead of computing approximate percentiles it would calculate a 100% accurate percentile.
We would have a COLLECT_FOR_QUANTILE aggregation that behaves a lot like the TDIGEST aggregation does. It would collect the information into some intermediate data structure.
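To make the proposal concrete, here is a minimal host-side sketch, in C++, of what that per-group intermediate state could look like. The quantile_state name and layout are my own illustration, not an existing CUDF type:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-group intermediate state for COLLECT_FOR_QUANTILE:
// distinct non-null values kept sorted, each paired with how many times
// it occurred. This mirrors Spark's value -> count idea in columnar form.
struct quantile_state {
  std::vector<double>  values;  // distinct non-null values, sorted ascending
  std::vector<int64_t> counts;  // counts[i] = occurrences of values[i]
};
```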
Then we would want a MERGE_FOR_QUANTILE aggregation that would behave a lot like the MERGE_TDIGEST aggregation does. It would take as input the output of COLLECT_FOR_QUANTILE and merge one or more of them together.
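Semantically, the merge step would just combine two sorted (value, count) states, summing counts where the same value appears in both. A host-side sketch, reusing the hypothetical quantile_state above (merge_states is an illustrative name, not a proposed API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side sketch of MERGE_FOR_QUANTILE semantics: a two-way merge of
// sorted runs that sums the counts when a value occurs in both inputs.
quantile_state merge_states(quantile_state const& a, quantile_state const& b) {
  quantile_state out;
  std::size_t i = 0, j = 0;
  while (i < a.values.size() && j < b.values.size()) {
    if (a.values[i] < b.values[j]) {
      out.values.push_back(a.values[i]); out.counts.push_back(a.counts[i]); ++i;
    } else if (b.values[j] < a.values[i]) {
      out.values.push_back(b.values[j]); out.counts.push_back(b.counts[j]); ++j;
    } else {  // same value in both inputs: sum the counts
      out.values.push_back(a.values[i]);
      out.counts.push_back(a.counts[i] + b.counts[j]);
      ++i; ++j;
    }
  }
  for (; i < a.values.size(); ++i) { out.values.push_back(a.values[i]); out.counts.push_back(a.counts[i]); }
  for (; j < b.values.size(); ++j) { out.values.push_back(b.values[j]); out.counts.push_back(b.counts[j]); }
  return out;
}
```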
Finally, we would want a quantile API, similar to percentile_approx, that would take the output of the aggregation along with a set of percentiles and return the desired values. Ideally we would have LINEAR interpolation, at least by default, because that is what Spark does.
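For reference, LINEAR interpolation here means a percentile p in [0, 1] maps to the fractional position p * (n - 1) in the expanded, sorted sequence of n values, interpolating between the two neighboring ranks. A sketch of that post-processing step over the hypothetical quantile_state above (exact_percentile is an illustrative name):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Sketch of the post-processing step: compute an exact percentile with
// LINEAR interpolation from the hypothetical (value, count) state above.
double exact_percentile(quantile_state const& s, double p) {
  int64_t n = 0;
  for (auto c : s.counts) n += c;
  if (n == 0) throw std::invalid_argument("empty state");

  double  pos = p * static_cast<double>(n - 1);  // fractional rank
  int64_t k   = static_cast<int64_t>(std::floor(pos));
  double  d   = pos - static_cast<double>(k);

  // Walk the cumulative counts to find the values at ranks k and k + 1.
  auto value_at_rank = [&](int64_t rank) {
    int64_t cum = 0;
    for (std::size_t i = 0; i < s.values.size(); ++i) {
      cum += s.counts[i];
      if (rank < cum) return s.values[i];
    }
    return s.values.back();
  };
  double lo = value_at_rank(k);
  double hi = value_at_rank(k + 1);
  return lo + d * (hi - lo);
}
```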
To describe what Spark does for reference: it builds a map from each value to a count, filtering out nulls before inserting them into the map. That map is the internal data structure Spark uses. It would be nice to try to do something similar if possible.
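As a rough illustration of that per-group state (Spark keeps a hash map internally; std::map is used here only to keep the sketch short, and build_spark_like_state is an illustrative name):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Rough illustration of the per-group state Spark's exact percentile
// builds: a map from value to occurrence count, with nulls filtered out
// before insertion. std::optional stands in for a nullable column value.
std::map<double, int64_t> build_spark_like_state(
    std::vector<std::optional<double>> const& column) {
  std::map<double, int64_t> counts;
  for (auto const& v : column) {
    if (v.has_value()) ++counts[*v];  // skip nulls, accumulate duplicates
  }
  return counts;
}
```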
We could, in theory, implement this in terms of COLLECT_LIST, MERGE_LIST, and then a new quantile/percentile API that would return the result based on the desired percentiles. But if there are a lot of duplicate values, which we suspect to be common, then the amount of data we would end up shuffling would be much larger than what Spark shuffles, and that would slow us down. I don't mind if we want to play games with how the data is encoded, like dropping the counts from the data when all of the counts are 1 (see the sketch below). But to make that happen, the intermediate data returned by MERGE_FOR_QUANTILE and COLLECT_FOR_QUANTILE would have to be generic enough to support both (possibly a LIST or something like that).
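A sketch of that encoding game, under the same assumptions as above: when every count is 1 the counts carry no information and can be dropped, so the intermediate representation has to signal which form it holds. values_only, values_counts, and encode are illustrative names only:

```cpp
#include <cstdint>
#include <variant>
#include <vector>

struct values_only   { std::vector<double> values; };  // all counts == 1
struct values_counts { std::vector<double>  values;
                       std::vector<int64_t> counts; };

// The real intermediate column would need to be generic enough (e.g., a
// LIST column) to carry either form; a variant models that choice here.
using encoded_state = std::variant<values_only, values_counts>;

encoded_state encode(quantile_state const& s) {
  for (auto c : s.counts) {
    if (c != 1) return values_counts{s.values, s.counts};
  }
  return values_only{s.values};  // counts are all 1: drop them
}
```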