Add a shortcut for when the input clusters are all empty for the tdigest merge #16897

Merged
merged 8 commits into from
Oct 1, 2024
improve docs
jihoonson committed Sep 25, 2024
commit bc1f712972bcd9e71cf5dd445d77ba9601378a55
21 changes: 10 additions & 11 deletions cpp/src/quantiles/tdigest/tdigest_aggregation.cu
@@ -1053,16 +1053,15 @@ struct group_key_func {
*
* A tdigest cluster can be empty in the input, which means that there was no valid input data to
* generate that cluster. These empty clusters are currently stored differently in different parts
- * of the tdigest column. They are "compressed" in the `means`, `weights`, and `offsets` columns,
- * but not in the `min` and `max` columns.
+ * of the tdigest column. They are more explicit in the `min` and `max` columns than in the `means`,
+ * `weights` columns.
* - The `means` and `weights` columns do not contain values for empty clusters.
- * - The empty clusters are stored as two consecutive same values. For example, given an offsets
- * column of (0, 1, 1, 2), the second cluster where its offset is 1 is empty. Note that the offsets
- * are the offsets for the means and the weights.
+ * - In the `offsets` column for the means and weights, the offsets to empty clusters are
+ * stored so that their size is 0. For example, given an offsets column of (0, 1, 1, 2),
+ * the second cluster where its both start and end offsets are 1 is empty.
* - The `min` and `max` columns contain 0s for the empty clusters.
*
- * @param tdv input tdigests. These should have been sorted by the key, but may have not by the
- * centroid mean within each group.
+ * @param tdv input tdigests. The tdigests within this column are grouped by key.
* @param h_group_offsets a host iterator of the offsets to the start of each group. A group is
* counted as one even when the cluster is empty in it. The offsets should have the same values as
* the ones in `group_offsets`.
@@ -1073,7 +1072,7 @@ struct group_key_func {
* empty clusters.
* @param num_group_labels the number of unique group labels.
* @param num_groups the number of groups.
- * @param max_centroids the maximum number of centroids (clusters) in the tdigest.
+ * @param max_centroids the maximum number of centroids (clusters) in the output (merged) tdigest.
* @param stream CUDA stream
* @param mr device memory resource
*
@@ -1090,9 +1089,9 @@ std::unique_ptr<column> merge_tdigests(tdigest_column_view const& tdv,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
{
- // Sort the tdigests by the centroid mean within each group and then pass them to
- // `compute_tdigests()`. Note that the input has been sorted by the key, but has not by the mean
- // yet within each group. For sorting by the key, see
+ // The core logic of `merge_tdigests()` is to sort the tdigests by the centroid mean within each
+ // group and then pass them to `compute_tdigests()`. Note that the individual tdigests in the
+ // input have been grouped together by key. For grouping the input by key, see
// `store_result_functor::get_grouped_values()`.
//
// NOTE: the current implementation is quite complex and involves offset copy from the device to
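
The offsets convention described in the updated doc comment can be illustrated with a small host-side sketch. This is not code from the PR or from libcudf: `is_empty_cluster` and `all_clusters_empty` are hypothetical helpers, and the offsets are assumed to be a plain non-decreasing `std::vector<int>` with one entry per cluster plus a trailing end offset. Under those assumptions, a cluster is empty when its start and end offsets are equal, and the "all empty" case named in the PR title corresponds to the first and last offsets being equal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libcudf API): cluster `i` spans
// [offsets[i], offsets[i + 1]) in the means/weights columns, so it is empty
// when both offsets are equal.
bool is_empty_cluster(std::vector<int> const& offsets, std::size_t i)
{
  assert(i + 1 < offsets.size());
  return offsets[i] == offsets[i + 1];
}

// Hypothetical helper (not a libcudf API): offsets are assumed to be
// non-decreasing, so every cluster is empty exactly when the first and last
// offsets are equal (no means or weights are stored at all).
bool all_clusters_empty(std::vector<int> const& offsets)
{
  return offsets.empty() || offsets.front() == offsets.back();
}
```

With the offsets (0, 1, 1, 2) from the doc comment, `is_empty_cluster` reports only the second cluster (index 1) as empty, and `all_clusters_empty` returns false. How the actual shortcut in `merge_tdigests()` detects the all-empty case is not shown in this hunk, so the check above should be read only as one plausible formulation.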