POC: Implement `HOST_UDF` aggregations #17249

ttnghia · 2024-11-05T19:00:12Z

This PR implements `HOST_UDF` aggregations, which allow aggregation operations to call externally implemented functions.

Usage

Define a function with this signature:

std::unique_ptr<cudf::column> compute_aggregation(cudf::column_view const& values,
                                                  cudf::device_span<cudf::size_type const> group_offsets,
                                                  cudf::device_span<cudf::size_type const> group_labels,
                                                  cudf::size_type num_groups,
                                                  rmm::cuda_stream_view stream,
                                                  rmm::device_async_resource_ref mr);

Create a `HOST_UDF` aggregation instance, passing a pointer to that function as its parameter:

auto agg = cudf::make_host_udf_aggregation<cudf::groupby_aggregation>(compute_aggregation);

Then perform cudf aggregation operations using the created aggregation instance.
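To make the roles of `group_offsets`, `group_labels`, and `num_groups` concrete, here is a minimal host-side sketch in plain C++ (no cudf or CUDA). It assumes the layout typical of sort-based groupby: values are sorted by key, `group_offsets` holds `num_groups + 1` boundaries, and `group_labels` holds one group index per value. The name `labels_from_offsets` is hypothetical and is not part of libcudf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libcudf API): expand group boundaries into one
// group index per value, assuming offsets has num_groups + 1 entries.
std::vector<int> labels_from_offsets(std::vector<int> const& offsets)
{
  assert(!offsets.empty());
  auto const num_groups = offsets.size() - 1;
  std::vector<int> labels(static_cast<std::size_t>(offsets.back()));
  for (std::size_t g = 0; g < num_groups; ++g) {
    // Every value in [offsets[g], offsets[g + 1]) belongs to group g.
    for (int i = offsets[g]; i < offsets[g + 1]; ++i) {
      labels[static_cast<std::size_t>(i)] = static_cast<int>(g);
    }
  }
  return labels;
}
```

For offsets `{0, 2, 5, 6}` (three groups of sizes 2, 3, and 1), this yields labels `{0, 0, 1, 1, 1, 2}`.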

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

ttnghia · 2024-11-05T19:04:09Z

Working examples: https://github.com/rapidsai/cudf/pull/17249/files#diff-4997abf87c87e8b0d2382de63631f8671d484174f6bb1dc26453e49e77d395fa

revans2 · 2024-11-05T19:21:17Z

cpp/tests/groupby/host_udf_tests.cu

+    cudf::data_type{cudf::type_id::INT32}, values.size(), cudf::mask_state::UNALLOCATED, stream);
+  thrust::transform(rmm::exec_policy(stream),
+                    thrust::make_counting_iterator(0),
+                    thrust::make_counting_iterator(values.size()),


Could we have some examples that actually do an aggregation? This is producing an output for each input value, but aggregations are supposed to produce an output for each group. Right?

ttnghia (Nov 5, 2024): Yes, this is just a simple transformation. The input parameters carry enough information (group offsets and labels), so we can implement any computation we want on the group values.

ttnghia (Nov 5, 2024): I'm adding such examples now.

ttnghia (Nov 5, 2024, edited): The examples now compute values for each individual group. For example:

For each group: compute (group_idx + 1) * values^2 * 2
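As a host-side illustration of a per-group result (one output per group rather than one per input row), the sketch below uses plain C++ in place of cudf columns and Thrust. The comment above does not say how the squared values are reduced within a group; summing them is an assumption made here, and `aggregate_groups` is a hypothetical name, not code from the PR's tests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a HOST_UDF body (not the PR's test code).
// Produces one value per group: (group_idx + 1) * sum(values^2) * 2.
// Summing within each group is an assumption; offsets has num_groups + 1 entries.
std::vector<long long> aggregate_groups(std::vector<int> const& values,
                                        std::vector<int> const& offsets)
{
  assert(!offsets.empty());
  assert(static_cast<std::size_t>(offsets.back()) == values.size());
  auto const num_groups = offsets.size() - 1;
  std::vector<long long> result(num_groups);
  for (std::size_t g = 0; g < num_groups; ++g) {
    long long sum_sqr = 0;
    for (int i = offsets[g]; i < offsets[g + 1]; ++i) {
      auto const v = static_cast<long long>(values[static_cast<std::size_t>(i)]);
      sum_sqr += v * v;
    }
    result[g] = static_cast<long long>(g + 1) * sum_sqr * 2;
  }
  return result;
}
```

With values `{1, 2, 3, 4, 5, 6}` and offsets `{0, 2, 5, 6}`, the three groups produce `{10, 200, 216}`.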

ttnghia · 2024-11-05T19:28:06Z

cpp/include/cudf/aggregation.hpp

+using host_udf_func_type = std::function<std::unique_ptr<column>(column_view const&,
+                                                                 device_span<size_type const>,
+                                                                 device_span<size_type const>,
+                                                                 size_type,
+                                                                 rmm::cuda_stream_view,
+                                                                 rmm::device_async_resource_ref)>;


This only passes the group values (the first column parameter). We should probably pass the group keys too, so that the function has all the group information needed for generic computation.

wence- · 2024-11-06T09:24:31Z

cpp/src/groupby/sort/aggregate.cpp

+  // TODO: Add a name string to the aggregation so that we can look up different host UDFs.
+  if (cache.has_result(values, agg)) { return; }


suggestion: Why not ask the implementer of the host udf to provide hash and equality, like the other aggregations have?

ttnghia (Nov 6, 2024): Providing a name string should be enough for hashing the aggregation here, since we will hash the pair {aggregation::kind, udf_name_str}. That will be much simpler than providing hash and equality functors.

ttnghia (Nov 19, 2024): Sorry, I was wrong. Providing hash and equality operators does seem to be the right way to go. I've added an interface for that in my latest commit: 57674e1
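A minimal sketch of what "the UDF provides hash and equality" could look like, in plain C++. This is not the interface added in 57674e1; the names `udf_base`, `do_hash`, `is_equal`, `scaled_sum_udf`, `udf_hasher`, and `udf_equal` are all hypothetical and chosen only for illustration. The idea is that a result cache keyed on aggregations can distinguish two UDFs by asking the UDF objects themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical interface (illustrative names, not the PR's host_udf_base):
// each UDF supplies its own hash and equality.
struct udf_base {
  virtual ~udf_base() = default;
  virtual std::size_t do_hash() const = 0;
  virtual bool is_equal(udf_base const& other) const = 0;
};

// Example UDF whose identity is its type plus a single parameter.
struct scaled_sum_udf : udf_base {
  explicit scaled_sum_udf(int scale) : scale_{scale} {}
  std::size_t do_hash() const override
  {
    return std::hash<std::string>{}("scaled_sum_udf") ^ std::hash<int>{}(scale_);
  }
  bool is_equal(udf_base const& other) const override
  {
    // Equal only if the other object is the same UDF type with the same parameter.
    auto const* o = dynamic_cast<scaled_sum_udf const*>(&other);
    return o != nullptr && o->scale_ == scale_;
  }

 private:
  int scale_;
};

// Functors that forward to the virtual methods so standard hashed containers
// can use UDF pointers as keys.
struct udf_hasher {
  std::size_t operator()(udf_base const* u) const
  {
    assert(u != nullptr);
    return u->do_hash();
  }
};
struct udf_equal {
  bool operator()(udf_base const* a, udf_base const* b) const { return a->is_equal(*b); }
};

using udf_cache = std::unordered_set<udf_base const*, udf_hasher, udf_equal>;
```

With this shape, two separately constructed `scaled_sum_udf{2}` objects hit the same cache entry, while `scaled_sum_udf{3}` does not. A plain name string could not make that distinction unless the parameters were encoded into the name.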

ttnghia added 2 commits November 5, 2024 10:55

Implement host udf aggregation

bba150c

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

Add test

04e2bda

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

ttnghia self-assigned this Nov 5, 2024

github-actions bot added the libcudf and CMake labels Nov 5, 2024

ttnghia added the feature request, 2 - In Progress, and Spark labels and removed the CMake label Nov 5, 2024

revans2 reviewed Nov 5, 2024

ttnghia commented Nov 5, 2024

Change example to compute aggregation on each group

5f7ab2b

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

github-actions bot added the CMake label Nov 5, 2024

wence- reviewed Nov 6, 2024

ttnghia added 2 commits November 19, 2024 12:07

Merge branch 'branch-25.02' into host_udf

7c9316a

Add host_udf_base class

57674e1

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

github-actions bot added the Python, Java, cudf.pandas, cudf.polars, and pylibcudf labels Nov 19, 2024

ttnghia changed the base branch from branch-24.12 to branch-25.02 November 19, 2024 23:56

ttnghia removed the Python, CMake, Java, cudf.pandas, cudf.polars, and pylibcudf labels Nov 19, 2024
