
Conversation

@bkietz (Member) commented Mar 2, 2021

This patch adds basic building blocks for grouped aggregation:

  • Grouper, which produces integer arrays of group ids from batches of keys
  • HashAggregateKernel, which consumes batches of arguments and group ids, updating internal sums/counts/...

For testing purposes, a one-shot grouped aggregation function is provided:

std::shared_ptr<arrow::Array> needs_sum = ...;
std::shared_ptr<arrow::Array> needs_min_max = ...;
std::shared_ptr<arrow::Array> key_0 = ...;
std::shared_ptr<arrow::Array> key_1 = ...;

ARROW_ASSIGN_OR_RAISE(arrow::Datum out,
  arrow::compute::internal::GroupBy({
    needs_sum,
    needs_min_max,
  }, {
    key_0,
    key_1,
  }, {
    {"sum", nullptr},  // first argument will be summed
    {"min_max", &min_max_options},  // second argument's extrema will be found
  }));

// Unpack struct array result (a four-field array)
auto out_array = out.array_as<StructArray>();
std::shared_ptr<arrow::Array> sums = out_array->field(0);
std::shared_ptr<arrow::Array> mins_and_maxes = out_array->field(1);
std::shared_ptr<arrow::Array> group_key_0 = out_array->field(2);
std::shared_ptr<arrow::Array> group_key_1 = out_array->field(3);

@bkietz bkietz requested review from pitrou and wesm March 2, 2021 21:42
@wesm (Member) left a comment

Before digging into the details too much, my main issue with what I see is that I don't agree with making hash aggregation a callable function through CallFunction.

In the context of a query engine, the interface for this operator looks something like:

class ExecNode {
 public:
  virtual void Push(int node_index, const ExecBatch& batch) = 0;
  virtual void PushDone(int node_index) = 0;
};

class HashAggregationNode : public ExecNode {
  ...
}; 

(some query engines use a "pull"-based model, in which the data flow is inverted — there are pros and cons to both approaches, see https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf)

If you want a simple one-shot version of the algorithm (rather than a general "streaming" one like the above), then you can break the input data into record batches of the desired size (e.g. 4K - 64K rows, depending on heuristics) and then push the chunks into the node that you create (note that the HashAggregationNode should push its result into a terminal "OutputNode" when you invoke hash_agg_node->PushDone(0)).
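
For example, a one-shot driver on top of that interface could look roughly like this (purely illustrative; OutputNode, MakeHashAggregationNode, SliceIntoBatches and aggregate_options are placeholder names, not an existing API):

// Hypothetical one-shot driver for a push-based hash aggregation node.
auto sink = std::make_shared<OutputNode>();  // terminal node collecting the result
auto hash_agg_node = MakeHashAggregationNode(aggregate_options, sink);

// Slice the input into record batches of a heuristically chosen size
// (e.g. 4K - 64K rows) and push them into the node.
for (const ExecBatch& batch : SliceIntoBatches(input_table, /*rows_per_batch=*/16 * 1024)) {
  hash_agg_node->Push(/*node_index=*/0, batch);
}

// Signal end of input; the node pushes its accumulated output into `sink`.
hash_agg_node->PushDone(/*node_index=*/0);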

The API can be completely crude / preliminary, but would it be possible to use the query-engine-type approach for this? I think it would be best to start taking some strides in this direction rather than bolting this onto the array function execution machinery, which doesn't make sense in a query processing context (because aggregation is fundamentally a streaming algorithm).

On the hash aggregation functions themselves, perhaps it makes sense to add a HASH_AGGREGATE function type and define the kernel interface for these functions, then look up these functions using the general dispatch machinery?
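
A rough sketch of what such a kernel interface might look like (illustrative only; the callback names and signatures here are assumptions, not a settled design):

// Hypothetical shape of a HASH_AGGREGATE kernel: per-group state is resized
// as new groups appear, updated batch by batch, and finalized at the end.
struct HashAggregateKernel {
  // Ensure state exists for `num_groups` groups (the current hash table cardinality).
  std::function<Status(KernelContext*, int64_t num_groups)> resize;

  // Update per-group state from a batch of argument values and their group ids.
  std::function<Status(KernelContext*, const ExecBatch& batch,
                       const ArrayData& group_ids)> consume;

  // Produce one output value per group.
  std::function<Result<Datum>(KernelContext*)> finalize;
};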


@wesm (Member) commented Mar 3, 2021

Regarding the HASH_AGGREGATE function type, one of the inputs on each invocation should be the current hash table cardinality so you do not need to inspect the group ids to infer the cardinality.

@pitrou (Member) left a comment

This is very interesting. Here is an assorted round of comments :-)

groupings.value_offsets(), sorted.make_array());
}

struct ScalarVectorToArray {
Member

Looks like it would be nice to have AppendScalar methods on common builders?
(perhaps even as a virtual function on the base builder class)
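
Something along these lines, perhaps (just a sketch of the suggestion; the exact signature is an assumption):

class ArrayBuilder {
 public:
  /// Append a single Scalar value, which must match this builder's type.
  /// A base implementation could dispatch on the scalar's concrete type;
  /// concrete builders may override it with a faster, type-specific path.
  virtual Status AppendScalar(const Scalar& scalar);

  // ... existing builder methods ...
};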


std::vector<uint8_t*> key_buf_ptrs_;
std::vector<uint32_t> group_ids_batch_;

std::unordered_map<std::string, uint32_t> map_;
Member

Add comments for the non-trivial members here?

}
std::vector<int32_t> offsets_batch_;
std::vector<uint8_t> key_bytes_batch_;
std::vector<uint8_t*> key_buf_ptrs_;
Member

As an optimization, we may also want a std::vector<int64_t> key_null_counts_... though this may not be beneficial.

Member Author

IIUC, this would be supported by using the "count" aggregation

@nealrichardson (Member) commented:

With an assist from @bkietz, I've written a very basic R wrapper that exercises this in aa530cb. It's enough to expose some issues to address, to say nothing of the interface questions.

library(arrow)
library(dplyr)

# The commit uses this option to switch to the group_by compute function
options(arrow.summarize = TRUE)
# If the Arrow aggregation function isn't implemented, or if the Arrow call errors,
# it falls back to pulling the data into R and evaluating in R.

# mtcars is a standard dataset that ships with R
mt <- Table$create(mtcars)
mt %>%
  group_by(cyl) %>%
  summarize(total_hp = sum(hp))
# Warning: Error : NotImplemented: Key of typedouble
# ../src/arrow/compute/function.cc:178  kernel_ctx.status()
# ; pulling data into R
# # A tibble: 3 x 2
#     cyl total_hp
# * <dbl>    <dbl>
# 1     4      909
# 2     6      856
# 3     8     2929

# That's unfortunate. R blurs the distinction for users between integer and double,
# so it's not uncommon to have integer data stored as a float.
# (Also, the error message is missing some whitespace.)

# We can cast that to an integer and try again

mt$cyl <- mt$cyl$cast(int32())
unique(mt$cyl)
# Array
# <int32>
# [
#   6,
#   4,
#   8
# ]

mt %>%
  group_by(cyl) %>%
  summarize(total_hp = sum(hp))
# StructArray
# <struct<: double, : int32>>
# -- is_valid: all not null
# -- child 0 type: double
#   [
#     856,
#     909,
#     2929
#   ]
# -- child 1 type: int64
#   [
#     17179869190,
#     8,
#     0
#   ]

# Alright, it computed and got the same numbers, but the StructArray
# is not valid. Type says int32 but data says int64 and we have misplaced bits

# Let's try a different stock dataset
ir <- Table$create(iris)
ir %>%
  group_by(Species) %>%
  summarize(total_length = sum(Sepal.Length))
# Warning: Error : NotImplemented: Key of typedictionary<values=string, indices=int8, ordered=0>
# ../src/arrow/compute/function.cc:178  kernel_ctx.status()
# ; pulling data into R
# # A tibble: 3 x 2
#   Species    total_length
# * <fct>             <dbl>
# 1 setosa             250.
# 2 versicolor         297.
# 3 virginica          329.

# Hmm. dictionary types really need to be supported.
# Let's work around and cast it to string

ir$Species <- ir$Species$cast(utf8())
unique(ir$Species)
# Array
# <string>
# [
#   "setosa",
#   "versicolor",
#   "virginica"
# ]
ir %>%
  group_by(Species) %>%
  summarize(total_length = sum(Sepal.Length))
# Warning: Error : Invalid: Negative buffer resize: -219443965
# ../src/arrow/buffer.cc:262  buffer->Resize(size)
# ../src/arrow/compute/kernels/aggregate_basic.cc:1005  (_error_or_value9).status()
# ../src/arrow/compute/function.cc:193  executor->Execute(implicitly_cast_args, listener.get())
# ; pulling data into R
# # A tibble: 3 x 2
#   Species    total_length
# * <chr>             <dbl>
# 1 setosa             250.
# 2 versicolor         297.
# 3 virginica          329.

@michalursa (Contributor) commented:

> Before digging into the details too much, my main issue with what I see is that I don't agree with making hash aggregation a callable function through CallFunction. [...]
>
> On the hash aggregation functions themselves, perhaps it makes sense to add a HASH_AGGREGATE function type and define the kernel interface for these functions, then look up these functions using the general dispatch machinery?

I like these points. Let me break what I think we are facing now, and what we should think about for the future, into subtopics. I read the comment above as advocating support for two concepts: a) relational operators (the building blocks of a query execution pipeline or, more generally, a query execution plan) and b) pipelines / streaming processing.

**1. How we got here**

Part of the problem is that we don't have a pressing enough problem just yet: we haven't bitten off a big enough chunk of work to force the appropriate refactoring. The problem I am referring to is pipelining, which we do not support yet. From what I understand, we currently cannot bind multiple operations together into a single processing pipeline, executed with a single Arrow function call, that computes expressions on the fly without persisting their results for the entire set of rows.

Related to pipelining is the fact that this PR does not stream the group-by output. At some level of abstraction (squinting a little), that lets us treat it as a scalar aggregate: we output a single item, which happens to be an entire collection of arrays (variants make this possible).

But to me the bigger problem is that we are mixing together two separate worlds, scalar operators and relational operators, and that muddies the general picture. That mixing happens as a consequence of the code treating group by as a modified scalar aggregate.

**2. Scalar operators vs. relational operators**

One way to think about it is that scalar expressions (compositions of scalar operators) are like the C++ lambdas (e.g. a comparison of two values) passed to C++ STL algorithms / containers (e.g. std::sort), while relational operators correspond to those algorithms / containers themselves (except that they additionally support composability: building execution pipelines / trees / DAGs).
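
To make the analogy concrete (an illustrative snippet, not code from this PR):

#include <algorithm>
#include <vector>

struct Person { int age; };

void SortByAge(std::vector<Person>& people) {
  // The lambda plays the role of a scalar expression: it evaluates a
  // comparison of two values. std::sort plays the role of a relational
  // operator: it drives the algorithm over the whole collection and only
  // calls into the "expression" as needed.
  std::sort(people.begin(), people.end(),
            [](const Person& a, const Person& b) { return a.age < b.age; });
}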

The point of confusion may come from the fact that once you vectorize scalar expressions (which you probably want to do for performance reasons), their interfaces start looking very similar (if not identical) to those of some special relational operators (namely filter, project and scalar aggregate). I claim that the current kernel executor classes are relational operators in disguise: project (a.k.a. compute scalar) and reduce (a.k.a. scalar aggregate).

**Relational operators inside the current group by**

Interestingly, the group by in this PR can itself be treated as a DAG of multiple relational operators with pipelines of some kind connecting them. The input batch is first processed by the code that assigns an integer group id to every row based on the key columns. Its output is then processed by zero, one or more group-by aggregate operators, each updating aggregate-related accumulators in its internal arrays. At the end of the input, the output array of the group-id-mapping component is concatenated with the output arrays of the individual group-by aggregate operators to produce the output collection of arrays. We can treat group-id mapping and the group-by aggregates as separate relational operators, or we can choose to treat them as the internals of a single hash group by operator. Even the hash computation for the hash table lookup (part of group-id mapping) can be treated as a separate processing step that uses a projection operator.

When we talk about relational operators and their connections, we can talk at different levels of granularity. Sometimes it is simpler to treat a building block made of multiple operators with fixed, hard-coded connections between them as a single operator; sometimes treating the smaller blocks separately brings more opportunities for code reuse.
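
For illustration, the decomposition above could be spelled out roughly like this (a hypothetical sketch: GroupedAggregator and the Consume/Finalize signatures are assumptions made for illustration, not the code in this patch):

// Hypothetical sketch of the group-by DAG described above: group-id mapping
// feeds zero or more grouped aggregates, which are finalized and concatenated
// with the unique keys at the end of input.
Result<std::vector<Datum>> NaiveGroupBy(compute::internal::Grouper* grouper,
                                        std::vector<GroupedAggregator*> aggregates,
                                        const std::vector<ExecBatch>& key_batches,
                                        const std::vector<ExecBatch>& arg_batches) {
  for (size_t i = 0; i < key_batches.size(); ++i) {
    // Group-id mapping: assign an integer group id to every row of keys.
    ARROW_ASSIGN_OR_RAISE(Datum group_ids, grouper->Consume(key_batches[i]));
    // Grouped aggregates: update per-group accumulators.
    for (auto* agg : aggregates) {
      RETURN_NOT_OK(agg->Consume(arg_batches[i], group_ids, grouper->num_groups()));
    }
  }
  // Finalize: one output column per aggregate, then the unique key columns.
  std::vector<Datum> out;
  for (auto* agg : aggregates) {
    ARROW_ASSIGN_OR_RAISE(Datum column, agg->Finalize());
    out.push_back(std::move(column));
  }
  ARROW_ASSIGN_OR_RAISE(ExecBatch uniques, grouper->GetUniques());
  out.insert(out.end(), uniques.values.begin(), uniques.values.end());
  return out;
}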

**3. Push and pull**

My personal feeling is that the pull model was good in the early query execution engines, based on processing of a single row at a time and using virtual function calls to switch between relational operators within the query. In my experience, the push model is easier to work with in both modern worlds of query execution: JIT compiled query processing and vectorized query processing.

It may seem abstract right now which model is better, push or pull, but I believe that once you have to actually solve a concrete problem it becomes more obvious which one you prefer.

From my experience, in a vectorized execution model, it is easy to adapt existing push-based implementation to support pull-based interface. I am guessing that the same would be true the other way around.

Also, intuitively, when we think about consuming input, we think: pull (e.g. scanning in-memory data or a file), and when we think about producing output, we think: push (e.g. sending results of the computation over the network).

I would probably recommend: at the lower levels, use whatever model results in simpler, more readable, more natural code; at the higher levels, adapt all interfaces to the push model.

**4. When streaming output is not desired**

It may be a bit forward looking, but in some cases it is beneficial to give one relational operator direct access to internal row storage of another relational operator instead of always assuming that the operators stream their output. Streaming output should always be supported, but in addition to that, the direct access option may be useful.

It’s not unusual to request the output of group by to be sorted on group by key columns. In that case the sort operator, which needs to accumulate all of the input in the buffer before sorting, could work directly on the group by buffers without converting internal row storage of group by to batches and then batches back to internal row storage of sort.

Similarly, window functions and quantiles may require random access to the entire set of input rows (within a group / partition of the data). In that case, again, they may want to work on sort's internal row storage rather than having it streamed out and re-accumulated in their own internal buffers.

**5. Summary**

I think that what we are talking about here has a goal that overlaps with group by but is wider in scope and somewhat separate: a) refactoring the code to introduce the concepts of relational and scalar operators, and b) laying the groundwork for supporting pipelining.

@bkietz bkietz force-pushed the groupby1 branch 2 times, most recently from b09fba2 to 38d3eea on March 17, 2021 21:43
@bkietz bkietz marked this pull request as ready for review March 17, 2021 21:44
@bkietz (Member Author) commented Mar 17, 2021

@pitrou PTAL

@nealrichardson (Member) left a comment

All of the issues I previously identified in #9621 (comment) have been resolved

@pitrou (Member) left a comment

Very nice. Here are a bunch of comments.


namespace internal {

/// Internal use only: streaming group identifier.
Member

If it's internal, why is it exposed in api_aggregate.h? I would expect another header, e.g. compute/group_by_internal.h.

Member Author

These are made available for testing from R, which could not access an _internal header (since it wouldn't be installed)

Member

Hmm, fair enough. But what's the point of calling those from R?

Member

Testing purposes: we found a bunch of issues in earlier iterations by experimenting with this in R.

Member Author

It allowed @nealrichardson to explore the hash aggregate kernels and expose a number of issues. We'll probably remove GroupBy altogether in ARROW-12010

/// Get current unique keys. May be called multiple times.
virtual Result<ExecBatch> GetUniques() = 0;

/// Get the current number of groups.
Member

"groups" or "keys"? Vocabulary should probably be consistent accross docstrings.

Member Author

The number of key(column)s is fixed throughout the lifetime of the Grouper. The number of groups is incremented each time a unique row of keys is encountered.

Member

Hmm, I see. But the docstring above talks about "current unique keys"...

Member Author

I'll include a definition of keys, unique keys, and groups in the doccomment for Grouper
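
For example, the Grouper doccomment could define the vocabulary roughly like this (an illustrative sketch, not the wording that was ultimately committed):

/// \brief Consumes batches of keys and yields batches of group ids.
///
/// Vocabulary:
/// - "key": the tuple of values in a single row of the key columns
/// - "unique key": a key distinct from all keys seen before it
/// - "group": the set of input rows sharing a key; each group is assigned
///   an integer group id in order of first appearance of its unique key
class ARROW_EXPORT Grouper {
  // ...
};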

@bkietz (Member Author) commented Mar 23, 2021

+1, merging
