ARROW-11591: [C++][Compute] Grouped aggregation #9621
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -306,5 +306,102 @@ Result<Datum> TDigest(const Datum& value, | |
| const TDigestOptions& options = TDigestOptions::Defaults(), | ||
| ExecContext* ctx = NULLPTR); | ||
|
|
||
| namespace internal { | ||
|
|
||
| /// Internal use only: streaming group identifier. | ||
| /// Consumes batches of keys and yields batches of the group ids. | ||
| class ARROW_EXPORT Grouper { | ||
| public: | ||
| virtual ~Grouper() = default; | ||
|
|
||
| /// Construct a Grouper which receives the specified key types | ||
| static Result<std::unique_ptr<Grouper>> Make(const std::vector<ValueDescr>& descrs, | ||
| ExecContext* ctx = default_exec_context()); | ||
|
|
||
| /// Consume a batch of keys, producing the corresponding group ids as an integer array. | ||
|
||
| /// Currently only uint32 indices will be produced; eventually the bit width will only | |
| /// be as wide as necessary. | ||
| virtual Result<Datum> Consume(const ExecBatch& batch) = 0; | ||
|
|
||
| /// Get current unique keys. May be called multiple times. | ||
| virtual Result<ExecBatch> GetUniques() = 0; | ||
|
|
||
| /// Get the current number of groups. | ||
|
"groups" or "keys"? Vocabulary should probably be consistent across docstrings.
The number of key(column)s is fixed throughout the lifetime of the Grouper. The number of groups is incremented each time a unique row of keys is encountered.
Hmm, I see. But the docstring above talks about "current unique keys"...
I'll include a definition of keys, unique keys, and groups in the doc comment for Grouper. |
||
| virtual uint32_t num_groups() const = 0; | ||
|
|
||
| /// \brief Assemble lists of indices of identical elements. | ||
| /// | ||
| /// \param[in] ids An unsigned, all-valid integral array which will be | ||
| /// used as grouping criteria. | ||
| /// \param[in] num_groups An upper bound for the elements of ids | ||
|
||
| /// \return A num_groups-long ListArray where the slot at i contains a | ||
| /// list of indices where i appears in ids. | ||
| /// | ||
| /// MakeGroupings([ | ||
| /// 2, | ||
| /// 2, | ||
| /// 5, | ||
| /// 5, | ||
| /// 2, | ||
| /// 3 | ||
| /// ], 8) == [ | ||
| /// [], | ||
| /// [], | ||
| /// [0, 1, 4], | ||
| /// [5], | ||
| /// [], | ||
| /// [2, 3], | ||
| /// [], | ||
| /// [] | ||
| /// ] | ||
| static Result<std::shared_ptr<ListArray>> MakeGroupings( | ||
| const UInt32Array& ids, uint32_t num_groups, | ||
| ExecContext* ctx = default_exec_context()); | ||
|
|
||
| /// \brief Produce a ListArray whose slots are selections of `array` which correspond to | ||
| /// the provided groupings. | ||
| /// | ||
| /// For example, | ||
| /// ApplyGroupings([ | ||
| /// [], | ||
| /// [], | ||
| /// [0, 1, 4], | ||
| /// [5], | ||
| /// [], | ||
| /// [2, 3], | ||
| /// [], | ||
| /// [] | ||
| /// ], [2, 2, 5, 5, 2, 3]) == [ | ||
| /// [], | ||
| /// [], | ||
| /// [2, 2, 2], | ||
| /// [3], | ||
| /// [], | ||
| /// [5, 5], | ||
| /// [], | ||
| /// [] | ||
| /// ] | ||
| static Result<std::shared_ptr<ListArray>> ApplyGroupings( | ||
| const ListArray& groupings, const Array& array, | ||
| ExecContext* ctx = default_exec_context()); | ||
| }; | ||
|
|
||
| /// \brief Configure a grouped aggregation | ||
| struct ARROW_EXPORT Aggregate { | ||
| /// the name of the aggregation function | ||
| std::string function; | ||
|
|
||
| /// options for the aggregation function | ||
| const FunctionOptions* options; | ||
| }; | ||
|
|
||
| /// Internal use only: helper function for testing HashAggregateKernels. | ||
| /// This will be replaced by streaming execution operators. | ||
| ARROW_EXPORT | ||
| Result<Datum> GroupBy(const std::vector<Datum>& arguments, const std::vector<Datum>& keys, | ||
| const std::vector<Aggregate>& aggregates, | ||
| ExecContext* ctx = default_exec_context()); | ||
|
|
||
| } // namespace internal | ||
| } // namespace compute | ||
| } // namespace arrow | ||
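
The doc comments for `MakeGroupings` and `ApplyGroupings` above describe their results by example. As a rough illustration only, the same semantics can be sketched with plain standard-library containers. This is not the Arrow implementation: `Groupings`, `MakeGroupingsSketch`, and `ApplyGroupingsSketch` are hypothetical names, a list array is modelled as a vector of vectors, and every id is assumed to be strictly less than `num_groups`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a ListArray of indices: one inner vector per group.
using Groupings = std::vector<std::vector<uint32_t>>;

// Slot g of the result lists every position i for which ids[i] == g.
// Assumption of this sketch: every element of ids is < num_groups.
Groupings MakeGroupingsSketch(const std::vector<uint32_t>& ids,
                              uint32_t num_groups) {
  Groupings groupings(num_groups);
  for (uint32_t i = 0; i < ids.size(); ++i) {
    groupings[ids[i]].push_back(i);
  }
  return groupings;
}

// Slot g of the result holds values[i] for each index i in groupings[g],
// i.e. a per-group selection of `values`.
std::vector<std::vector<int>> ApplyGroupingsSketch(
    const Groupings& groupings, const std::vector<int>& values) {
  std::vector<std::vector<int>> out(groupings.size());
  for (std::size_t g = 0; g < groupings.size(); ++g) {
    for (uint32_t i : groupings[g]) {
      out[g].push_back(values[i]);
    }
  }
  return out;
}
```

Feeding the ids `[2, 2, 5, 5, 2, 3]` with `num_groups = 8` into the first function, and then applying the result to the same array, reproduces the two examples given in the doc comments.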
If it's internal, why is it exposed in api_aggregate.h? I would expect another header, e.g. compute/group_by_internal.h.
These are made available for testing from R, which could not access an _internal header (since it wouldn't be installed)
Hmm, fair enough. But what's the point of calling those from R?
Testing purposes--we found a bunch of issues in earlier iterations by experimenting with this in R.
It allowed @nealrichardson to explore the hash aggregate kernels and expose a number of issues. We'll probably remove GroupBy altogether in ARROW-12010.
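
For readers unfamiliar with what the hash aggregate kernels compute, here is a minimal standalone sketch of a grouped sum, using the vocabulary from the review thread: keys are consumed in batches, each new unique key opens a new group, and the aggregation is indexed by group id. Everything here is hypothetical (`ToyGrouper`, `GroupedSum`, a single string key column); in particular, assigning ids in order of first appearance is an assumption of the sketch, not something the diff states about the real `Grouper`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy grouper over a single string key column.
class ToyGrouper {
 public:
  // Consume a batch of keys and return one group id per row.
  // A key seen for the first time opens a new group (assumed id order:
  // order of first appearance).
  std::vector<uint32_t> Consume(const std::vector<std::string>& keys) {
    std::vector<uint32_t> ids;
    ids.reserve(keys.size());
    for (const std::string& key : keys) {
      auto it = key_to_id_.find(key);
      if (it == key_to_id_.end()) {
        const uint32_t new_id = static_cast<uint32_t>(uniques_.size());
        it = key_to_id_.emplace(key, new_id).first;
        uniques_.push_back(key);
      }
      ids.push_back(it->second);
    }
    return ids;
  }

  // Unique keys seen so far; slot g is the key of group g.
  const std::vector<std::string>& uniques() const { return uniques_; }

  // Number of groups so far; grows as new unique keys are consumed.
  uint32_t num_groups() const { return static_cast<uint32_t>(uniques_.size()); }

 private:
  std::map<std::string, uint32_t> key_to_id_;
  std::vector<std::string> uniques_;
};

// Sum values per group. Assumes ids.size() == values.size() and that
// every id is < num_groups.
std::vector<int64_t> GroupedSum(const std::vector<int64_t>& values,
                                const std::vector<uint32_t>& ids,
                                uint32_t num_groups) {
  std::vector<int64_t> sums(num_groups, 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    sums[ids[i]] += values[i];
  }
  return sums;
}
```

The grouper is stateful so that group ids stay stable across batches, which is what lets a streaming consumer keep one accumulator per group.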