[Website] Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0 #386
Merged
alamb merged 4 commits into apache:main on Aug 14, 2023
tustvold approved these changes on Aug 8, 2023
Dandandan approved these changes on Aug 9, 2023
Contributor
Author
I plan to publish this sometime early next week (2023-08-14 or so), to ensure there has been at least a week for anyone who is interested to review.
Here is the discussion on the mailing list: https://lists.apache.org/thread/4lyk9jycr0o6qv5zo5bsw2q9mvvdsp7z
Please let me know if anyone would like additional time to review.
yjshen reviewed on Sep 10, 2023
| allocation using the arrow Row format |
| ``` |
| **Figure 5**: Hash group operator structure in DataFusion `28.0.0`. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single `GroupsAccumulator` stores the per-aggregate state for _all_ groups. |
Member
Primitive group values are also stored in a single allocation using `Vec<T::Native>`, not directly in the hash table?
Contributor
This was a later modification - apache/datafusion#7043
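For readers following this thread, the layout discussed above (group values in one contiguous allocation, a hash table that resolves each key to a group index, and a single accumulator holding state for all groups) can be sketched roughly as follows. This is a simplified illustration, not DataFusion's code: `GroupBySum` and `SumAccumulator` are hypothetical names, the `update_batch` signature is only loosely modeled on the `GroupsAccumulator` idea, and the standard library `HashMap` stores a copy of each key alongside the index, whereas the operator described in the caption keeps only group indexes in its table.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a per-aggregate accumulator:
/// one running SUM per group, all held in a single Vec.
struct SumAccumulator {
    sums: Vec<i64>,
}

impl SumAccumulator {
    /// Adds each value to the state of the group it belongs to.
    fn update_batch(&mut self, values: &[i64], group_indexes: &[usize], total_groups: usize) {
        // Grow the state so every group index seen so far has a slot.
        self.sums.resize(total_groups, 0);
        for (&v, &g) in values.iter().zip(group_indexes) {
            self.sums[g] += v;
        }
    }
}

/// Hypothetical grouping operator for `SELECT key, SUM(value) ... GROUP BY key`.
struct GroupBySum {
    /// Distinct group keys, stored contiguously; the position is the group index.
    group_values: Vec<i64>,
    /// Maps a key to its group index (simplification: the key is duplicated here).
    index_of: HashMap<i64, usize>,
    /// State for all groups lives in one accumulator.
    acc: SumAccumulator,
}

impl GroupBySum {
    fn new() -> Self {
        Self {
            group_values: Vec::new(),
            index_of: HashMap::new(),
            acc: SumAccumulator { sums: Vec::new() },
        }
    }

    /// Processes one batch: resolve a group index per row, then update the accumulator once.
    fn push_batch(&mut self, keys: &[i64], values: &[i64]) {
        let mut group_indexes = Vec::with_capacity(keys.len());
        for &k in keys {
            let next = self.group_values.len();
            let idx = *self.index_of.entry(k).or_insert(next);
            if idx == next {
                // First time this key is seen: it becomes a new group.
                self.group_values.push(k);
            }
            group_indexes.push(idx);
        }
        self.acc.update_batch(values, &group_indexes, self.group_values.len());
    }

    /// Returns (key, sum) pairs in group index order.
    fn finish(&self) -> Vec<(i64, i64)> {
        self.group_values
            .iter()
            .copied()
            .zip(self.acc.sums.iter().copied())
            .collect()
    }
}
```

Feeding this sketch the batches `([1, 2, 1], [10, 20, 30])` and `([3, 2], [5, 7])` yields the groups `(1, 40)`, `(2, 27)`, `(3, 5)`. The accumulator is called once per batch with a vector of group indexes rather than once per row or per group, which is the property the caption highlights.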
Closes apache/datafusion#6988
Note: This describes work @tustvold, @Dandandan, and I did in DataFusion 28.0.0. This content was originally published on the InfluxData Blog, but since it is generally applicable to Apache Arrow DataFusion I would like to syndicate it here because:
- This is the same model we followed with https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ which was also republished on the Arrow blog after the InfluxData blog.
- It also gives me an example to use my original ASCII art diagrams :)