Skip to content

Improve aggregate performance by special casing single group keys #6969

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This is another observation to make aggregation queries go faster

After #6904 the speed of grouping is much faster as much of the per-aggregate state overhead is removed.

Now, for queries with a single primitive type (e.g. Int64) a significant portion of their execution time is the converting the grouping key into the the arrow Row format: https://docs.rs/arrow-row/43.0.0/arrow_row/index.html to compare

An example of such a query is to find the number of interactions with some user id:

SELECT count(*)
FROM t
GROUP BY user_id

Describe the solution you'd like

As @Dandandan suggested on #6800 (comment)

we should also consider avoiding the conversion to the row-format as this likely will be one of the more expensive things now.

So that would mean that for group by with a single column, we could use the (native) representation of that column to store the group values rather than using the Arrow Row format.

This is the same approach taken today by Sort / SortPreservingMerge. It is implemented via a generic FieldCursor:

https://github.com/apache/arrow-datafusion/blob/d316702722e6c301fdb23a9698f7ec415ef548e9/datafusion/core/src/physical_plan/sorts/cursor.rs#L180-L191

Describe alternatives you've considered

@yahoNanJing has also been exploring a fixed width row cursor implementation that is optimized for comparison rather than ordering: apache/arrow-rs#4524

Additional context

There is some discussion in the ASF slack channel as well: https://the-asf.slack.com/archives/C01QUFS30TD/p1689301661826909

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions