Description
Is your feature request related to a problem or challenge?
This is another observation to make aggregation queries go faster
After #6904 the speed of grouping is much faster as much of the per-aggregate state overhead is removed.
Now, for queries with a single primitive type (e.g. Int64) a significant portion of their execution time is the converting the grouping key into the the arrow Row
format: https://docs.rs/arrow-row/43.0.0/arrow_row/index.html to compare
An example of such a query is to find the number of interactions with some user id:
SELECT count(*)
FROM t
GROUP BY user_id
Describe the solution you'd like
As @Dandandan suggested on #6800 (comment)
we should also consider avoiding the conversion to the row-format as this likely will be one of the more expensive things now.
So that would mean that for group by with a single column, we could use the (native) representation of that column to store the group values rather than using the Arrow Row format.
This is the same approach taken today by Sort / SortPreservingMerge. It is implemented via a generic FieldCursor
:
Describe alternatives you've considered
@yahoNanJing has also been exploring a fixed width row cursor implementation that is optimized for comparison rather than ordering: apache/arrow-rs#4524
Additional context
There is some discussion in the ASF slack channel as well: https://the-asf.slack.com/archives/C01QUFS30TD/p1689301661826909