Description
Is your feature request related to a problem or challenge?
Currently, DF uses a RawTable from hashbrown as the hash table implementation in group aggregations. This requires first converting the input batches into a row format and then processing the converted rows one by one: hash probing, equality checks, and creating new entries as needed.
A different approach, discussed in the Photon paper and also used by DuckDB, is a vectorized hash table design in which each of the above steps can be vectorized. In addition, this lets us skip the row conversion and operate directly on the input batches.
Internally, we have a draft implementation of this, and it has shown considerable improvements over the current hash aggregation approach (even without SIMD, although with a lot of unsafe code 😂), so we'd like to contribute it to DF and see if it can improve aggregation performance even further.
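To make the batch-at-a-time idea concrete, here is a minimal sketch of what such a table could look like. It is not the draft implementation mentioned above and not an existing DF API: VectorizedGroupTable, intern, and hash_i64 are hypothetical names, the key is assumed to be a single i64 column passed as a plain slice rather than an Arrow array, the hash mix is a placeholder, and collisions are resolved with simple linear probing. Each step (hashing, probing, comparing) runs as its own pass over the batch, and only rows whose candidate did not match are retried.

```rust
/// Marker for an unused slot.
const EMPTY: u32 = u32::MAX;

/// Placeholder 64-bit mix; a real implementation would use a stronger hash.
fn hash_i64(v: i64) -> u64 {
    let mut x = v as u64;
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
    x ^ (x >> 33)
}

/// Hypothetical vectorized group table over one i64 key column.
pub struct VectorizedGroupTable {
    /// Slot -> group id (or EMPTY). Length is always a power of two.
    slots: Vec<u32>,
    /// Group id -> key, stored column-wise (no row format).
    keys: Vec<i64>,
}

impl VectorizedGroupTable {
    pub fn new() -> Self {
        Self { slots: vec![EMPTY; 16], keys: Vec::new() }
    }

    pub fn num_groups(&self) -> usize {
        self.keys.len()
    }

    pub fn keys(&self) -> &[i64] {
        &self.keys
    }

    /// Grow so the load factor stays at or below 50% even if every
    /// incoming row turns out to be a new group.
    fn reserve(&mut self, additional: usize) {
        let needed = (self.keys.len() + additional) * 2;
        if needed <= self.slots.len() {
            return;
        }
        let cap = needed.next_power_of_two();
        let mask = cap - 1;
        self.slots = vec![EMPTY; cap];
        for (gid, &key) in self.keys.iter().enumerate() {
            let mut slot = hash_i64(key) as usize & mask;
            while self.slots[slot] != EMPTY {
                slot = (slot + 1) & mask;
            }
            self.slots[slot] = gid as u32;
        }
    }

    /// Map every row of `batch` to a group id using column-wise passes.
    pub fn intern(&mut self, batch: &[i64]) -> Vec<u32> {
        self.reserve(batch.len());
        let mask = self.slots.len() - 1;
        // Pass 1: hash the whole batch and derive each row's home slot.
        let mut pos: Vec<usize> =
            batch.iter().map(|&k| hash_i64(k) as usize & mask).collect();
        let mut group_ids = vec![EMPTY; batch.len()];
        let mut pending: Vec<usize> = (0..batch.len()).collect();
        let mut retry = Vec::new();
        while !pending.is_empty() {
            // Pass 2: probe. Claim empty slots as new groups; otherwise
            // record the occupant as the candidate group for the row.
            for &row in &pending {
                let slot = pos[row];
                if self.slots[slot] == EMPTY {
                    self.slots[slot] = self.keys.len() as u32;
                    self.keys.push(batch[row]);
                }
                group_ids[row] = self.slots[slot];
            }
            // Pass 3: compare keys. Mismatching rows advance one slot
            // (linear probing) and are retried in the next round.
            retry.clear();
            for &row in &pending {
                if self.keys[group_ids[row] as usize] != batch[row] {
                    pos[row] = (pos[row] + 1) & mask;
                    retry.push(row);
                }
            }
            std::mem::swap(&mut pending, &mut retry);
        }
        group_ids
    }
}
```

Calling intern on a batch returns one group id per input row, with equal keys mapping to the same id across batches. The order in which ids are assigned within a batch depends on collisions, so callers should not assume first-seen ordering. The passes over pos, group_ids, and keys are tight loops over flat arrays, which is the shape that makes later SIMD work possible.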
Describe the solution you'd like
Design and implement a separate vectorized hash table. It can either replace the existing RawTable inside GroupValuesRows, or be provided as a separate GroupValues implementation.
Describe alternatives you've considered
Not to implement this.
Additional context
No response