
Evaluate vectorized hash table for group aggregation #7095

Closed
@sunchao

Description


Is your feature request related to a problem or challenge?

Currently DataFusion (DF) uses a RawTable from hashbrown as the hash table implementation in group aggregations. This requires first converting the input batches into a row format, and then processing the converted rows one by one: hash probing, equality checks, and creating new entries as needed.
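To make the row-at-a-time flow concrete, here is a minimal sketch. It is not DF's actual code: it assumes `i64` key columns, swaps in `std::collections::HashMap` for hashbrown's RawTable, and the names `encode_row` and `intern_rows` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical row encoding: concatenate the little-endian bytes of every
/// key column for one row. This stands in for the real row format.
fn encode_row(columns: &[&[i64]], row: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(columns.len() * 8);
    for col in columns {
        buf.extend_from_slice(&col[row].to_le_bytes());
    }
    buf
}

/// Assigns a group id to every input row, one row at a time.
/// Assumes all columns have the same length.
pub fn intern_rows(
    map: &mut HashMap<Vec<u8>, usize>,
    columns: &[&[i64]],
) -> Vec<usize> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut groups = Vec::with_capacity(num_rows);
    for row in 0..num_rows {
        // Step 1: convert the columnar input into a row.
        let key = encode_row(columns, row);
        // Step 2: hash probe + equality check + insert-if-missing,
        // all performed for this single row before moving to the next.
        let next_id = map.len();
        let id = *map.entry(key).or_insert(next_id);
        groups.push(id);
    }
    groups
}
```

Every row pays for its own encoding, probe, and comparison, which is the per-row overhead this issue is about.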

A different approach, discussed in the Photon paper (and also used by DuckDB), is to adopt a vectorized hash table design so that each of the above steps can be vectorized. In addition, this allows us to skip the row conversion and operate directly on the input batches.
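A rough sketch of what "vectorizing each step" could look like, for a single `i64` key column with linear probing. This is an illustration only, not the draft implementation mentioned below nor Photon's or DuckDB's design; `VectorizedGroupTable`, `hash_key`, and the 50% load factor are all assumptions made for the example.

```rust
/// Marker for an unoccupied slot (so at most u32::MAX - 1 groups).
const EMPTY: u32 = u32::MAX;

/// Hypothetical multiplicative hash for a single i64 key.
fn hash_key(k: i64) -> u64 {
    let h = (k as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Open-addressing table: `slots` maps a position to a group id,
/// `keys` stores the distinct group keys in columnar form.
pub struct VectorizedGroupTable {
    slots: Vec<u32>,
    keys: Vec<i64>,
    mask: usize,
}

impl VectorizedGroupTable {
    pub fn with_capacity(cap: usize) -> Self {
        let n = (cap * 2).max(8).next_power_of_two();
        Self { slots: vec![EMPTY; n], keys: Vec::new(), mask: n - 1 }
    }

    pub fn len(&self) -> usize { self.keys.len() }
    pub fn is_empty(&self) -> bool { self.keys.is_empty() }
    pub fn keys(&self) -> &[i64] { &self.keys }

    /// Grow so the load factor stays <= 50% even if every row is a new group.
    fn reserve(&mut self, additional: usize) {
        let needed = (self.keys.len() + additional) * 2;
        if needed <= self.slots.len() { return; }
        let n = needed.next_power_of_two();
        self.slots = vec![EMPTY; n];
        self.mask = n - 1;
        for (gid, &k) in self.keys.iter().enumerate() {
            let mut p = hash_key(k) as usize & self.mask;
            while self.slots[p] != EMPTY { p = (p + 1) & self.mask; }
            self.slots[p] = gid as u32;
        }
    }

    /// Writes the group id of every row of `batch` into `groups`.
    pub fn intern(&mut self, batch: &[i64], groups: &mut Vec<usize>) {
        groups.clear();
        groups.resize(batch.len(), 0);
        self.reserve(batch.len());
        let mask = self.mask;
        // Pass 1: hash the whole column and compute initial probe positions.
        let mut pos: Vec<usize> =
            batch.iter().map(|&k| hash_key(k) as usize & mask).collect();
        // Rows that still need a group id.
        let mut pending: Vec<usize> = (0..batch.len()).collect();
        let mut next = Vec::new();
        while !pending.is_empty() {
            // Pass 2: probe; rows that land on an empty slot create a group.
            for &row in &pending {
                let p = pos[row];
                if self.slots[p] == EMPTY {
                    self.slots[p] = self.keys.len() as u32;
                    self.keys.push(batch[row]);
                }
            }
            // Pass 3: equality check against the stored keys; mismatching
            // rows advance one slot and are retried in the next round.
            for &row in &pending {
                let gid = self.slots[pos[row]] as usize;
                if self.keys[gid] == batch[row] {
                    groups[row] = gid;
                } else {
                    pos[row] = (pos[row] + 1) & mask;
                    next.push(row);
                }
            }
            std::mem::swap(&mut pending, &mut next);
            next.clear();
        }
    }
}
```

Each pass is a tight loop over the batch with no row conversion, which is the shape that lends itself to SIMD later. Note that group ids here are not guaranteed to follow first-appearance order when keys collide.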

Internally we have a draft implementation of this, and it has shown considerable improvements (even without SIMD, although with a lot of unsafes 😂) over the current hash aggregation approach, so we'd like to contribute it to DF and see if it can improve its aggregation performance even further.

Describe the solution you'd like

Design and implement a separate vectorized hash table. It can either replace the existing RawTable inside GroupValuesRows, or live in a separate GroupValues implementation.

Describe alternatives you've considered

Not to implement this.

Additional context

No response

Metadata

    Labels

    enhancement: New feature or request
