Skip to content

Improve grouping performance by special casing small / fixed size keys #846

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The improved grouping algorithm on #790 improves grouping performance in general for DataFusion and is also general in that it works for all types of keys.

However, @sundy-li noted on #790 (comment) that additional performance is likely possible by special casing "small" and fixed sized keys.

Describe the solution you'd like
From @sundy-li ' comment:

Introduce the variant hash methods would help in this case.
E.G:

Query which group by 3 columns, which are [u8, u8, u16], a fixed hash key U32 will be enough.

  1. We can allocate one large fixed memory than multiple vec allocate.
  2. The fixed key saves the hash map memory size.

Refer:
https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by.rs#L17-L36

https://github.com/datafuselabs/datafuse/blob/master/common/datablocks/src/kernels/data_block_group_by_hash.rs#L264-L274

Alternate Ideas

@Dandandan also suggests that for small ranges / data types we can even avoid using a hash table and move to direct indexing instead. That might be interesting for u8 values or small dictionaries.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions