Partitioned Distinct/DistinctCount

For high-cardinality columns, the local/intermediate/global merge phases of distinct(count) can be memory- and CPU-heavy, because the merger has to ser/de and merge multiple large sets from the responses. If the distinct(count) column is partitioned so that each partition holds a disjoint set of values, the merger can simply concatenate (for distinct) or add (for distinctcount) the intermediate results. This can significantly reduce the time and memory spent on set ser/de, transmission, and merging. Depending on the partition granularity, the optimization can be applied at different levels of the processing.

<img width="757" alt="Screenshot 2023-03-28 at 8 34 39 PM" src="https://user-images.githubusercontent.com/10736840/228420057-f4957793-1820-4a6b-9974-45ec0fc80190.png">
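A minimal Python sketch of the idea, not Pinot code: it contrasts the general set-union merge with the concatenate/add merge that becomes valid once the intermediate results are known to be disjoint. All function names and the sample data are hypothetical and only for illustration.

```python
from typing import Iterable, List, Set


def merge_by_union(partials: Iterable[Set[str]]) -> Set[str]:
    """General merge: inputs may overlap, so every value is shipped and unioned."""
    merged: Set[str] = set()
    for partial in partials:
        merged |= partial
    return merged


def merge_distinct_partitioned(partials: Iterable[List[str]]) -> List[str]:
    """Partitioned distinct: inputs are disjoint, so concatenation adds no duplicates."""
    merged: List[str] = []
    for partial in partials:
        merged.extend(partial)
    return merged


def merge_distinct_count_partitioned(partial_counts: Iterable[int]) -> int:
    """Partitioned distinctcount: only one integer per input is needed."""
    return sum(partial_counts)


# Hypothetical data: the column is partitioned by first letter,
# and each server owns exactly one partition.
server_a = {"apple", "avocado"}
server_b = {"banana", "blueberry", "blackberry"}

union_count = len(merge_by_union([server_a, server_b]))
partitioned_count = merge_distinct_count_partitioned([len(server_a), len(server_b)])
partitioned_values = merge_distinct_partitioned([sorted(server_a), sorted(server_b)])

# Without disjoint partitions, adding counts would overcount shared values.
overlap_x = {"apple", "banana"}
overlap_y = {"banana", "cherry"}
naive_sum = merge_distinct_count_partitioned([len(overlap_x), len(overlap_y)])
true_count = len(merge_by_union([overlap_x, overlap_y]))
```

In the disjoint case both merges agree, but the partitioned path only moves a count (or an already duplicate-free list) instead of whole sets. The overlapping case shows why the shortcut is only safe when the partitioning guarantees disjoint values.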


Partitioned Distinct/DistinctCount #10499
