Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PartitionedDistinctCount aggregation function #5786

Merged

Conversation

Jackie-Jiang
Copy link
Contributor

Add a new PartitionedDistinctCountAggregationFunction to calculate the number of distinct values when values are partitioned for each segment.
This function calculates the exact number of distinct values (using raw value instead of hash code) within the segment, then simply sums up the results from different segments to get the final result.

Copy link
Member

@kishoreg kishoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well done

@@ -123,6 +123,8 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio
return new DistinctCountAggregationFunction(firstArgument);
case DISTINCTCOUNTBITMAP:
return new DistinctCountBitmapAggregationFunction(firstArgument);
case PARTITIONEDDISTINCTCOUNT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to start using _ separators. I think the parser now handles both automatically

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. We should still keep no underscore within the aggregation function type, and parser will remove the underscore from the input function name to match the function type.

break;
case LONG:
long[] longValues = blockValSet.getLongValuesSV();
LongOpenHashSet longSet = aggregationResultHolder.getResult();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought roaring bitmap works for long as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a long version roaring bitmap Roaring64NavigableMap which is still in early stage within the current version of RoaringBitmap 0.8.0 we imported. We can test its performance when we upgrade the RoaringBitmap library for the latest version and change it later as it does not involve any backward-incompatible changes.

@@ -28,6 +28,7 @@
MINMAXRANGE("minMaxRange"),
DISTINCTCOUNT("distinctCount"),
DISTINCTCOUNTBITMAP("distinctCountBitmap"),
PARTITIONEDDISTINCTCOUNT("partitionedDistinctCount"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we change the function name to something like SEGMENTEDDISTINCTCOUT or let PARTITIONEDDISTINCTCOUNT to take a parameter of MergeLevel to indicate segment level or instance level merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Renamed to SEGMENTPARTITIONEDDISTINCTCOUNT

@Jackie-Jiang Jackie-Jiang force-pushed the partitioned_distinct_count branch from fa12da2 to c441c8a Compare August 3, 2020 20:36
@Jackie-Jiang Jackie-Jiang merged commit 657e245 into apache:master Aug 3, 2020
@Jackie-Jiang Jackie-Jiang deleted the partitioned_distinct_count branch August 3, 2020 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants