-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for configuring Theta and Tuple aggregation functions #14167
Add support for configuring Theta and Tuple aggregation functions #14167
Conversation
…ms in the ST index This patch introduces a mechanism to allow configuring the aggregation function parameters for a star-tree index for Tuple and Theta sketches. Any existing aggregation that has a precision greater or equal to that of the query precision is selected as a candidate. The behaviour of the CPC aggregator has been changed accordingly.
@yashmayya please could you review this PR when you have a chance. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #14167 +/- ##
============================================
+ Coverage 61.75% 63.90% +2.15%
- Complexity 207 1531 +1324
============================================
Files 2436 2621 +185
Lines 133233 144064 +10831
Branches 20636 22052 +1416
============================================
+ Hits 82274 92066 +9792
- Misses 44911 45174 +263
- Partials 6048 6824 +776
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @davecromberge, LGTM! I just had a few minor comments and questions.
} | ||
// Check if the query lgK param is less than or equal to that of the StarTree aggregation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a one-liner explanation on why we're doing a <= check (#13835 (comment)) might be useful for future readers here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
public static final String THETASKETCH_NOMINAL_ENTRIES = "K"; | ||
public static final String TUPLESKETCH_NOMINAL_ENTRIES = "K"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would nominalEntries
potentially be more user-friendly than K
? Or maybe we could accept either. Also, I think we could use a single constant for this (THETA_TUPLE_SKETCH_NOMINAL_ENTRIES
) similar to the p
key for HLLPLUS
/ ULL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The datasketches library code uses K throughout but I agree that it is more user friendly to use your suggestion and it also aligns with the parameter name that is passed to the aggregation function.
// the default value for nominal entries | ||
starTreeNominalEntries = CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES; | ||
} | ||
// Check if the query lgK param is less than or equal to that of the StarTree aggregation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Check if the query lgK param is less than or equal to that of the StarTree aggregation | |
// Check if the query nominalEntries param is less than or equal to that of the StarTree aggregation |
nit: looks like this aggregation function directly takes the K
value rather than lgK
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I suppose it doesn't matter either way since the comparisons are equivalent.
// index was built with | ||
// the default value for nominal entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// index was built with | |
// the default value for nominal entries | |
// index was built with the default value for nominal entries |
nit: formatting
List.of(ExpressionContext.forIdentifier("col"), | ||
ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=32768")))); | ||
|
||
// Default StarTree lgK = 14 / K=16384 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious - do you know why we chose lgK = 14
as the default value for the star-tree index but lgK = 12
for the query time aggregation function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is not the case for the tuple sketch where the same default value is used across the star-tree index value aggregator and the query aggregation function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. The ThetaSketch query time aggregation function was the earliest implementation of a Datasketches sketch and was used at Linkedin according to this presentation. This implementation provided a default of lgK=12
which has been preserved.
The StarTree index value aggregator for ThetaSketch was introduced later in this PR. The default chosen for the StarTree was of higher precision to allow the user greater accuracy should this be desired by overriding the default query parameter value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the history!
function = new DistinctCountThetaSketchAggregationFunction(List.of(ExpressionContext.forIdentifier("col"), | ||
ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=16384")))); | ||
|
||
Assert.assertTrue(function.canUseStarTree(Map.of())); | ||
Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, "16384"))); | ||
Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 16384))); | ||
Assert.assertFalse(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 8192))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be in the CustomK
test rather than the DefaultK
test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or else nominalEntries=4096
if this was intended to test the case where the default value is explicitly set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct and I have removed these test assertions.
Thanks for your review @yashmayya I've attempted to address your feedback. |
List.of(ExpressionContext.forIdentifier("col"), | ||
ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=32768")))); | ||
|
||
// Default StarTree lgK = 14 / K=16384 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the history!
@davecromberge Thanks for the contribution! Can you help also update the pinot documentation about this new argument? |
Yes @Jackie-Jiang will be happy to get this done. |
Applies to StarTree Index
This patch introduces a mechanism to allow configuring the aggregation function parameters for a star-tree index for Tuple and Theta sketches. Any existing aggregation that has a precision greater or equal to that of the query precision is selected as a candidate. The behaviour of the CPC aggregator has been changed accordingly.
This PR can be tagged as a feature.
release-notes
:nominalEntries
for Theta Sketch StarTree value aggregatornominalEntries
for Tuple Sketch StarTree value aggregator