Add support for configuring Theta and Tuple aggregation functions #14167

Merged

Conversation

davecromberge
Member

@davecromberge davecromberge commented Oct 4, 2024

Applies to StarTree Index

This patch introduces a mechanism for configuring the aggregation function parameters of a star-tree index for Tuple and Theta sketches. Any existing star-tree aggregation whose precision is greater than or equal to the query precision is selected as a candidate. The behaviour of the CPC aggregator has been changed accordingly.
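
To illustrate the candidate-selection rule described above, here is a minimal Java sketch. It is not Pinot's implementation: the class `NominalEntriesCheck`, its method signature, and the way the configured value is parsed are assumptions made for illustration. Only the `nominalEntries` parameter name and the star-tree default of 16384 (lgK = 14) are taken from this PR's discussion.

```java
import java.util.Map;

// Hypothetical helper (not a Pinot class): decides whether a star-tree
// pre-aggregation is precise enough to serve a query.
final class NominalEntriesCheck {
  // Parameter name from the release notes; default from the review discussion.
  static final String NOMINAL_ENTRIES_KEY = "nominalEntries";
  static final int DEFAULT_STAR_TREE_NOMINAL_ENTRIES = 16384; // lgK = 14

  private NominalEntriesCheck() {
  }

  /** Returns true when the star-tree sketch is at least as precise as the query requires. */
  static boolean canUseStarTree(Map<String, Object> functionParameters, int queryNominalEntries) {
    Object value = functionParameters.get(NOMINAL_ENTRIES_KEY);
    // The configured value may arrive as a String ("16384") or a Number (16384).
    int starTreeNominalEntries = value == null
        ? DEFAULT_STAR_TREE_NOMINAL_ENTRIES
        : Integer.parseInt(String.valueOf(value));
    return queryNominalEntries <= starTreeNominalEntries;
  }

  public static void main(String[] args) {
    // Star-tree built with defaults, query asks for K=4096.
    System.out.println(canUseStarTree(Map.of(), 4096));
    // Star-tree and query both use K=16384.
    System.out.println(canUseStarTree(Map.of(NOMINAL_ENTRIES_KEY, "16384"), 16384));
    // Star-tree is coarser (K=8192) than the query (K=16384).
    System.out.println(canUseStarTree(Map.of(NOMINAL_ENTRIES_KEY, 8192), 16384));
  }
}
```

The check uses `<=` because a sketch built with a larger K can serve a query that asks for less precision, whereas a coarser sketch cannot recover precision it never stored.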

This PR can be tagged as a feature.

release-notes:

  • New function parameter nominalEntries for Theta Sketch StarTree value aggregator
  • New function parameter nominalEntries for Tuple Sketch StarTree value aggregator

…ms in the ST index

@davecromberge
Member Author

@yashmayya please could you review this PR when you have a chance.

@codecov-commenter

codecov-commenter commented Oct 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.90%. Comparing base (59551e4) to head (ed0133b).
Report is 1142 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14167      +/-   ##
============================================
+ Coverage     61.75%   63.90%   +2.15%     
- Complexity      207     1531    +1324     
============================================
  Files          2436     2621     +185     
  Lines        133233   144064   +10831     
  Branches      20636    22052    +1416     
============================================
+ Hits          82274    92066    +9792     
- Misses        44911    45174     +263     
- Partials       6048     6824     +776     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.84% <100.00%> (+2.13%) ⬆️
java-21 63.73% <100.00%> (+2.11%) ⬆️
skip-bytebuffers-false 63.87% <100.00%> (+2.12%) ⬆️
skip-bytebuffers-true 63.71% <100.00%> (+35.98%) ⬆️
temurin 63.90% <100.00%> (+2.15%) ⬆️
unittests 63.90% <100.00%> (+2.15%) ⬆️
unittests1 55.47% <68.42%> (+8.58%) ⬆️
unittests2 34.44% <50.00%> (+6.71%) ⬆️


Collaborator

@yashmayya yashmayya left a comment

Thanks @davecromberge, LGTM! I just had a few minor comments and questions.

}
// Check if the query lgK param is less than or equal to that of the StarTree aggregation
Collaborator

nit: a one-liner explanation on why we're doing a <= check (#13835 (comment)) might be useful for future readers here.

Member Author

done.

Comment on lines 32 to 33
public static final String THETASKETCH_NOMINAL_ENTRIES = "K";
public static final String TUPLESKETCH_NOMINAL_ENTRIES = "K";
Collaborator

Would nominalEntries potentially be more user-friendly than K? Or maybe we could accept either. Also, I think we could use a single constant for this (THETA_TUPLE_SKETCH_NOMINAL_ENTRIES) similar to the p key for HLLPLUS / ULL?

Member Author

The datasketches library code uses K throughout, but I agree that your suggestion is more user-friendly, and it also aligns with the parameter name that is passed to the aggregation function.

// the default value for nominal entries
starTreeNominalEntries = CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES;
}
// Check if the query lgK param is less than or equal to that of the StarTree aggregation
Collaborator

Suggested change
- // Check if the query lgK param is less than or equal to that of the StarTree aggregation
+ // Check if the query nominalEntries param is less than or equal to that of the StarTree aggregation

nit: looks like this aggregation function directly takes the K value rather than lgK?

Collaborator

Although I suppose it doesn't matter either way since the comparisons are equivalent.
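
The equivalence noted above follows from K = 2^lgK being strictly increasing in lgK, so comparing K values and comparing lgK values always agree. The following is a small sketch of that point, assuming K is a power of two; the class and method names are hypothetical and not part of Pinot or DataSketches.

```java
// Illustrative only: comparing K values gives the same answer as comparing
// lgK values when K is a power of two (K = 2^lgK).
final class LgKEquivalence {
  private LgKEquivalence() {
  }

  /** Returns log2(k) for a positive power of two, e.g. 16384 -> 14. */
  static int lgK(int k) {
    if (k <= 0 || Integer.bitCount(k) != 1) {
      throw new IllegalArgumentException("k must be a positive power of two: " + k);
    }
    return Integer.numberOfTrailingZeros(k);
  }

  /** True when the K comparison and the lgK comparison agree. */
  static boolean sameOrdering(int queryK, int starTreeK) {
    return (queryK <= starTreeK) == (lgK(queryK) <= lgK(starTreeK));
  }

  public static void main(String[] args) {
    int[] ks = {4096, 8192, 16384, 32768};
    for (int queryK : ks) {
      for (int starTreeK : ks) {
        System.out.println(queryK + " vs " + starTreeK + ": " + sameOrdering(queryK, starTreeK));
      }
    }
  }
}
```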

Comment on lines 289 to 290
// index was built with
// the default value for nominal entries
Collaborator

Suggested change
- // index was built with
- // the default value for nominal entries
+ // index was built with the default value for nominal entries

nit: formatting

List.of(ExpressionContext.forIdentifier("col"),
ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=32768"))));

// Default StarTree lgK = 14 / K=16384
Collaborator

I'm curious - do you know why we chose lgK = 14 as the default value for the star-tree index but lgK = 12 for the query time aggregation function?

Collaborator

Looks like this is not the case for the tuple sketch where the same default value is used across the star-tree index value aggregator and the query aggregation function?

Member Author

Good observation. The ThetaSketch query-time aggregation function was the earliest implementation of a DataSketches sketch and was used at LinkedIn, according to this presentation. That implementation provided a default of lgK=12, which has been preserved.

The StarTree index value aggregator for ThetaSketch was introduced later in this PR. The default chosen for the StarTree index has higher precision, so users who want greater accuracy can get it by overriding the default query parameter value.
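
As a rough illustration of the precision gap between the two defaults, the sketch below uses the commonly quoted approximation that a Theta sketch's relative standard error is about 1/sqrt(K). Treat the formula as an approximation rather than the exact DataSketches error bound; the class and method names are hypothetical.

```java
// Illustrative only: approximate relative standard error (RSE) of a Theta
// sketch as 1 / sqrt(K). This is an assumption-level approximation.
final class ThetaErrorEstimate {
  private ThetaErrorEstimate() {
  }

  static double approxRelativeStandardError(int nominalEntries) {
    if (nominalEntries <= 0) {
      throw new IllegalArgumentException("nominalEntries must be positive");
    }
    return 1.0 / Math.sqrt(nominalEntries);
  }

  public static void main(String[] args) {
    // Query-time default discussed above: lgK = 12, K = 4096.
    System.out.printf("K=4096  -> RSE ~ %.4f%%%n", 100 * approxRelativeStandardError(4096));
    // Star-tree default discussed above: lgK = 14, K = 16384.
    System.out.printf("K=16384 -> RSE ~ %.4f%%%n", 100 * approxRelativeStandardError(16384));
  }
}
```

Under this approximation, moving from K=4096 to K=16384 halves the error (about 1.56% versus 0.78%) at four times the nominal entries.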

Collaborator

Makes sense, thanks for the history!

Comment on lines 43 to 49
function = new DistinctCountThetaSketchAggregationFunction(List.of(ExpressionContext.forIdentifier("col"),
ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=16384"))));

Assert.assertTrue(function.canUseStarTree(Map.of()));
Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, "16384")));
Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 16384)));
Assert.assertFalse(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 8192)));
Collaborator

Shouldn't this be in the CustomK test rather than the DefaultK test?

Collaborator

Or else nominalEntries=4096 if this was intended to test the case where the default value is explicitly set.

Member Author

You are correct and I have removed these test assertions.
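
The test snippets in this thread pass the query-side precision as a string literal such as `nominalEntries=16384`, and an omitted value falls back to the query default of 4096 (lgK = 12). The following sketch shows one way such a literal could be read. It is a hypothetical parser, not Pinot's: the class name, the `;` separator between pairs, and the lenient whitespace handling are all assumptions.

```java
// Hypothetical parser (not Pinot's): extracts nominalEntries from a
// "key=value" parameter literal, falling back to the query-time default.
final class NominalEntriesLiteral {
  static final int DEFAULT_QUERY_NOMINAL_ENTRIES = 4096; // lgK = 12

  private NominalEntriesLiteral() {
  }

  static int parse(String literal) {
    if (literal == null || literal.trim().isEmpty()) {
      return DEFAULT_QUERY_NOMINAL_ENTRIES;
    }
    // Assumed format: pairs separated by ';', each pair written as key=value.
    for (String pair : literal.split(";")) {
      String[] keyValue = pair.split("=", 2);
      if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase("nominalEntries")) {
        return Integer.parseInt(keyValue[1].trim());
      }
    }
    return DEFAULT_QUERY_NOMINAL_ENTRIES;
  }

  public static void main(String[] args) {
    System.out.println(parse("nominalEntries=16384"));
    System.out.println(parse(""));
  }
}
```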

@davecromberge
Member Author

Thanks for your review, @yashmayya. I've attempted to address your feedback.


@yashmayya yashmayya requested a review from Jackie-Jiang October 7, 2024 08:12
@Jackie-Jiang Jackie-Jiang merged commit 7202ead into apache:master Oct 8, 2024
21 checks passed
@Jackie-Jiang
Contributor

@davecromberge Thanks for the contribution! Can you help also update the pinot documentation about this new argument?

@davecromberge
Member Author

> @davecromberge Thanks for the contribution! Can you help also update the pinot documentation about this new argument?

Yes @Jackie-Jiang, I will be happy to get this done.
