Add support for configuring Theta and Tuple aggregation functions #14167

davecromberge · 2024-10-04T16:41:53Z

Applies to StarTree Index

This patch introduces a mechanism for configuring the aggregation function parameters of a star-tree index for Tuple and Theta sketches. Any existing star-tree aggregation whose precision is greater than or equal to the precision requested by the query is selected as a candidate. The behaviour of the CPC aggregator has been changed accordingly.
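
The candidate rule described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the class name, method signature, and the use of a nullable `Integer` for "not configured" are assumptions made for the example, and the default of 16384 is taken from the test comment quoted later in the review.

```java
class CandidateSelectionSketch {
  // Assumed default for a Theta sketch star-tree (lgK = 14 / K = 16384).
  static final int DEFAULT_STAR_TREE_NOMINAL_ENTRIES = 16384;

  /**
   * Returns true when the star-tree sketches are at least as precise as the query requests.
   * A null configured value stands for "no explicit function parameter in the index config".
   */
  static boolean canUseStarTree(int queryNominalEntries, Integer configuredNominalEntries) {
    int starTreeNominalEntries = configuredNominalEntries != null
        ? configuredNominalEntries
        : DEFAULT_STAR_TREE_NOMINAL_ENTRIES;
    // Less-than-or-equal: an index built at higher precision can still serve a lower-precision query.
    return queryNominalEntries <= starTreeNominalEntries;
  }

  public static void main(String[] args) {
    System.out.println(canUseStarTree(4096, null));    // query below the assumed default
    System.out.println(canUseStarTree(32768, 16384));  // query asks for more than the index holds
  }
}
```

Under these assumptions, a query is routed to the star-tree only when doing so cannot lower the precision it asked for.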

This PR can be tagged as a feature.

release-notes:

New function parameter nominalEntries for Theta Sketch StarTree value aggregator
New function parameter nominalEntries for Tuple Sketch StarTree value aggregator


davecromberge · 2024-10-04T16:42:24Z

@yashmayya please could you review this PR when you have a chance.

codecov-commenter · 2024-10-04T17:17:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.90%. Comparing base (59551e4) to head (ed0133b).
Report is 1142 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14167      +/-   ##
============================================
+ Coverage     61.75%   63.90%   +2.15%     
- Complexity      207     1531    +1324     
============================================
  Files          2436     2621     +185     
  Lines        133233   144064   +10831     
  Branches      20636    22052    +1416     
============================================
+ Hits          82274    92066    +9792     
- Misses        44911    45174     +263     
- Partials       6048     6824     +776

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.84% <100.00%> (+2.13%)`	⬆️
java-21	`63.73% <100.00%> (+2.11%)`	⬆️
skip-bytebuffers-false	`63.87% <100.00%> (+2.12%)`	⬆️
skip-bytebuffers-true	`63.71% <100.00%> (+35.98%)`	⬆️
temurin	`63.90% <100.00%> (+2.15%)`	⬆️
unittests	`63.90% <100.00%> (+2.15%)`	⬆️
unittests1	`55.47% <68.42%> (+8.58%)`	⬆️
unittests2	`34.44% <50.00%> (+6.71%)`	⬆️


yashmayya

Thanks @davecromberge, LGTM! I just had a few minor comments and questions.

yashmayya · 2024-10-05T11:15:28Z

.../apache/pinot/core/query/aggregation/function/DistinctCountCPCSketchAggregationFunction.java

    }
+    // Check if the query lgK param is less than or equal to that of the StarTree aggregation


nit: a one-liner explanation on why we're doing a <= check (#13835 (comment)) might be useful for future readers here.

yashmayya · 2024-10-05T11:17:46Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/Constants.java

+  public static final String THETASKETCH_NOMINAL_ENTRIES = "K";
+  public static final String TUPLESKETCH_NOMINAL_ENTRIES = "K";


Would nominalEntries potentially be more user-friendly than K? Or maybe we could accept either. Also, I think we could use a single constant for this (THETA_TUPLE_SKETCH_NOMINAL_ENTRIES) similar to the p key for HLLPLUS / ULL?

The DataSketches library code uses K throughout, but I agree that your suggestion is more user-friendly, and it also aligns with the parameter name that is passed to the aggregation function.

yashmayya · 2024-10-05T11:20:29Z

...pache/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunction.java

+      // the default value for nominal entries
+      starTreeNominalEntries = CommonConstants.Helix.DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES;
+    }
+    // Check if the query lgK param is less than or equal to that of the StarTree aggregation


Suggested change

-    // Check if the query lgK param is less than or equal to that of the StarTree aggregation
+    // Check if the query nominalEntries param is less than or equal to that of the StarTree aggregation

nit: looks like this aggregation function directly takes the K value rather than lgK?

Although I suppose it doesn't matter either way since the comparisons are equivalent.
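
To illustrate why the two comparisons are equivalent: when K is a power of two, K = 2^lgK, and that mapping is strictly increasing, so ordering by lgK and ordering by K always agree. The sketch below is only an illustration; the class and method names are hypothetical and it assumes nominal entries are always powers of two.

```java
class NominalEntriesSketch {
  /** Converts lgK to nominal entries K, assuming K is a power of two. */
  static int toNominalEntries(int lgK) {
    return 1 << lgK;
  }

  /** True when comparing by lgK and comparing by K give the same answer. */
  static boolean sameOrdering(int queryLgK, int starTreeLgK) {
    boolean byLgK = queryLgK <= starTreeLgK;
    boolean byK = toNominalEntries(queryLgK) <= toNominalEntries(starTreeLgK);
    return byLgK == byK;
  }

  public static void main(String[] args) {
    // Check every pair in a plausible range of lgK values.
    boolean allAgree = true;
    for (int q = 4; q <= 26; q++) {
      for (int s = 4; s <= 26; s++) {
        allAgree &= sameOrdering(q, s);
      }
    }
    System.out.println("lgK and K comparisons agree: " + allAgree);
    System.out.println("lgK=14 -> K=" + toNominalEntries(14));
  }
}
```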

yashmayya · 2024-10-05T11:21:24Z

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

+      // index was built with
+      // the default value for nominal entries


Suggested change

-      // index was built with
-      // the default value for nominal entries
+      // index was built with the default value for nominal entries

nit: formatting

yashmayya · 2024-10-05T11:27:03Z

...e/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunctionTest.java

+        List.of(ExpressionContext.forIdentifier("col"),
+            ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=32768"))));
+
+    // Default StarTree lgK = 14 / K=16384


I'm curious - do you know why we chose lgK = 14 as the default value for the star-tree index but lgK = 12 for the query time aggregation function?

Looks like this is not the case for the tuple sketch where the same default value is used across the star-tree index value aggregator and the query aggregation function?

Good observation. The ThetaSketch query-time aggregation function was the earliest implementation of a DataSketches sketch and was used at LinkedIn according to this presentation. That implementation provided a default of lgK=12, which has been preserved.

The StarTree index value aggregator for ThetaSketch was introduced later in this PR. A higher-precision default was chosen for the StarTree so that users who want greater accuracy can get it by overriding the default query parameter value.

Makes sense, thanks for the history!

yashmayya · 2024-10-05T11:30:27Z

...e/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunctionTest.java

+    function = new DistinctCountThetaSketchAggregationFunction(List.of(ExpressionContext.forIdentifier("col"),
+        ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=16384"))));
+
+    Assert.assertTrue(function.canUseStarTree(Map.of()));
+    Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, "16384")));
+    Assert.assertTrue(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 16384)));
+    Assert.assertFalse(function.canUseStarTree(Map.of(Constants.THETASKETCH_NOMINAL_ENTRIES, 8192)));


Shouldn't this be in the CustomK test rather than the DefaultK test?

Or else nominalEntries=4096 if this was intended to test the case where the default value is explicitly set.

You are correct and I have removed these test assertions.
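
The assertions above pass the configured value both as a `String` and as an `Integer`, so the check has to accept either form. A minimal sketch of how that could be handled is shown below; the class, the helper names, and the `"nominalEntries"` key are assumptions for illustration (the key reflects the rename discussed in this review), and this is not the code from the PR.

```java
import java.util.Map;

class FunctionParameterSketch {
  // Hypothetical parameter key and assumed star-tree default (lgK = 14 / K = 16384).
  static final String NOMINAL_ENTRIES_KEY = "nominalEntries";
  static final int DEFAULT_STAR_TREE_NOMINAL_ENTRIES = 16384;

  /** Reads nominal entries from the index function parameters, accepting numbers or numeric strings. */
  static int resolveNominalEntries(Map<String, Object> functionParameters) {
    Object value = functionParameters.get(NOMINAL_ENTRIES_KEY);
    if (value == null) {
      return DEFAULT_STAR_TREE_NOMINAL_ENTRIES;
    }
    if (value instanceof Number) {
      return ((Number) value).intValue();
    }
    return Integer.parseInt(value.toString().trim());
  }

  /** The star-tree is usable when the query asks for no more precision than the index provides. */
  static boolean canUseStarTree(int queryNominalEntries, Map<String, Object> functionParameters) {
    return queryNominalEntries <= resolveNominalEntries(functionParameters);
  }

  public static void main(String[] args) {
    int query = 16384;
    System.out.println(canUseStarTree(query, Map.of()));
    System.out.println(canUseStarTree(query, Map.of(NOMINAL_ENTRIES_KEY, "16384")));
    System.out.println(canUseStarTree(query, Map.of(NOMINAL_ENTRIES_KEY, 8192)));
  }
}
```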

davecromberge · 2024-10-07T06:21:16Z

Thanks for your review, @yashmayya. I've attempted to address your feedback.

yashmayya · 2024-10-07T08:10:41Z

...e/pinot/core/query/aggregation/function/DistinctCountThetaSketchAggregationFunctionTest.java

+        List.of(ExpressionContext.forIdentifier("col"),
+            ExpressionContext.forLiteral(Literal.stringValue("nominalEntries=32768"))));
+
+    // Default StarTree lgK = 14 / K=16384


Makes sense, thanks for the history!

Jackie-Jiang · 2024-10-08T21:52:19Z

@davecromberge Thanks for the contribution! Can you help also update the pinot documentation about this new argument?

davecromberge · 2024-10-16T09:49:09Z

@davecromberge Thanks for the contribution! Can you help also update the pinot documentation about this new argument?

Yes @Jackie-Jiang, I will be happy to get this done.

Refers to PR: apache/pinot#14167

Fix checkstyle violations

20f21a9

yashmayya added feature star-tree index labels Oct 5, 2024

yashmayya approved these changes Oct 5, 2024


davecromberge added 5 commits October 7, 2024 06:59

Rename and consolidate constant

5671705

Make correction to inline documentation

a598538

Formatting corrections to inline documentation

b32b9e0

Remove duplicate test case assertions for custom nominal entries

9a64fdd

Explanation for LEQ check on canUseStarTree for Datasketches

ed0133b

davecromberge requested a review from yashmayya October 7, 2024 06:21

yashmayya approved these changes Oct 7, 2024


yashmayya requested a review from Jackie-Jiang October 7, 2024 08:12

Jackie-Jiang approved these changes Oct 8, 2024


Jackie-Jiang merged commit 7202ead into apache:master Oct 8, 2024
21 checks passed

Jackie-Jiang added the documentation label Oct 8, 2024

davecromberge added a commit to davecromberge/pinot-docs that referenced this pull request Oct 24, 2024

Docs for additional theta and tuple sketch function params

b550d95

Refers to PR: apache/pinot#14167

davecromberge mentioned this pull request Oct 24, 2024

Docs for additional theta and tuple sketch function params pinot-contrib/pinot-docs#388

Merged
