Fix the data type handling in multi-stage engine by Jackie-Jiang · Pull Request #11453 · apache/pinot

Jackie-Jiang · 2023-08-29T02:34:51Z

Fix the wiring issue for value passing:
- Use internal stored value type within the engine (same as single-stage engine):
  - BOOLEAN: int
  - TIMESTAMP: long
  - BYTES: ByteArray
  - BOOLEAN_ARRAY: int[]
  - TIMESTAMP_ARRAY: long[]
- Only convert the value to external value type when used in UDF
Enhance leaf stage to only convert data types when necessary
Fix the wiring issue for null values
Correctly handle null in FilterOperand

codecov-commenter · 2023-08-29T03:53:11Z

Codecov Report

Merging #11453 (81b67da) into master (9e0752c) will decrease coverage by 0.07%.
Report is 1 commits behind head on master.
The diff coverage is 49.03%.

@@             Coverage Diff              @@
##             master   #11453      +/-   ##
============================================
- Coverage     62.98%   62.92%   -0.07%     
- Complexity     1099     1108       +9     
============================================
  Files          2302     2302              
  Lines        124040   124087      +47     
  Branches      18903    18960      +57     
============================================
- Hits          78131    78084      -47     
- Misses        40362    40460      +98     
+ Partials       5547     5543       -4

Flag	Coverage Δ
integration	`<0.01% <0.00%> (ø)`
integration1	`<0.01% <0.00%> (ø)`
integration2	`0.00% <0.00%> (ø)`
java-11	`62.88% <49.03%> (-0.08%)`	⬇️
java-17	`62.77% <49.03%> (-0.09%)`	⬇️
java-20	`62.77% <49.03%> (-0.09%)`	⬇️
temurin	`62.92% <49.03%> (-0.07%)`	⬇️
unittests	`62.92% <49.03%> (-0.07%)`	⬇️
unittests1	`67.41% <49.03%> (-0.12%)`	⬇️
unittests2	`14.48% <0.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
...ot/core/operator/blocks/InstanceResponseBlock.java	`75.51% <ø> (ø)`
...apache/pinot/query/planner/plannode/ValueNode.java	`38.88% <0.00%> (+7.07%)`	⬆️
.../apache/pinot/query/service/SubmissionService.java	`84.61% <ø> (-3.62%)`	⬇️
.../apache/pinot/common/datablock/DataBlockUtils.java	`27.71% <14.49%> (-12.99%)`	⬇️
...java/org/apache/pinot/common/utils/DataSchema.java	`62.68% <16.92%> (-6.25%)`	⬇️
...pache/pinot/query/planner/plannode/WindowNode.java	`81.57% <40.00%> (-0.93%)`	⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java	`35.71% <40.00%> (-9.75%)`	⬇️
.../pinot/query/runtime/operator/utils/TypeUtils.java	`63.15% <58.82%> (-31.58%)`	⬇️
.../pinot/query/planner/logical/LiteralHintUtils.java	`81.48% <60.00%> (+13.91%)`	⬆️
...sform/function/ScalarTransformFunctionWrapper.java	`81.20% <62.50%> (-0.23%)`	⬇️
... and 25 more

... and 15 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

gortiz · 2023-08-29T06:32:13Z

pinot-common/src/main/java/org/apache/pinot/common/datablock/DataBlockUtils.java

I would recommend to move loops like this one to each own method. That would help the JIT to inline and optimize the loops

All the loops are different (all combination of source type and dest type), and I guess the first thing JIT would do is to inline the method if we extract it out. We may revisit this in a separate thread

gortiz · 2023-08-29T06:36:05Z

pinot-common/src/main/java/org/apache/pinot/common/datablock/DataBlockUtils.java

nit: I know performance is not our priority but:

Depending on numRows it may be better to copy all values like in the not nullable case and then do a second loop where we nullify the null specific rows. Also, we can ask nullBitmap whether all values from rowId to rowId + numRows are null. That should be a very fast operation in roaring bitmaps and in case it happens, we can skip the whole loop.

I actually considered that, then decided not to do that for now because it might introduce much higher overhead for larger fields (e.g. STRING and BYTES). The changes made in this PR (most of boilerplate code is auto-generated by AI :-P) is guaranteed to be better than the old code by using tighter loop and reduce branching. We may open a separate thread for further performance improvement discussion

I don't think the jvm would inline these sub-methods in this one because this one is very large. But you are right, we can discuss about that later

gortiz · 2023-08-29T06:42:20Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java

What is the meaning of internal and external in this context? It would be great to either specify that here or to include a link to some definition

Good point. Added more comments

gortiz · 2023-08-29T06:44:55Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java

It is not clear to my why do we use int as internal boolean value

Added more comments explaining it. The internal stored type for BOOLEAN is INT (similarly the stored type for TIMESTAMP is LONG), and we always use the stored type throughout the engine

gortiz · 2023-08-29T06:46:24Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java

here, for example, we are wasting a lot of memory in case the array is not small. This is only used for multivalued booleans? If that I guess it is not that bad because it shouldn't be used that often

Correct. These conversion happens only when we need to switch between internal type and external type. During query time, that can happen during scalar function invocation and final result rendering.

gortiz · 2023-08-29T06:51:23Z

pinot-core/src/main/java/org/apache/pinot/core/common/datablock/DataBlockBuilder.java

Is there any difference here? Just that the casting is going to be more restrictive, right? I mean, performance should be the same but now it would fail if value is not exactly an Integer

The main intention is not performance, but ensuring the wiring is correct and we can catch wrong wiring easily. Without this PR, a lot of wiring is wrong, but we are not able to catch them.
Added some comment explaining this

gortiz · 2023-08-29T07:18:55Z

...t-query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/AggregateOperator.java

It is not clear to me why we add the conversion here. TBH the problem may be that is not clear to me when a value uses the stored type or the execution type.

What happens if we have two aggregation stages? Something like:

select distinct(c) from ( select A, count() as c from T group by A ) as TableExpr

Do we convert values twice?

Let me add more comments here. This conversion is not for internal/external type, but to solve the type difference between v1 and v2 engine. E.g. SUM in v1 always return double, but might return other types in v2 (standard SQL behavior). We use the TypeUtils as the bridge to convert v1 result to be used by v2.

Actually I found that we do call convertRow() twice, once in executor and once here.. Removing it from here since it is more clear to convert in the executor

gortiz · 2023-08-29T07:30:10Z

...runtime/src/main/java/org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.java

Some utility like isTrueValue(Object) would improve the legibility of these code.

+1, stored type should provide this utility function.

Added utils in BooleanUtils

gortiz · 2023-08-29T07:40:49Z

...ry-runtime/src/main/java/org/apache/pinot/query/runtime/operator/utils/AggregationUtils.java

is this logic correct? I may not have the context, but it looks like this method tries to implement the SQL and logic. In that case null and false is false, but null and true is null. We are not enforcing the later here.

Good catch. Let me fix it

xiangfu0 · 2023-08-29T09:45:14Z

.../src/main/java/org/apache/pinot/query/runtime/operator/operands/TransformOperandFactory.java

Did we missing handle IN and NOT IN or it's not required?

Probably not required. They are either pushed down, or planned as semi join

xiangfu0 · 2023-08-30T07:22:42Z

Huge applause on this unified work!

Jackie-Jiang added enhancement bugfix cleanup multi-stage Related to the multi-stage query engine labels Aug 29, 2023

Jackie-Jiang requested a review from xiangfu0 August 29, 2023 02:34

Jackie-Jiang force-pushed the data_type_wiring branch from da1a019 to cb32040 Compare August 29, 2023 03:13

Jackie-Jiang force-pushed the data_type_wiring branch from cb32040 to 09d4a39 Compare August 29, 2023 05:07

gortiz reviewed Aug 29, 2023

View reviewed changes

gortiz mentioned this pull request Aug 29, 2023

Error detected in base64 query when using multi stage engine #11420

Closed

xiangfu0 reviewed Aug 29, 2023

View reviewed changes

xiangfu0 approved these changes Aug 29, 2023

View reviewed changes

Jackie-Jiang force-pushed the data_type_wiring branch from 09d4a39 to fcab2bf Compare August 30, 2023 01:08

Jackie-Jiang added 3 commits August 29, 2023 20:06

Fix the data type handling in multi-stage engine

1455c5b

Address comments and bugfixes

f9daefe

Fix the cast() scalar function type conversion

81b67da

Jackie-Jiang force-pushed the data_type_wiring branch from fcab2bf to 81b67da Compare August 30, 2023 04:57

Jackie-Jiang merged commit 6c69be4 into apache:master Aug 30, 2023

Jackie-Jiang deleted the data_type_wiring branch August 30, 2023 06:20

KKcorps pushed a commit to KKcorps/incubator-pinot that referenced this pull request Sep 5, 2023

Fix the data type handling in multi-stage engine (apache#11453)

96f79e5

Jackie-Jiang mentioned this pull request Sep 30, 2023

Use string to represent BigDecimal response #11716

Merged

abhioncbr mentioned this pull request Oct 5, 2023

ScalarTransformFunctionWrapper throws exception for Object return type #11743

Open

Conversation

Jackie-Jiang commented Aug 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gortiz Aug 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xiangfu0 commented Aug 30, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Jackie-Jiang commented Aug 29, 2023 •

edited

Loading

codecov-commenter commented Aug 29, 2023 •

edited

Loading

gortiz Aug 29, 2023 •

edited

Loading