Skip to content

Fix the data type handling in multi-stage engine#11453

Merged
Jackie-Jiang merged 3 commits intoapache:masterfrom
Jackie-Jiang:data_type_wiring
Aug 30, 2023
Merged

Fix the data type handling in multi-stage engine#11453
Jackie-Jiang merged 3 commits intoapache:masterfrom
Jackie-Jiang:data_type_wiring

Conversation

@Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented Aug 29, 2023

  • Fix the wiring issue for value passing:
    • Use internal stored value type within the engine (same as single-stage engine):
      • BOOLEAN: int
      • TIMESTAMP: long
      • BYTES: ByteArray
      • BOOLEAN_ARRAY: int[]
      • TIMESTAMP_ARRAY: long[]
    • Only convert the value to external value type when used in UDF
  • Enhance leaf stage to only convert data types when necessary
  • Fix the wiring issue for null values
  • Correctly handle null in FilterOperand

@Jackie-Jiang Jackie-Jiang added enhancement bugfix cleanup multi-stage Related to the multi-stage query engine labels Aug 29, 2023
@Jackie-Jiang Jackie-Jiang requested a review from xiangfu0 August 29, 2023 02:34
@codecov-commenter
Copy link

codecov-commenter commented Aug 29, 2023

Codecov Report

Merging #11453 (81b67da) into master (9e0752c) will decrease coverage by 0.07%.
Report is 1 commits behind head on master.
The diff coverage is 49.03%.

@@             Coverage Diff              @@
##             master   #11453      +/-   ##
============================================
- Coverage     62.98%   62.92%   -0.07%     
- Complexity     1099     1108       +9     
============================================
  Files          2302     2302              
  Lines        124040   124087      +47     
  Branches      18903    18960      +57     
============================================
- Hits          78131    78084      -47     
- Misses        40362    40460      +98     
+ Partials       5547     5543       -4     
Flag Coverage Δ
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 62.88% <49.03%> (-0.08%) ⬇️
java-17 62.77% <49.03%> (-0.09%) ⬇️
java-20 62.77% <49.03%> (-0.09%) ⬇️
temurin 62.92% <49.03%> (-0.07%) ⬇️
unittests 62.92% <49.03%> (-0.07%) ⬇️
unittests1 67.41% <49.03%> (-0.12%) ⬇️
unittests2 14.48% <0.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...ot/core/operator/blocks/InstanceResponseBlock.java 75.51% <ø> (ø)
...apache/pinot/query/planner/plannode/ValueNode.java 38.88% <0.00%> (+7.07%) ⬆️
.../apache/pinot/query/service/SubmissionService.java 84.61% <ø> (-3.62%) ⬇️
.../apache/pinot/common/datablock/DataBlockUtils.java 27.71% <14.49%> (-12.99%) ⬇️
...java/org/apache/pinot/common/utils/DataSchema.java 62.68% <16.92%> (-6.25%) ⬇️
...pache/pinot/query/planner/plannode/WindowNode.java 81.57% <40.00%> (-0.93%) ⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java 35.71% <40.00%> (-9.75%) ⬇️
.../pinot/query/runtime/operator/utils/TypeUtils.java 63.15% <58.82%> (-31.58%) ⬇️
.../pinot/query/planner/logical/LiteralHintUtils.java 81.48% <60.00%> (+13.91%) ⬆️
...sform/function/ScalarTransformFunctionWrapper.java 81.20% <62.50%> (-0.23%) ⬇️
... and 25 more

... and 15 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to move loops like this one to each own method. That would help the JIT to inline and optimize the loops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the loops are different (all combination of source type and dest type), and I guess the first thing JIT would do is to inline the method if we extract it out. We may revisit this in a separate thread

Copy link
Contributor

@gortiz gortiz Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I know performance is not our priority but:

Depending on numRows it may be better to copy all values like in the not nullable case and then do a second loop where we nullify the null specific rows. Also, we can ask nullBitmap whether all values from rowId to rowId + numRows are null. That should be a very fast operation in roaring bitmaps and in case it happens, we can skip the whole loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually considered that, then decided not to do that for now because it might introduce much higher overhead for larger fields (e.g. STRING and BYTES). The changes made in this PR (most of boilerplate code is auto-generated by AI :-P) is guaranteed to be better than the old code by using tighter loop and reduce branching. We may open a separate thread for further performance improvement discussion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the jvm would inline these sub-methods in this one because this one is very large. But you are right, we can discuss about that later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of internal and external in this context? It would be great to either specify that here or to include a link to some definition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Added more comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to my why do we use int as internal boolean value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments explaining it. The internal stored type for BOOLEAN is INT (similarly the stored type for TIMESTAMP is LONG), and we always use the stored type throughout the engine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, for example, we are wasting a lot of memory in case the array is not small. This is only used for multivalued booleans? If that I guess it is not that bad because it shouldn't be used that often

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. These conversion happens only when we need to switch between internal type and external type. During query time, that can happen during scalar function invocation and final result rendering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference here? Just that the casting is going to be more restrictive, right? I mean, performance should be the same but now it would fail if value is not exactly an Integer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main intention is not performance, but ensuring the wiring is correct and we can catch wrong wiring easily. Without this PR, a lot of wiring is wrong, but we are not able to catch them.
Added some comment explaining this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me why we add the conversion here. TBH the problem may be that is not clear to me when a value uses the stored type or the execution type.

What happens if we have two aggregation stages? Something like:

select distinct(c) from (
  select A, count() as c from T
    group by A
  ) as TableExpr

Do we convert values twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add more comments here. This conversion is not for internal/external type, but to solve the type difference between v1 and v2 engine. E.g. SUM in v1 always return double, but might return other types in v2 (standard SQL behavior). We use the TypeUtils as the bridge to convert v1 result to be used by v2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I found that we do call convertRow() twice, once in executor and once here.. Removing it from here since it is more clear to convert in the executor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some utility like isTrueValue(Object) would improve the legibility of these code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, stored type should provide this utility function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added utils in BooleanUtils

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this logic correct? I may not have the context, but it looks like this method tries to implement the SQL and logic. In that case null and false is false, but null and true is null. We are not enforcing the later here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Let me fix it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we missing handle IN and NOT IN or it's not required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not required. They are either pushed down, or planned as semi join

@Jackie-Jiang Jackie-Jiang merged commit 6c69be4 into apache:master Aug 30, 2023
@Jackie-Jiang Jackie-Jiang deleted the data_type_wiring branch August 30, 2023 06:20
@xiangfu0
Copy link
Contributor

Huge applause on this unified work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix cleanup enhancement multi-stage Related to the multi-stage query engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants