Fix the data type handling in multi-stage engine#11453
Fix the data type handling in multi-stage engine#11453Jackie-Jiang merged 3 commits intoapache:masterfrom
Conversation
da1a019 to
cb32040
Compare
Codecov Report
@@ Coverage Diff @@
## master #11453 +/- ##
============================================
- Coverage 62.98% 62.92% -0.07%
- Complexity 1099 1108 +9
============================================
Files 2302 2302
Lines 124040 124087 +47
Branches 18903 18960 +57
============================================
- Hits 78131 78084 -47
- Misses 40362 40460 +98
+ Partials 5547 5543 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 15 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
cb32040 to
09d4a39
Compare
There was a problem hiding this comment.
I would recommend to move loops like this one to each own method. That would help the JIT to inline and optimize the loops
There was a problem hiding this comment.
All the loops are different (all combination of source type and dest type), and I guess the first thing JIT would do is to inline the method if we extract it out. We may revisit this in a separate thread
There was a problem hiding this comment.
nit: I know performance is not our priority but:
Depending on numRows it may be better to copy all values like in the not nullable case and then do a second loop where we nullify the null specific rows. Also, we can ask nullBitmap whether all values from rowId to rowId + numRows are null. That should be a very fast operation in roaring bitmaps and in case it happens, we can skip the whole loop.
There was a problem hiding this comment.
I actually considered that, then decided not to do that for now because it might introduce much higher overhead for larger fields (e.g. STRING and BYTES). The changes made in this PR (most of boilerplate code is auto-generated by AI :-P) is guaranteed to be better than the old code by using tighter loop and reduce branching. We may open a separate thread for further performance improvement discussion
There was a problem hiding this comment.
I don't think the jvm would inline these sub-methods in this one because this one is very large. But you are right, we can discuss about that later
There was a problem hiding this comment.
What is the meaning of internal and external in this context? It would be great to either specify that here or to include a link to some definition
There was a problem hiding this comment.
Good point. Added more comments
There was a problem hiding this comment.
It is not clear to my why do we use int as internal boolean value
There was a problem hiding this comment.
Added more comments explaining it. The internal stored type for BOOLEAN is INT (similarly the stored type for TIMESTAMP is LONG), and we always use the stored type throughout the engine
There was a problem hiding this comment.
here, for example, we are wasting a lot of memory in case the array is not small. This is only used for multivalued booleans? If that I guess it is not that bad because it shouldn't be used that often
There was a problem hiding this comment.
Correct. These conversion happens only when we need to switch between internal type and external type. During query time, that can happen during scalar function invocation and final result rendering.
There was a problem hiding this comment.
Is there any difference here? Just that the casting is going to be more restrictive, right? I mean, performance should be the same but now it would fail if value is not exactly an Integer
There was a problem hiding this comment.
The main intention is not performance, but ensuring the wiring is correct and we can catch wrong wiring easily. Without this PR, a lot of wiring is wrong, but we are not able to catch them.
Added some comment explaining this
There was a problem hiding this comment.
It is not clear to me why we add the conversion here. TBH the problem may be that is not clear to me when a value uses the stored type or the execution type.
What happens if we have two aggregation stages? Something like:
select distinct(c) from (
select A, count() as c from T
group by A
) as TableExpr
Do we convert values twice?
There was a problem hiding this comment.
Let me add more comments here. This conversion is not for internal/external type, but to solve the type difference between v1 and v2 engine. E.g. SUM in v1 always return double, but might return other types in v2 (standard SQL behavior). We use the TypeUtils as the bridge to convert v1 result to be used by v2.
There was a problem hiding this comment.
Actually I found that we do call convertRow() twice, once in executor and once here.. Removing it from here since it is more clear to convert in the executor
There was a problem hiding this comment.
Some utility like isTrueValue(Object) would improve the legibility of these code.
There was a problem hiding this comment.
+1, stored type should provide this utility function.
There was a problem hiding this comment.
Added utils in BooleanUtils
There was a problem hiding this comment.
is this logic correct? I may not have the context, but it looks like this method tries to implement the SQL and logic. In that case null and false is false, but null and true is null. We are not enforcing the later here.
There was a problem hiding this comment.
Good catch. Let me fix it
There was a problem hiding this comment.
Did we missing handle IN and NOT IN or it's not required?
There was a problem hiding this comment.
Probably not required. They are either pushed down, or planned as semi join
09d4a39 to
fcab2bf
Compare
fcab2bf to
81b67da
Compare
|
Huge applause on this unified work! |
nullvaluesnullinFilterOperand