[multistage] bridge v2 query engine for leaf stage v1 multi-value column#11117
[multistage] bridge v2 query engine for leaf stage v1 multi-value column#11117xiangfu0 merged 8 commits intoapache:masterfrom
Conversation
df83612 to
35693aa
Compare
Codecov Report
@@ Coverage Diff @@
## master #11117 +/- ##
==========================================
- Coverage 0.11% 0.11% -0.01%
==========================================
Files 2218 2218
Lines 119113 119125 +12
Branches 18021 18023 +2
==========================================
Hits 137 137
- Misses 118956 118968 +12
Partials 20 20
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
|
it feels like the other way around right? multivalue_to_array? |
7134ae1 to
cc18f23
Compare
a0732bd to
bc3883d
Compare
pinot-common/src/main/java/org/apache/pinot/common/function/FunctionRegistry.java
Outdated
Show resolved
Hide resolved
...va/org/apache/pinot/core/operator/transform/function/ArrayToMultiValueTransformFunction.java
Outdated
Show resolved
Hide resolved
bc3883d to
ee8c112
Compare
...-tests/src/test/java/org/apache/pinot/integration/tests/MultiStageEngineIntegrationTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
IMO this is confusing name. the data type is already MV running an ARRAY_TO_MV is a bit weird.
should we named it USE_AS_MV? and we can say that MV columns are by default USE_AS_ARRAY
There was a problem hiding this comment.
USE_AS_MV might be confusing, since we repurposed MV as ARRAY in v2, ARRAY_TO_MV might be more explicit.
There was a problem hiding this comment.
yeah we need to figure out a proper way b/c out put of a select is an array, but the table config / schema will still call this as MV column as convension. both have some confusion, but as long as the document is proper we should be good.
There was a problem hiding this comment.
will it work if I run
SELECT count(*), arrayToMV(RandomAirports)
FROM mytable
WHERE Dest IN (SELECT Dest FROM myTable GROUP BY Dest HAVING count(*) > 10)
(later when we implemented the scalar function wrapper)
There was a problem hiding this comment.
You need a group by arrayToMV(RandomAirports) ?
There was a problem hiding this comment.
This only works at leaf stage not intermediate stage.
There was a problem hiding this comment.
the grand question is whether we continue to support MV as a data type or collapse into ARRAY.
if the answer to the above question is NO, this PR is not needed.
if the answer is YES. there are several concerns. i thought about it a bit last night. the problem is 2 fold.
- how do we support it conforming with ANSI syntax
- how do we make sure v1 perform the same as before.
for Q1: the answer is probably MULTI_SET
- it is a SET with no ordering but can contain duplicates, suitable for our use case for GROUP-BY and FILTER, which they are considered as expanded/unnested during eval.
- but not every use cases is considered as MULTI_SET, for example selection will consider MV as an ARRAY
therefore,
- an explicit
USE_AS_MULTISET(mv)is desired to bridge the syntatic gap - explicit
USE_AS_ARRAY(mv)is the default (considering select * there's no reason to ask user explicitly put this in
for Q2: the problem is these USE_AS_*** methods are considered transform, which shouldn't be the case there's a simply way to solve this --> in CalciteRexExpressionUtils, we can simply ignored the functionCall and directly return the operand as a reference in V1
b/c V1 is SqlNode and bares no type info, this means directly putting the operand input reference without the USE_AS_*** type conversion will work naturally with the V1 context.
so in short my proposal is
- have
USE_AS_ARRAYandUSE_AS_MULTISETas scalarFunction wrappers (unimplemented) to bridge the gap of the syntactic problem on calcite - in PhysicalPlanner explicitly exclude these syntatitic functions and directly drop to the operand;
- in the mid term we will have MV, MULTI_SET, ARRAY as 3 "co-existing" types, but in V1 there's only MV; in V2 MV is considered by default as ARRAY, unless
USE_AS_MULTISETis used. - in the long term, once we fully found all alternatives in standard SQL syntax for MV, we can safely consider MV as ARRAY. and only do MULTISET when necessary (i can only see it being useful in
MV > 5situation, where there's no array equivalent, only multiset equivalent)
WDYT? Please let me know
bc88977 to
252367f
Compare
MV column in V2 is modeled as ARRAY by default.
|
50b400d to
62e5d0c
Compare
|
Test failures requiring fix by #11151 |
There was a problem hiding this comment.
yeah we need to figure out a proper way b/c out put of a select is an array, but the table config / schema will still call this as MV column as convension. both have some confusion, but as long as the document is proper we should be good.
There was a problem hiding this comment.
let's first create a component return type registry on @ScalarFunction so we dont have to modify the transform function side.
There was a problem hiding this comment.
actually ignored previous comment, did a bit of research and it seems like registering TransformFunctionType without an actual impl is better than having to parse scalar function annotation, which doesn't really allow anything other than primitives
There was a problem hiding this comment.
I will have a different pr for this.
6e7e739 to
a16c4f0
Compare
a16c4f0 to
6d65102
Compare
|
rebase to master |
pinot-common/src/main/java/org/apache/pinot/common/function/FunctionRegistry.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Is function name always upper case?
There was a problem hiding this comment.
Let's canonicalized and just do ARRAY_TO_MV
pinot-query-planner/src/main/java/org/apache/pinot/query/parser/CalciteRexExpressionParser.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Let's canonicalized and just do ARRAY_TO_MV
0e81e15 to
e33c7a3
Compare
…luePredicateGenerators
e33c7a3 to
82d2a56
Compare
|
Part of this #10658 |
…umn (apache#11117) * [multistage] bridge v2 query engine for leaf stage v1 group by multi-value column * use multi-set * Change multi-value type back to array * rewrite arrayToMV at leaf stage * Enable more tests * fix integration tests with generated queries * Address comments * Take out MultiValueBetweenPredicateGenerator from _multistageSingleValuePredicateGenerators
Introduces a new function
arrayToMultiValueto bridge v2 query engine for leaf stage v1 group by multi-value column.For query:
SELECT count(*), RandomAirports FROM airlineStats GROUP BY RandomAirports, the plan isThe stacktrace is:
After the bridge, query is:
SELECT count(*), arrayToMV(RandomAirports) FROM airlineStats GROUP BY arrayToMV(RandomAirports)The plan is
Also it supports predicate for multi-value in v2: