Skip to content

[multistage] support inequality JOIN#9448

Merged
siddharthteotia merged 3 commits intoapache:masterfrom
walterddr:pr_inequality_join
Sep 29, 2022
Merged

[multistage] support inequality JOIN#9448
siddharthteotia merged 3 commits intoapache:masterfrom
walterddr:pr_inequality_join

Conversation

@walterddr
Copy link
Contributor

@walterddr walterddr commented Sep 22, 2022

support inequality join with nested loop.

adding nested loop directly on hash join algorithm

  • this is not the most efficient way to join as when there's no equality join key we should not use hash join anyway
  • when there's no join key, use broadcast assuming the right table is smaller.

TODO

  • implement other join algorithms

@walterddr walterddr marked this pull request as ready for review September 22, 2022 21:10
@codecov-commenter
Copy link

codecov-commenter commented Sep 22, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.36%. Comparing base (2daa863) to head (022f5ef).
Report is 2961 commits behind head on master.

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #9448       +/-   ##
=============================================
+ Coverage     35.09%   68.36%   +33.27%     
- Complexity      189     5144     +4955     
=============================================
  Files          1915     1915               
  Lines        101994   102056       +62     
  Branches      15468    15481       +13     
=============================================
+ Hits          35791    69768    +33977     
+ Misses        63158    27338    -35820     
- Partials       3045     4950     +1905     
Flag Coverage Δ
integration1 ?
integration2 24.68% <0.00%> (?)
unittests1 67.11% <100.00%> (?)
unittests2 15.53% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just reviewing for my own personal learning and ramp-up :) feel free to disregard any comments that don't make sense.

LGTM though


if (joinInfo.leftKeys.isEmpty()) {
// when there's no JOIN key, use broadcast.
leftExchange = LogicalExchange.create(leftInput, RelDistributions.SINGLETON);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to make sure that I'm understanding this correctly - using SINGLETON on the left and BROADCAST_DISTRIBUTED on the right would mean that each row in the right table is broadcasted to every node that hosts a segment of the left table, and the left tables remain in place (essentially not exchanged at all)?

when there's no join key, use broadcast assuming the right table is smaller.

I noticed that there's a computeSelfCost on RelNode which wires down to basically getting a row count. Is there a way for us to hook that in to make sure that we're actually choosing the smaller table? Obviously this can be left for a follow-up, just wanted to learn :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cost is not enabled. it is either zero or infinite right now. and yes the assumption is correct.

// push a filter into a join, replaced CoreRules.FILTER_INTO_JOIN with special config
PinotFilterIntoJoinRule.INSTANCE,
// push a filter into a join
CoreRules.FILTER_INTO_JOIN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an aside, how do we choose which rules to use here? I notice we don't include, for example, SORT_JOIN_TRANSPOSE which can push sorts down in the case of LEFT/RIGHT OUTER JOIN but that isn't here?

Copy link
Contributor Author

@walterddr walterddr Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the rules "available but not use" generally means one of the other

  • they work best with cost factory that's not a dummy
  • they might cause sideeffect we don't want

thus we leave that out due to "preserve correctness over performant" rule

for (Object[] rightRow : hashCollection) {
rows.add(joinRow(leftRow, rightRow));
Object[] resultRow = joinRow(leftRow, rightRow);
if (_joinClauseEvaluators.isEmpty() || _joinClauseEvaluators.stream().allMatch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like _joinClauseEvaluators.isEmpty() is true if there are only non-equijoin conditions - I didn't fully look into calcite code, but how does it handle a situation with both equi and non-equi joins (e.g. JOIN ON a.col1 = b.col1 AND a.col2 > b.col2)?

can we add a test for that scenario?

Copy link
Contributor Author

@walterddr walterddr Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calcite automatically splits into join keys and non-eq join conditions; we already have a test for this. see QueryRunnerTest (remember WHERE clause pulled up is applied if predicate involve columns on both table)

@siddharthteotia siddharthteotia merged commit 3057712 into apache:master Sep 29, 2022
61yao pushed a commit to 61yao/pinot that referenced this pull request Oct 3, 2022
* support inequality JOIN

* also support pure inequality join

* address diff comment and add a SEMI join test case for this as well

Co-authored-by: Rong Rong <rongr@startree.ai>
@walterddr walterddr deleted the pr_inequality_join branch December 6, 2023 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants