Skip to content

[SPARK-38887][SQL] Support switch inner join side for sort merge join #36180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

Enhance the selection of SortMergeJoin streamed side and buffered side in JoinSelection, so that we can get a buffered side for better performance.

Why are the changes needed?

For an inner join type SortMergeJoin, it always uses the left side as streamed side and right side as buffered side.

Accoirding to the implementaion of SortMergeJoin, we expect the buffered side to be:

  • smaller than streamed side
  • has less duplicate data

We do not know whether the join will be SortMergeJoin at logical phase, so it should do this selection at physcial phase.

For example:

SELECT * FROM (
  SELECT c1 FROM a GROUP BY c1 
) t1 JOIN t2 ON t1.c1 = t2.c2

We can make t1 as the buffered side if its statistics size less than t2.

Does this PR introduce any user-facing change?

The plan may be changed and improve the performance.

How was this patch tested?

add test

@github-actions github-actions bot added the SQL label Apr 13, 2022
@ulysses-you ulysses-you marked this pull request as draft April 13, 2022 14:43
leftKeys, rightKeys, joinType, nonEquiCond, planLater(left), planLater(right))))
val (streamed, buffered) = getSortMergeJoinStreamedAndBufferedSide(joinType, leftKeys,
left, right, conf)
Some(Seq(joins.SortMergeJoinExec(leftKeys, rightKeys, joinType, nonEquiCond,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: does the order of output attributes matter?

iiuc, BHJ has an BuildSide enum and doesn't actually switch left/right children so it doesn't have this problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think we should change it back to the original smj output ordering after we switch join side.

@ulysses-you ulysses-you force-pushed the SPARK-38887 branch 4 times, most recently from 8e7b834 to bd41898 Compare April 19, 2022 12:40
@ulysses-you
Copy link
Contributor Author

ulysses-you commented Apr 19, 2022

Some performance number with the benchmark:

 OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Mac OS X 12.3.1
 Apple M1 Pro 
 sort merge join with buffered side duplicates, switched: true,:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------------------------------------------
 sort merge join with buffered side duplicates, switched: true, wholestage off           5175           5196          30          6.5         154.2       1.0X
 sort merge join with buffered side duplicates, switched: true, wholestage on            4991           5221         407          6.7         148.7       1.0X


 sort merge join with buffered side duplicates, switched: false,:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
 sort merge join with buffered side duplicates, switched: false, wholestage off           6291           6305          20          5.3         187.5       1.0X
 sort merge join with buffered side duplicates, switched: false, wholestage on            6635           6761         126          5.1         197.7       0.9X

@ulysses-you ulysses-you marked this pull request as ready for review April 19, 2022 12:41
@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 24, 2022
@github-actions github-actions bot closed this Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants