[SPARK-35984][SQL][TEST] Config to force applying shuffled hash join #33182

Closed

linhongliu-db
Contributor

@linhongliu-db linhongliu-db commented Jul 2, 2021

What changes were proposed in this pull request?

Add a config spark.sql.join.forceApplyShuffledHashJoin to force applying shuffled hash join
during the join selection.

Why are the changes needed?

In SQLQueryTestSuite, we want join.sql to cover all three join strategies: broadcast hash join (BHJ),
shuffled hash join (SHJ), and sort-merge join (SMJ). But even when spark.sql.join.preferSortMergeJoin is
set to false, shuffled hash join is not guaranteed, because the size-based checks in join selection must
still pass. Thus, we need another config to force the selection.
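
To illustrate why turning off spark.sql.join.preferSortMergeJoin alone is not enough, here is a minimal Scala sketch of the size-based gate. It is a simplified model and not Spark's code: `Side`, `Conf`, `hashMapSizeLimit`, and the factor of 3 inside `muchSmaller` are assumptions made for illustration. Only the overall shape of the condition follows the join selection logic touched by this PR.

```scala
// Simplified model of the size-based shuffled hash join check.
// Not Spark code: Side, Conf, and the thresholds are illustrative stand-ins.
object ShjEligibilitySketch {
  final case class Side(sizeInBytes: BigInt)

  final case class Conf(
      preferSortMergeJoin: Boolean,
      hashMapSizeLimit: BigInt) // assumed cap for building a local hash map

  // Stand-in for canBuildLocalHashMapBySize: the build side must fit under the cap.
  def canBuildLocalHashMapBySize(side: Side, conf: Conf): Boolean =
    side.sizeInBytes < conf.hashMapSizeLimit

  // Stand-in for muchSmaller: the factor 3 is an assumption for this sketch.
  def muchSmaller(a: Side, b: Side): Boolean =
    a.sizeInBytes * 3 <= b.sizeInBytes

  // True when the left side may serve as the build side of a shuffled hash join.
  // All three conditions must hold, so preferSortMergeJoin = false is necessary
  // but not sufficient.
  def canBuildLeft(left: Side, right: Side, conf: Conf): Boolean =
    !conf.preferSortMergeJoin &&
      canBuildLocalHashMapBySize(left, conf) &&
      muchSmaller(left, right)
}
```

With `preferSortMergeJoin = false`, two sides of similar size still fail `muchSmaller`, so the model falls through to another join strategy. That is the gap a forcing config closes.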

Does this PR introduce any user-facing change?

No, only for testing

How was this patch tested?

newly added tests
Verified all queries in join.sql will use ShuffledHashJoin when the config is set to true
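
A check like the one described above could be automated roughly as follows. This is only a sketch under stated assumptions: it inspects plan text as a plain string rather than calling any Spark API, and `PlanCheckSketch`, `joinOperators`, and `usesOnlyShuffledHashJoin` are hypothetical names introduced here.

```scala
// Sketch: verify from physical plan text that every join is a shuffled hash join.
// Works on a plain string; no Spark dependency. All names are hypothetical.
object PlanCheckSketch {
  // Join operator names this sketch looks for in plan text.
  private val JoinOps = Seq("BroadcastHashJoin", "ShuffledHashJoin", "SortMergeJoin")

  // Returns the join operator found on each plan line, in order of appearance.
  // Lines without a known join operator contribute nothing.
  def joinOperators(planText: String): Seq[String] =
    planText.split("\n").toSeq.flatMap { line =>
      JoinOps.find(op => line.contains(op)).toList
    }

  // True only if the plan has at least one join and all joins are ShuffledHashJoin.
  def usesOnlyShuffledHashJoin(planText: String): Boolean = {
    val ops = joinOperators(planText)
    ops.nonEmpty && ops.forall(_ == "ShuffledHashJoin")
  }
}
```

Requiring at least one join avoids a vacuous pass on plans that contain no join at all.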

@github-actions github-actions bot added the SQL label Jul 2, 2021
@linhongliu-db
Contributor Author

cc @cloud-fan

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140559 has finished for PR 33182 at commit 16a0791.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45068/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45071/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45068/

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45071/

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140556 has finished for PR 33182 at commit 24b39a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.doc("When true, force applying shuffled hash join even if the table sizes exceed the " +
"threshold. This is for testing/benchmarking only. If this config is set to true, the " +
"value spark.sql.join.perferSortMergejoin will be ignored.")
.version("3.2.0")
Contributor

nit: we are on 3.3.0 now I think?

.internal()
.doc("When true, force applying shuffled hash join even if the table sizes exceed the " +
"threshold. This is for testing/benchmarking only. If this config is set to true, the " +
"value spark.sql.join.perferSortMergejoin will be ignored.")
Contributor

nit: PREFER_SORTMERGEJOIN.key instead of spark.sql.join.perferSortMergejoin.

@@ -272,14 +272,14 @@ trait JoinSelectionHelper {
     val buildLeft = if (hintOnly) {
       hintToShuffleHashJoinLeft(hint)
     } else {
-      hintToPreferShuffleHashJoinLeft(hint) ||
+      hintToPreferShuffleHashJoinLeft(hint) || conf.forceApplyShuffledHashJoin ||
Contributor

I think we don't want users to use this config, and it should only take effect in testing, right? Should we add a condition, e.g. Utils.isTesting?

@SparkQA

SparkQA commented Jul 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45126/

@SparkQA

SparkQA commented Jul 3, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45126/

@SparkQA

SparkQA commented Jul 3, 2021

Test build #140613 has finished for PR 33182 at commit f3474a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -419,6 +419,15 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val FORCE_APPLY_SHUFFLEDHASHJOIN = buildConf("spark.sql.join.forceApplyShuffledHashJoin")
Contributor

we can just hardcode test-only configs.

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45212/

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45212/

@@ -274,14 +275,17 @@ trait JoinSelectionHelper {
     } else {
       hintToPreferShuffleHashJoinLeft(hint) ||
         (!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) &&
-          muchSmaller(left, right))
+          muchSmaller(left, right)) ||
+        (Utils.isTesting && forceApplyShuffledHashJoin(conf))
Contributor

nit: we can even move Utils.isTesting into forceApplyShuffledHashJoin
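
For illustration, the two review suggestions (hardcode the test-only key, and fold the testing check into the helper) might combine as in the sketch below. It models the idea without Spark on the classpath and is not the merged code: the config is a plain `Map`, the `testing` flag stands in for `Utils.isTesting`, and `ForceShjGateSketch`, `ForceShjKey`, and the `canBuildLeft` signature are assumptions.

```scala
// Sketch of a test-only gate, modeled without Spark on the classpath.
object ForceShjGateSketch {
  // Hardcoded key: a test-only config needs no public config entry.
  val ForceShjKey = "spark.sql.join.forceApplyShuffledHashJoin"

  // The testing check lives inside the helper, so call sites need a single call.
  // `testing` is a stand-in for Utils.isTesting in this sketch.
  def forceApplyShuffledHashJoin(
      confs: Map[String, String],
      testing: Boolean): Boolean =
    testing && confs.getOrElse(ForceShjKey, "false") == "true"

  // Call-site shape: a hint or the size-based check still works as before,
  // and the forced path is only reachable in test mode.
  def canBuildLeft(
      hinted: Boolean,
      sizeBased: Boolean,
      confs: Map[String, String],
      testing: Boolean): Boolean =
    hinted || sizeBased || forceApplyShuffledHashJoin(confs, testing)
}
```

With this shape, setting the key outside of test mode has no effect, which matches the intent that users should not rely on it.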

@SparkQA

SparkQA commented Jul 6, 2021

Test build #140701 has finished for PR 33182 at commit d0dfd8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45220/

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45220/

@cloud-fan cloud-fan changed the title from "[SPARK-35984][SQL] Config to force applying shuffled hash join" to "[SPARK-35984][SQL][TEST] Config to force applying shuffled hash join" on Jul 6, 2021
@cloud-fan
Contributor

thanks, merging to master/3.2 (to improve test coverage)

@cloud-fan cloud-fan closed this in 7566db6 Jul 6, 2021
cloud-fan pushed a commit that referenced this pull request Jul 6, 2021
### What changes were proposed in this pull request?
Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join
during the join selection.

### Why are the changes needed?
In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even
if the `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed.
Thus, we need another config to force the selection.

### Does this PR introduce _any_ user-facing change?
No, only for testing

### How was this patch tested?
newly added tests
Verified all queries in join.sql will use `ShuffledHashJoin` when the config is set to `true`

Closes #33182 from linhongliu-db/SPARK-35984-hash-join-config.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7566db6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Jul 6, 2021

Test build #140710 has finished for PR 33182 at commit 81fdaae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Jul 7, 2021
…in test in-joins.sql

### What changes were proposed in this pull request?

We found in https://issues.apache.org/jira/browse/SPARK-32577 that `in-joins.sql` does not test shuffled hash join properly, but didn't find a good way to fix it at the time. Now that #33182 adds a test config to enforce shuffled hash join, we can fix the test here as well.

### Why are the changes needed?

Fix test to have better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Reran the test to compare the output, and verified the query plan manually to make sure shuffled hash join is being used.

Closes #33236 from c21/join-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Jul 7, 2021
…in test in-joins.sql

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f3c1159)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
wangyum pushed a commit that referenced this pull request May 26, 2023
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>