-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35984][SQL][TEST] Config to force applying shuffled hash join #33182
[SPARK-35984][SQL][TEST] Config to force applying shuffled hash join #33182
Conversation
cc @cloud-fan |
Test build #140559 has finished for PR 33182 at commit
|
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status success |
Kubernetes integration test status success |
Test build #140556 has finished for PR 33182 at commit
|
.doc("When true, force applying shuffled hash join even if the table sizes exceed the " + | ||
"threshold. This is for testing/benchmarking only. If this config is set to true, the " + | ||
"value spark.sql.join.perferSortMergejoin will be ignored.") | ||
.version("3.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we are on 3.3.0
now I think?
.internal() | ||
.doc("When true, force applying shuffled hash join even if the table sizes exceed the " + | ||
"threshold. This is for testing/benchmarking only. If this config is set to true, the " + | ||
"value spark.sql.join.perferSortMergejoin will be ignored.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: PREFER_SORTMERGEJOIN.key
instead of spark.sql.join.perferSortMergejoin
.
@@ -272,14 +272,14 @@ trait JoinSelectionHelper { | |||
val buildLeft = if (hintOnly) { | |||
hintToShuffleHashJoinLeft(hint) | |||
} else { | |||
hintToPreferShuffleHashJoinLeft(hint) || | |||
hintToPreferShuffleHashJoinLeft(hint) || conf.forceApplyShuffledHashJoin || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't want user to use this config, and this should be only taking effect in testing right? Should we add condition e.g. Utils.isTesting
?
16a0791
to
f3474a2
Compare
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #140613 has finished for PR 33182 at commit
|
@@ -419,6 +419,15 @@ object SQLConf { | |||
.booleanConf | |||
.createWithDefault(true) | |||
|
|||
val FORCE_APPLY_SHUFFLEDHASHJOIN = buildConf("spark.sql.join.forceApplyShuffledHashJoin") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can just hardcode test-only configs.
Kubernetes integration test starting |
Kubernetes integration test status success |
@@ -274,14 +275,17 @@ trait JoinSelectionHelper { | |||
} else { | |||
hintToPreferShuffleHashJoinLeft(hint) || | |||
(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) && | |||
muchSmaller(left, right)) | |||
muchSmaller(left, right)) || | |||
(Utils.isTesting && forceApplyShuffledHashJoin(conf)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can even move Utils.isTesting
into forceApplyShuffledHashJoin
Test build #140701 has finished for PR 33182 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
thanks, merging to master/3.2 (to improve test coverage) |
### What changes were proposed in this pull request? Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join during the join selection. ### Why are the changes needed? In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even if the `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed. Thus, we need another config to force the selection. ### Does this PR introduce _any_ user-facing change? No, only for testing ### How was this patch tested? newly added tests Verified all queries in join.sql will use `ShuffledHashJoin` when the config set to `true` Closes #33182 from linhongliu-db/SPARK-35984-hash-join-config. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7566db6) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Test build #140710 has finished for PR 33182 at commit
|
…in test in-joins.sql ### What changes were proposed in this pull request? We found the `in-join.sql` does not test shuffled hash join properly in https://issues.apache.org/jira/browse/SPARK-32577, but didn't find a good way to fix it. Given we now have a test config to enforce shuffled hash join in #33182, we can fix the test here now as well. ### Why are the changes needed? Fix test to have better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Reran the test to compare the output, and verified the query plan manually to make sure shuffled hash join being used. Closes #33236 from c21/join-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…in test in-joins.sql ### What changes were proposed in this pull request? We found the `in-join.sql` does not test shuffled hash join properly in https://issues.apache.org/jira/browse/SPARK-32577, but didn't find a good way to fix it. Given we now have a test config to enforce shuffled hash join in #33182, we can fix the test here now as well. ### Why are the changes needed? Fix test to have better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Reran the test to compare the output, and verified the query plan manually to make sure shuffled hash join being used. Closes #33236 from c21/join-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f3c1159) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join during the join selection. In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even if the `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed. Thus, we need another config to force the selection. No, only for testing newly added tests Verified all queries in join.sql will use `ShuffledHashJoin` when the config set to `true` Closes #33182 from linhongliu-db/SPARK-35984-hash-join-config. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Add a config
spark.sql.join.forceApplyShuffledHashJoin
to force applying shuffled hash joinduring the join selection.
Why are the changes needed?
In the
SQLQueryTestSuite
, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But evenif the
spark.sql.join.preferSortMergeJoin
is set tofalse
, shuffled hash join is still not guaranteed.Thus, we need another config to force the selection.
Does this PR introduce any user-facing change?
No, only for testing
How was this patch tested?
newly added tests
Verified all queries in join.sql will use
ShuffledHashJoin
when the config set totrue