[SPARK-37592][SQL] Improve performance of JoinSelection #34844

Closed
wants to merge 4 commits into from

Conversation

Contributor

beliefer commented Dec 9, 2021

What changes were proposed in this pull request?

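While reading the AQE implementation, I found that the code path that selects a join based on hints contains a lot of cumbersome code.

Join hints have a relatively steep learning curve for users, so in most cases the SQL does not contain any join hint. This PR therefore checks for an empty hint first and goes straight to the no-hint selection path.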
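
A minimal sketch of the idea, using a toy model rather than Spark's real `JoinSelection` strategy: the hint checks form an `Option` chain, and an empty hint can bypass that chain entirely. The names `JoinHint`, `tryHint`, `hintChecks`, and the strategy objects are hypothetical and exist only for illustration; `createJoinWithoutHint` borrows its name from the diff, but its body here is a stand-in.

```scala
// Toy model of hint-driven join selection. Nothing here is Spark's actual API.
object JoinSelectionSketch {
  sealed trait Strategy
  case object BroadcastHash extends Strategy
  case object SortMerge extends Strategy
  case object ShuffleHash extends Strategy
  case object Cartesian extends Strategy

  // A hint requests zero or more strategies (hypothetical shape).
  final case class JoinHint(requested: Set[Strategy]) {
    def isEmpty: Boolean = requested.isEmpty
  }

  // Instrumentation for the sketch: how many hint checks were evaluated.
  var hintChecks = 0

  private def tryHint(hint: JoinHint, s: Strategy): Option[Strategy] = {
    hintChecks += 1
    if (hint.requested.contains(s)) Some(s) else None
  }

  // Stand-in for the statistics-based selection used when no hint applies.
  private def createJoinWithoutHint(): Strategy = SortMerge

  // Before: every join walks the whole hint chain, even without a hint.
  def selectAlwaysChain(hint: JoinHint): Strategy =
    tryHint(hint, BroadcastHash)
      .orElse(tryHint(hint, SortMerge))
      .orElse(tryHint(hint, ShuffleHash))
      .orElse(tryHint(hint, Cartesian))
      .getOrElse(createJoinWithoutHint())

  // After: an empty hint skips the chain entirely.
  def selectWithFastPath(hint: JoinHint): Strategy =
    if (hint.isEmpty) createJoinWithoutHint()
    else selectAlwaysChain(hint)
}
```

In this model both functions return the same strategy for every hint; the fast path only avoids the four wasted checks in the common no-hint case.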

Why are the changes needed?

Improve performance of JoinSelection

Does this PR introduce any user-facing change?

No.
This only changes the internal implementation.

How was this patch tested?

Jenkins test.

@github-actions github-actions bot added the SQL label Dec 9, 2021

SparkQA commented Dec 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50507/

SparkQA commented Dec 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50507/

SparkQA commented Dec 9, 2021

Test build #146031 has finished for PR 34844 at commit aa49f15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

beliefer commented Dec 9, 2021

ping @cloud-fan

@@ -548,11 +548,9 @@ case class ApplyColumnarRulesAndInsertTransitions(

   def apply(plan: SparkPlan): SparkPlan = {
     var preInsertPlan: SparkPlan = plan
-    columnarRules.foreach((r : ColumnarRule) =>
-      preInsertPlan = r.preColumnarTransitions(preInsertPlan))
+    columnarRules.foreach( r => preInsertPlan = r.preColumnarTransitions(preInsertPlan))
Contributor

how is this related to JoinSelection?

Contributor Author

Just simplify the code.

-        .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
-        .getOrElse(createJoinWithoutHint())
+        if (hint.isEmpty) {
+          createJoinWithoutHint()
Contributor

this change LGTM

Contributor

Can we do this in case logical.Join(left, right, joinType, condition, hint) ... as well?

Contributor Author

OK

SparkQA commented Dec 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50533/

SparkQA commented Dec 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50533/

SparkQA commented Dec 10, 2021

Test build #146058 has finished for PR 34844 at commit 56919dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

beliefer and others added 2 commits December 13, 2021 14:05
…r.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…r.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

SparkQA commented Dec 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50595/

SparkQA commented Dec 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50595/

SparkQA commented Dec 13, 2021

Test build #146120 has finished for PR 34844 at commit 3ce77ee.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

SparkQA commented Dec 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50611/

SparkQA commented Dec 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50611/

SparkQA commented Dec 13, 2021

Test build #146137 has finished for PR 34844 at commit 3ce77ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 77b164a Dec 14, 2021
@beliefer
Contributor Author

@cloud-fan Thanks a lot!

@dongjoon-hyun
Member

+1, LGTM.

wangyum pushed a commit that referenced this pull request May 26, 2023
* [SPARK-36992][SQL] Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

### What changes were proposed in this pull request?

Unify the getPrefix function of `UTF8String` and `ByteArray`.

### Why are the changes needed?

When executing a sort operator, we first compare prefixes. However, the `getPrefix` function for byte arrays is slow: the prefix is the first 8 bytes, so it calls `Platform.getByte` up to 8 times, which is slower than a single call to `Platform.getInt` or `Platform.getLong`.
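
The difference can be illustrated with a small sketch. It uses `java.nio.ByteBuffer` as a stand-in for `Platform.getByte` and `Platform.getLong`, and it sidesteps byte-order concerns by always reading big-endian; the object and method names are hypothetical and this is not the code from the PR.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object PrefixSketch {
  // Byte-at-a-time: up to 8 separate reads, each shifted into place.
  def prefixByBytes(bytes: Array[Byte]): Long = {
    val n = math.min(bytes.length, 8)
    var p = 0L
    var i = 0
    while (i < n) {
      p |= (bytes(i) & 0xFFL) << (56 - 8 * i)
      i += 1
    }
    p
  }

  // Word-at-a-time: a single 8-byte read when enough bytes exist,
  // falling back to the byte loop for shorter inputs.
  def prefixByWord(bytes: Array[Byte]): Long =
    if (bytes.length >= 8) {
      ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong(0)
    } else {
      prefixByBytes(bytes)
    }

  // Prefixes must be compared as unsigned values to match byte-wise ordering.
  def comparePrefix(a: Array[Byte], b: Array[Byte]): Int =
    java.lang.Long.compareUnsigned(prefixByWord(a), prefixByWord(b))
}
```

Both functions produce the same prefix; the second simply replaces up to eight reads with one when the input is long enough.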

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passes `org.apache.spark.util.collection.unsafe.sort.PrefixComparatorsSuite`.

Closes #34267 from ulysses-you/binary-prefix.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37037][SQL] Improve byte array sort by unify compareTo function of UTF8String and ByteArray

### What changes were proposed in this pull request?

Unify the compare function of `UTF8String` and `ByteArray`.

### Why are the changes needed?

`BinaryType` uses `TypeUtils.compareBinary` to compare two byte arrays. However, it is slow because it compares the arrays byte by byte using unsigned int comparison.

We can instead compare them 8 bytes at a time using `Platform.getLong` with unsigned long comparison when they have more than 8 bytes. Here is some history about this `TODO`: https://github.com/apache/spark/pull/6755/files#r32197461

The benchmark results should be the same as for `UTF8String` and can be found in #19180 (comment).
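
A hedged sketch of the two comparison styles, again using `java.nio.ByteBuffer` (big-endian) in place of `Platform.getLong`. The object and method names are hypothetical; the sketch only shows that comparing 8 bytes per step as unsigned longs yields the same ordering as the byte-by-byte loop.

```scala
import java.nio.ByteBuffer

object CompareSketch {
  // Baseline: unsigned comparison one byte at a time.
  def compareByBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Word-at-a-time: compare 8 bytes per step as unsigned longs,
  // then finish the remaining tail byte by byte.
  def compareByWords(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    val wa = ByteBuffer.wrap(a) // ByteBuffer is big-endian by default
    val wb = ByteBuffer.wrap(b)
    var i = 0
    while (i + 8 <= n) {
      val x = wa.getLong(i)
      val y = wb.getLong(i)
      if (x != y) return java.lang.Long.compareUnsigned(x, y)
      i += 8
    }
    while (i < n) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }
}
```

Only the sign of the result is meaningful; the two functions may return different magnitudes for the same inputs.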

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Move test from `TypeUtilsSuite` to `ByteArraySuite`

Closes #34310 from ulysses-you/SPARK-37037.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37341][SQL] Avoid unnecessary buffer and copy in full outer sort merge join

### What changes were proposed in this pull request?

FULL OUTER sort merge join (non-code-gen path) [copies join keys and buffers input rows, even when rows from both sides do not have matched keys](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641). This is unnecessary, as we can just output the row with smaller join keys, and only buffer when both sides have matched keys. This would save us from unnecessary copy and buffer, when both join sides have a lot of rows not matched with each other.
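
The buffering rule can be sketched on plain in-memory sequences. This is a simplified model under stated assumptions (inputs already sorted by key, non-null string keys, a hypothetical `Row` type), not the `SortMergeJoinExec` implementation: the side with the smaller key is emitted immediately, and runs of rows are collected only when both sides hold the same key.

```scala
import scala.collection.mutable.ArrayBuffer

object FullOuterMergeSketch {
  type Row = (String, Int) // (join key, payload); illustrative only

  // Full outer join of two inputs that are already sorted by key.
  def fullOuterJoin(
      left: IndexedSeq[Row],
      right: IndexedSeq[Row]): Seq[(Option[Row], Option[Row])] = {
    val out = ArrayBuffer.empty[(Option[Row], Option[Row])]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val cmp = left(i)._1.compareTo(right(j)._1)
      if (cmp < 0) {
        // Left key is smaller, so it cannot match: emit without buffering.
        out += ((Some(left(i)), None))
        i += 1
      } else if (cmp > 0) {
        // Right key is smaller: same reasoning, mirrored.
        out += ((None, Some(right(j))))
        j += 1
      } else {
        // Keys match: only now gather the runs of equal keys on both sides.
        val key = left(i)._1
        val leftStart = i
        while (i < left.length && left(i)._1 == key) i += 1
        val rightStart = j
        while (j < right.length && right(j)._1 == key) j += 1
        for (l <- left.slice(leftStart, i); r <- right.slice(rightStart, j)) {
          out += ((Some(l), Some(r)))
        }
      }
    }
    // Drain whichever side still has rows; none of them can match.
    while (i < left.length) { out += ((Some(left(i)), None)); i += 1 }
    while (j < right.length) { out += ((None, Some(right(j)))); j += 1 }
    out.toSeq
  }
}
```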

### Why are the changes needed?

Improves query performance for FULL OUTER sort merge join when code-gen is disabled.
This benefits queries where both sides have many unmatched rows and the join key is large (e.g. string type).

Example micro benchmark:

```
  def sortMergeJoin(): Unit = {
    val N = 2 << 20
    codegenBenchmark("sort merge join", N) {
      val df1 = spark.range(N).selectExpr(s"cast(id * 15485863 as string) as k1")
      val df2 = spark.range(N).selectExpr(s"cast(id * 15485867 as string) as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
      df.noop()
    }
  }
```

Seeing over 60% higher throughput (1.6X relative; average time per iteration drops from 2005 ms to 1191 ms):

```
Running benchmark: sort merge join
  Running case: sort merge join without optimization
  Stopped after 5 iterations, 10026 ms
  Running case: sort merge join with optimization
  Stopped after 5 iterations, 5954 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
sort merge join:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
sort merge join without optimization               1807           2005         157          1.2         861.4       1.0X
sort merge join with optimization                  1135           1191          62          1.8         541.1       1.6X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests e.g. `OuterJoinSuite.scala`.

Closes #34612 from c21/smj-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val

### What changes were proposed in this pull request?

This PR adds caching to `LogicalPlan.isStreaming()`: the default implementation's result will now be cached in a `private lazy val`.

### Why are the changes needed?

This improves the performance of the `DeduplicateRelations` analyzer rule.

The default implementation of `isStreaming` recursively visits every node in the tree. `DeduplicateRelations.renewDuplicatedRelations` is recursively invoked on every node in the tree and each invocation calls `isStreaming`. This leads to `O(n^2)` invocations of `isStreaming` on leaf nodes.

Caching `isStreaming` avoids this performance problem.
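
A toy illustration of the effect, assuming a minimal plan tree instead of Spark's `LogicalPlan`. The `Plan`, `Leaf`, `Unary`, and `computations` names are hypothetical; `visitAll` mimics a rule that asks every node whether it is streaming.

```scala
object IsStreamingSketch {
  // Instrumentation for the sketch: how many times the check was computed.
  var computations = 0

  sealed trait Plan {
    def children: Seq[Plan]
    protected def selfStreaming: Boolean = false

    // Uncached: every call walks the whole subtree again.
    def isStreamingUncached: Boolean = {
      computations += 1
      selfStreaming || children.exists(_.isStreamingUncached)
    }

    // Cached: each node computes its answer at most once.
    lazy val isStreamingCached: Boolean = {
      computations += 1
      selfStreaming || children.exists(_.isStreamingCached)
    }
  }

  final case class Leaf(streaming: Boolean) extends Plan {
    def children: Seq[Plan] = Nil
    override protected def selfStreaming: Boolean = streaming
  }

  final case class Unary(child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }

  // Mimics a rule that runs the check on every node of the tree.
  def visitAll(p: Plan)(check: Plan => Boolean): Unit = {
    check(p)
    p.children.foreach(c => visitAll(c)(check))
  }
}
```

For a chain of four nodes, the uncached variant performs 4 + 3 + 2 + 1 = 10 computations, while the cached variant performs 4, one per node.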

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness should be covered by existing tests.

This significantly improved `DeduplicateRelations` performance in local microbenchmarking with large query plans (~20% reduction in that rule's runtime in one of my tests).

Closes #34691 from JoshRosen/cache-LogicalPlan.isStreaming.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37530][CORE] Spark reads many paths very slow though newAPIHadoopFile

### What changes were proposed in this pull request?

Same as #18441, we parallelize FileInputFormat.listStatus for newAPIHadoopFile

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when accessing external storage on the driver side; this PR improves performance by parallelizing the listing.
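
As a rough sketch of the general technique (not the code in this PR), per-path listing calls can be issued concurrently with Scala futures. Here `listOne` is a hypothetical stand-in for a slow listing call against external storage, and the one-minute timeout is an arbitrary choice for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelListingSketch {
  // Sequential baseline: total latency is the sum of all listing calls.
  def listSequential(paths: Seq[String], listOne: String => Seq[String]): Seq[String] =
    paths.flatMap(listOne)

  // Parallel variant: listing calls overlap, and the result order is preserved.
  def listParallel(paths: Seq[String], listOne: String => Seq[String]): Seq[String] = {
    val futures = paths.map(p => Future(listOne(p)))
    Await.result(Future.sequence(futures), 1.minute).flatten
  }
}
```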

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passes GitHub Actions (GA).

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>

* [SPARK-37592][SQL] Improve performance of `JoinSelection`

While reading the AQE implementation, I found that the code path that selects a join based on hints contains a lot of cumbersome code.

Join hints have a relatively steep learning curve for users, so in most cases the SQL does not contain any join hint.

Improve performance of `JoinSelection`

No.
This only changes the internal implementation.

Jenkins test.

Closes #34844 from beliefer/SPARK-37592-new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37646][SQL] Avoid touching Scala reflection APIs in the lit function

### What changes were proposed in this pull request?

This PR proposes to avoid touching Scala reflection APIs in the lit function.

### Why are the changes needed?

Currently `lit` calls `typedlit[Any]` and touches Scala reflection APIs unnecessarily. Scala reflection APIs take multiple global locks, so they are quite slow when parallelism is high.

This PR inlines `typedlit` to `lit` and replaces `Literal.create` with `Literal.apply` to avoid touching Scala reflection APIs. There is no behavior change.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- New unit tests.
- Manually ran the test in https://issues.apache.org/jira/browse/SPARK-37646 and saw no difference between `new Column(Literal(0L))` and `lit(0L)`.

Closes #34901 from zsxwing/SPARK-37646.

Lead-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Co-authored-by: Shixiong Zhu <shixiong@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

* [SPARK-37689][SQL] Expand should be supported in PropagateEmptyRelation

We hit a case where, even though there is an empty relation, `HashAggregateExec` is still triggered to execute and returns an empty result. This is unnecessary.
![image](https://user-images.githubusercontent.com/46485123/146725110-27496536-f1f7-4fac-ae2c-0f6f81159bba.png)
This is caused by an `Expand(EmptyLocalRelation())` that is not propagated. This PR adds support for propagating `Expand` over an empty `LocalRelation`.
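
A minimal sketch of the rewrite on a toy plan type, assuming a simplified algebra rather than Spark's optimizer classes. `EmptyRelation`, `Relation`, and `GroupedAggregate` are hypothetical, and the aggregate is assumed to have grouping keys, so an empty input yields an empty output.

```scala
object PropagateEmptySketch {
  sealed trait Plan
  case object EmptyRelation extends Plan
  final case class Relation(rows: Seq[Int]) extends Plan
  final case class Expand(child: Plan) extends Plan
  // Assumed to group by at least one key, so empty input gives empty output.
  final case class GroupedAggregate(child: Plan) extends Plan

  // Bottom-up rewrite: an operator over an empty child becomes empty itself.
  def propagateEmpty(plan: Plan): Plan = plan match {
    case Expand(child) =>
      propagateEmpty(child) match {
        case EmptyRelation => EmptyRelation
        case c => Expand(c)
      }
    case GroupedAggregate(child) =>
      propagateEmpty(child) match {
        case EmptyRelation => EmptyRelation
        case c => GroupedAggregate(c)
      }
    case other => other
  }
}
```

Without the `Expand` case, the empty relation under `Expand` would never reach the aggregate, which is the situation described above.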

Avoid unnecessary execution.

No

Added UT

Closes #34954 from AngersZhuuuu/SPARK-37689.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36406][CORE] Avoid unnecessary file operations before delete a write failed file held by DiskBlockObjectWriter

We always perform a file truncate operation before deleting a write-failed file held by `DiskBlockObjectWriter`. A typical process is as follows:

```
if (!success) {
  // This code path only happens if an exception was thrown above before we set success;
  // close our stuff and let the exception be thrown further
  writer.revertPartialWritesAndClose()
  if (file.exists()) {
    if (!file.delete()) {
      logWarning(s"Error deleting ${file}")
    }
  }
}
```
The `revertPartialWritesAndClose` method reverts writes that haven't been committed yet, but that does not seem necessary in this scenario, since the file is deleted right afterwards.

So this PR adds a new method to `DiskBlockObjectWriter` named `closeAndDelete()`, which just reverts the write metrics and deletes the write-failed file.
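
The contrast can be sketched with a toy writer over a temporary file. `ToyWriter` and its `truncations` counter are hypothetical, the method bodies are assumptions about the general shape rather than the real `DiskBlockObjectWriter` code, and metrics handling is left out.

```scala
import java.io.{File, FileOutputStream}

object CloseAndDeleteSketch {
  final class ToyWriter(val file: File) {
    private val out = new FileOutputStream(file)
    private val committedPosition = 0L
    var truncations = 0 // instrumentation for the sketch only

    def write(bytes: Array[Byte]): Unit = out.write(bytes)

    // Old path: truncate back to the last committed position, then close.
    // The truncate is wasted work if the caller deletes the file next.
    def revertPartialWritesAndClose(): Unit = {
      out.getChannel.truncate(committedPosition)
      truncations += 1
      out.close()
    }

    // New path (assumed shape): skip the truncate, just close and delete.
    def closeAndDelete(): Unit = {
      out.close()
      if (file.exists() && !file.delete()) {
        System.err.println(s"Error deleting $file")
      }
    }
  }
}
```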

Avoid unnecessary file operations.

Adds a new method to `DiskBlockObjectWriter` named `closeAndDelete()`.

Passes Jenkins or GitHub Actions.

Closes #33628 from LuciferYang/SPARK-36406.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>

* [SPARK-37462][CORE] Avoid unnecessary calculating the number of outstanding fetch requests and RPCS

Avoid unnecessarily calculating the number of outstanding fetch requests and RPCs.

It is unnecessary to calculate the number of outstanding fetch requests and RPCs when the `IdleStateEvent` is not IDLE or the last request has not timed out.
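
The reordering amounts to letting the cheap conditions short-circuit the expensive count. A small sketch under assumptions: `outstandingRequests`, `shouldReportEager`, and `shouldReportLazy` are hypothetical names, and the real handler logic is not shown in this description.

```scala
object IdleCheckSketch {
  // Instrumentation for the sketch: how often the count was computed.
  var counted = 0

  // Hypothetical stand-in for the relatively costly outstanding-request count.
  def outstandingRequests(): Int = {
    counted += 1
    3
  }

  // Before: the count is computed first, whether or not it is needed.
  def shouldReportEager(isIdleEvent: Boolean, timedOut: Boolean): Boolean = {
    val hasOutstanding = outstandingRequests() > 0
    isIdleEvent && timedOut && hasOutstanding
  }

  // After: cheap checks run first; && short-circuits before the count.
  def shouldReportLazy(isIdleEvent: Boolean, timedOut: Boolean): Boolean =
    isIdleEvent && timedOut && outstandingRequests() > 0
}
```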

No.

Existing unit tests.

Closes #34711 from weixiuli/SPARK-37462.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

Co-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Cheng Su <chengsu@fb.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Co-authored-by: Shixiong Zhu <shixiong@databricks.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: weixiuli <weixiuli@jd.com>