
[SPARK-16406][SQL] Improve performance of LogicalPlan.resolve #14083


Closed
wants to merge 11 commits into apache:master from hvanhovell:SPARK-16406

Conversation

hvanhovell
Contributor

@hvanhovell commented Jul 6, 2016

What changes were proposed in this pull request?

LogicalPlan.resolve(...) uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes.

This PR adds an indexing structure to resolve(...) in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s):

``` scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

spark.time(sql(query))
```
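For intuition (illustrative code, not part of this patch): resolving m column references against a plan with n output attributes costs O(n * m) string comparisons with a linear scan, but only O(m) expected hash lookups once the attributes are grouped by their lowercased name, which is the essence of the indexing structure added here. A minimal, self-contained sketch of that difference:

``` scala
// Illustrative stand-in for Catalyst's Attribute; only the name matters for this sketch.
case class Attribute(name: String)

val attributes = (1 to 4000).map(i => Attribute(s"column$i"))
val references = (1 to 4000).map(i => s"column$i")

// Linear scan: every reference walks the whole attribute list -> O(n * m) comparisons.
val resolvedLinear = references.map(ref => attributes.find(_.name.equalsIgnoreCase(ref)))

// Indexed lookup: group once by lowercased name, then each reference is O(1) expected.
val byLowerCaseName = attributes.groupBy(_.name.toLowerCase)
val resolvedIndexed = references.map(ref => byLowerCaseName.get(ref.toLowerCase).map(_.head))
```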

How was this patch tested?

Existing tests.

@hvanhovell
Contributor Author

cc @dongjoon-hyun

@dongjoon-hyun
Member

Wow! And, thank you for pinging me. :)

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61882 has finished for PR 14083 at commit b4c8cdb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Oh, that's strange. It still fails to compile on my local laptop.

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61884 has finished for PR 14083 at commit a8a6a83.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

messed up the merge - lemme fix that

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61886 has finished for PR 14083 at commit 3e39720.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jul 7, 2016

The code in the description seems incomplete?

@hvanhovell
Contributor Author

@viirya you mean I forgot to add time(sql(query))?

@viirya
Member

viirya commented Jul 7, 2016

yea. :-)

} else {
None
/** Map to use for direct case insensitive attribute lookups. */
private val direct: Map[String, Seq[Attribute]] = {
Member


nit: do we need to use lazy val?

Contributor Author


You have a point: it is the secondary code path, so it is less likely to be used. I'll take a look at it on my next pass.

@dongjoon-hyun
Member

I've got another idea (orthogonal to this). Thank you, @hvanhovell.


/**
* Helper class for (LogicalPlan) attribute resolution. This class indexes attributes by their
* case-in-sensitive name, and checks potential candidates using the given Resolver. Both qualified
Contributor


case-insensitive? When you say "the given Resolver", what do you mean by "Resolver"? Can we link to the type?

Contributor Author


The resolve method takes a Resolver as its parameter. This allows us to use either case-sensitive or case-insensitive attribute resolution, depending on the Resolver that is passed in. The names of both classes are confusing, and I might rename the AttributeResolver class to AttributeIndex or something like that...

The AttributeResolver creates two indexes based on the lower case (qualified) attribute name; we do an initial lookup based on the lower case name, and then use the Resolver for the actual attribute selection. This allows us to do fast(er) and correct lookups.
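To make that concrete, here is a rough sketch of such a two-index lookup. The names (AttributeIndex, direct, qualified) and signatures are illustrative, not the merged code, and Catalyst's real Attribute and Resolver types carry more information:

``` scala
// Illustrative sketch only.
case class Attribute(name: String, qualifier: Option[String])

class AttributeIndex(attributes: Seq[Attribute]) {
  // Index 1: unqualified lookups ("col"), keyed by lowercased attribute name.
  private val direct: Map[String, Seq[Attribute]] =
    attributes.groupBy(_.name.toLowerCase)

  // Index 2: qualified lookups ("t.col"), keyed by lowercased qualifier.
  private val qualified: Map[String, Seq[Attribute]] =
    attributes.filter(_.qualifier.isDefined).groupBy(_.qualifier.get.toLowerCase)

  /** Narrow candidates via the index, then confirm each one with the passed-in resolver
   *  (a (String, String) => Boolean, e.g. case-sensitive or case-insensitive equality). */
  def resolve(nameParts: Seq[String], resolver: (String, String) => Boolean): Seq[Attribute] =
    nameParts match {
      case Seq(name) =>
        direct.getOrElse(name.toLowerCase, Nil).filter(a => resolver(a.name, name))
      case Seq(qualifier, name) =>
        qualified.getOrElse(qualifier.toLowerCase, Nil)
          .filter(a => a.qualifier.exists(resolver(_, qualifier)) && resolver(a.name, name))
      case _ => Nil
    }
}
```

A call such as `new AttributeIndex(attributes).resolve(Seq("t", "col"), _.equalsIgnoreCase(_))` first narrows to the few attributes whose lowercased qualifier matches, and the resolver then keeps the result identical to what a full linear scan would have returned.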

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61923 has finished for PR 14083 at commit c75ae8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2016

Test build #61952 has finished for PR 14083 at commit a5d1a4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Finally! Congrats!

@JoshRosen
Contributor

Just came across the old PR. Because this is similar in spirit to #13505, I wonder whether it might make sense to put some of this logic into AttributeSeq:

https://github.com/apache/spark/blame/99386fe3989f758844de14b2c28eccfdf8163221/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91

@hvanhovell
Contributor Author

@JoshRosen I have moved the implementation into AttributeSeq.

@hvanhovell changed the title from [SPARK-16406][SQL] Improve performance of LogicalPlan.resolve to [WIP] [SPARK-16406][SQL] Improve performance of LogicalPlan.resolve on Sep 11, 2016
@SparkQA

SparkQA commented Sep 11, 2016

Test build #65231 has finished for PR 14083 at commit a1e5312.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @hvanhovell

@gatorsmile
Member

Close it? @hvanhovell

@hvanhovell
Contributor Author

@cloud-fan can you take a look?

@SparkQA

SparkQA commented Apr 20, 2018

Test build #89636 has finished for PR 14083 at commit 85e0744.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 20, 2018

Test build #89637 has finished for PR 14083 at commit eb6dd80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Then this will add ExtractValue("c", ExtractValue("b", a)), and alias the final
// expression as "c".
val fieldExprs = nestedFields.foldLeft(a: Expression) { (e, name) =>
ExtractValue(e, Literal(name), resolver)
Contributor


ExtractValue has the same perf problem, but this can be fixed in a follow-up.
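For context, the analogous cost in nested-field extraction is the per-level linear scan over a struct's fields. A generic sketch (not Spark's actual ExtractValue code) of that scan versus a precomputed name-to-ordinal index:

``` scala
// Generic illustration only; Spark's StructType/ExtractValue are more involved.
case class StructField(name: String)
case class StructType(fields: Array[StructField]) {
  // Linear scan: each nested-field lookup walks the field array -> O(width) per level.
  def fieldIndexLinear(name: String): Int =
    fields.indexWhere(_.name.equalsIgnoreCase(name))

  // Precomputed index: built once, then each lookup is O(1) expected.
  private lazy val ordinalByName: Map[String, Int] =
    fields.iterator.map(_.name.toLowerCase).zipWithIndex.toMap

  def fieldIndexIndexed(name: String): Int =
    ordinalByName.getOrElse(name.toLowerCase, -1)
}

val struct = StructType(Array(StructField("a"), StructField("b"), StructField("c")))
struct.fieldIndexIndexed("b")  // 1
```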

Contributor


Is there an issue for the follow-up?

Contributor Author


@heuermh I have not filed an issue for this. Do you want to work on it?


case ambiguousReferences =>
// More than one match.
val referenceNames = ambiguousReferences.mkString(", ")
Contributor


To pass the test, we should follow the previous code: `ambiguousReferences.map(_._1.qualifiedName).mkString(", ")`.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 5, 2018

Test build #90250 has finished for PR 14083 at commit ac4abcd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2018

Test build #90252 has finished for PR 14083 at commit cbc164d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

@hvanhovell, could you update the benchmark result in the PR description, too?

@hvanhovell
Contributor Author

hvanhovell commented May 7, 2018

@dongjoon-hyun done. Still a 4x improvement.

@hvanhovell
Contributor Author

Merging to master. Thanks for all the reviews!

@asfgit closed this in 4e861db on May 7, 2018
@dongjoon-hyun
Member

Thank you for starting and completing this! :D

markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 8, 2018
`LogicalPlan.resolve(...)` uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes.

This PR adds an indexing structure to `resolve(...)` in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s):

``` scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

spark.time(sql(query))
```

Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#14083 from hvanhovell/SPARK-16406.
csd-jenkins pushed a commit to alteryx/spark that referenced this pull request May 15, 2018
* [SPARK-16406][SQL] Improve performance of LogicalPlan.resolve

`LogicalPlan.resolve(...)` uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes.

This PR adds an indexing structure to `resolve(...)` in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s):

``` scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

spark.time(sql(query))
```

Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#14083 from hvanhovell/SPARK-16406.

* Keep old-style messages for AnalysisException with ambiguous references
csd-jenkins pushed a commit to alteryx/spark that referenced this pull request May 15, 2018
* [SPARK-23816][CORE] Killed tasks should ignore FetchFailures.

SPARK-19276 ensured that FetchFailures do not get swallowed by other
layers of exception handling, but it also meant that a killed task could
look like a fetch failure.  This is particularly a problem with
speculative execution, where we expect to kill tasks as they are reading
shuffle data.  The fix is to ensure that we always check for killed
tasks first.

Added a new unit test which fails before the fix, ran it 1k times to
check for flakiness.  Full suite of tests on jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#20987 from squito/SPARK-23816.

(cherry picked from commit 10f45bb)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

* [SPARK-24007][SQL] EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.

`EqualNullSafe` for `FloatType` and `DoubleType` might generate a wrong result by codegen.

```scala
scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]

scala> df.show()
+----+----+
|  _1|  _2|
+----+----+
|-1.0|null|
|null|-1.0|
+----+----+

scala> df.filter("_1 <=> _2").show()
+----+----+
|  _1|  _2|
+----+----+
|-1.0|null|
|null|-1.0|
+----+----+
```

The result should be empty, but it still contains two rows.

Added a test.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#21094 from ueshin/issues/SPARK-24007/equalnullsafe.

(cherry picked from commit f09a9e9)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

* [SPARK-23963][SQL] Properly handle large number of columns in query on text-based Hive table

## What changes were proposed in this pull request?

TableReader would get disproportionately slower as the number of columns in the query increased.

I fixed the way TableReader was looking up metadata for each column in the row. Previously, it had been looking up this data in linked lists, accessing each linked list by an index (column number). Now it looks up this data in arrays, where indexing by column number works better.
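As a generic illustration of the data-structure change described above (not TableReader's actual code): positional access into a linked list costs O(position) per lookup, while an array gives constant-time indexing, so per-row, per-column metadata lookups go from quadratic to linear overall.

``` scala
// List#apply walks i nodes to reach position i; Array#apply is constant time.
val columnMetaList: List[String]   = List.tabulate(4000)(i => s"meta$i")
val columnMetaArray: Array[String] = columnMetaList.toArray

val viaList  = (0 until 4000).map(i => columnMetaList(i))   // O(n^2) work in total
val viaArray = (0 until 4000).map(i => columnMetaArray(i))  // O(n) work in total
```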

## How was this patch tested?

Manual testing
All sbt unit tests
python sql tests

Author: Bruce Robbins <bersprockets@gmail.com>

Closes apache#21043 from bersprockets/tabreadfix.

* [MINOR][DOCS] Fix comments of SQLExecution#withExecutionId

## What changes were proposed in this pull request?
Fix comment. Change `BroadcastHashJoin.broadcastFuture` to `BroadcastExchangeExec.relationFuture`: https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L66

## How was this patch tested?
N/A

Author: seancxmao <seancxmao@gmail.com>

Closes apache#21113 from seancxmao/SPARK-13136.

(cherry picked from commit c303b1b)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

* [SPARK-23941][MESOS] Mesos task failed on specific spark app name

## What changes were proposed in this pull request?
Shell-escaped the name passed to spark-submit and changed how conf attributes are shell-escaped.

## How was this patch tested?
This has been tested manually with Hive-on-Spark on Mesos, and with the use case described in the issue: the SparkPi application with a custom name that contains illegal shell characters.

With this PR, hive-on-spark on mesos works like a charm with hive 3.0.0-SNAPSHOT.

I state that this contribution is my original work and that I license the work to the project under the project’s open source license

Author: Bounkong Khamphousone <bounkong.khamphousone@ebiznext.com>

Closes apache#21014 from tiboun/fix/SPARK-23941.

(cherry picked from commit 6782359)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

* [SPARK-23433][CORE] Late zombie task completions update all tasksets

Fetch failures lead to multiple tasksets which are active for a given
stage.  While there is only one "active" version of the taskset, the
earlier attempts can still have running tasks, which can complete
successfully.  So a task completion needs to update every taskset
so that it knows the partition is completed.  That way the final active
taskset does not try to submit another task for the same partition,
and so that it knows when it is completed and when it should be
marked as a "zombie".

Added a regression test.

Author: Imran Rashid <irashid@cloudera.com>

Closes apache#21131 from squito/SPARK-23433.

(cherry picked from commit 94641fe)
Signed-off-by: Imran Rashid <irashid@cloudera.com>

* [SPARK-23489][SQL][TEST][BRANCH-2.2] HiveExternalCatalogVersionsSuite should verify the downloaded file

## What changes were proposed in this pull request?

This is a backport of apache#21210 because `branch-2.2` also faces the same failures.

Although [SPARK-22654](https://issues.apache.org/jira/browse/SPARK-22654) made `HiveExternalCatalogVersionsSuite` download from Apache mirrors three times, it has been flaky because it didn't verify the downloaded file. Some Apache mirrors terminate the download abnormally, and the *corrupted* file produces the following errors.

```
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
22:46:32.700 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, thread names: Keep-Alive-Timer =====

*** RUN ABORTED ***
  java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.0"): error=2, No such file or directory
```

This has been reported weirdly in two ways. For example, the above case is reported as Case 2 `no failures`.

- Case 1. [Test Result (1 failure / +1)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389/)
- Case 2. [Test Result (no failures)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/)

This PR aims to make `HiveExternalCatalogVersionsSuite` more robust by verifying the downloaded `tgz` file by extracting and checking the existence of `bin/spark-submit`. If it turns out that the file is empty or corrupted, `HiveExternalCatalogVersionsSuite` will do retry logic like the download failure.

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#21232 from dongjoon-hyun/SPARK-23489-2.

* [SPARK-23697][CORE] LegacyAccumulatorWrapper should define isZero correctly

## What changes were proposed in this pull request?

It's possible that Accumulators of Spark 1.x may no longer work with Spark 2.x. This is because `LegacyAccumulatorWrapper.isZero` may return the wrong answer if `AccumulableParam` doesn't define equals/hashCode.

This PR fixes this by using reference equality check in `LegacyAccumulatorWrapper.isZero`.

## How was this patch tested?

a new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21229 from cloud-fan/accumulator.

(cherry picked from commit 4d5de4d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (py4j/py4j#272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#18546 from dongjoon-hyun/SPARK-21278.

(cherry picked from commit c8d0aba)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

* [SPARK-16406][SQL] Improve performance of LogicalPlan.resolve

`LogicalPlan.resolve(...)` uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes.

This PR adds an indexing structure to `resolve(...)` in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s):

``` scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

spark.time(sql(query))
```

Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#14083 from hvanhovell/SPARK-16406.

* [PYSPARK] Update py4j to version 0.10.7.

(cherry picked from commit cc613b5)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 323dc3a)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

* [SPARKR] Match pyspark features in SparkR communication protocol.

(cherry picked from commit 628c7b5)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 16cd9ac)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

* Keep old-style messages for AnalysisException with ambiguous references
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 18, 2018
## What changes were proposed in this pull request?

`LogicalPlan.resolve(...)` uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes.

This PR adds an indexing structure to `resolve(...)` in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s):

``` scala
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

spark.time(sql(query))
```
## How was this patch tested?

Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#14083 from hvanhovell/SPARK-16406.