
[SPARK-15764][SQL] Replace N^2 loop in BindReferences #13505


Closed
JoshRosen wants to merge 14 commits into master from JoshRosen/bind-references-improvement

Conversation

@JoshRosen (Contributor)

BindReferences contains an n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.

Perf. benchmarks to follow. /cc @ericl
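For illustration, a minimal sketch of the before/after (simplified: error handling is omitted, and `bindLinear`/`bindHashed` are illustrative names, not the PR's actual API):

```scala
import org.apache.spark.sql.catalyst.expressions._

// Before: each AttributeReference in the expression triggers a linear scan
// of `input`, so binding a projection over an n-column schema is O(n^2).
def bindLinear[A <: Expression](expression: A, input: Seq[Attribute]): A = {
  expression.transform {
    case a: AttributeReference =>
      val ordinal = input.indexWhere(_.exprId == a.exprId)          // O(n) scan
      BoundReference(ordinal, a.dataType, input(ordinal).nullable)  // O(n) again if input is a List
  }.asInstanceOf[A]
}

// After: one O(n) pass builds an ExprId -> ordinal map; each lookup is O(1),
// and the array makes input(ordinal).nullable constant-time as well.
def bindHashed[A <: Expression](expression: A, input: Seq[Attribute]): A = {
  val inputArr = input.toArray
  // Reversed so that, for a duplicated expression id, the first ordinal wins.
  val ordinalOf: Map[ExprId, Int] =
    inputArr.zipWithIndex.reverseIterator.map { case (a, i) => a.exprId -> i }.toMap
  expression.transform {
    case a: AttributeReference =>
      val ordinal = ordinalOf(a.exprId)
      BoundReference(ordinal, a.dataType, inputArr(ordinal).nullable)
  }.asInstanceOf[A]
}
```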

@@ -84,17 +84,27 @@ object BindReferences extends Logging {
expression: A,
input: Seq[Attribute],
JoshRosen (Contributor Author)

I wonder whether we can push the map construction up one level so that we can amortize its cost across multiple `bindReference` calls.

JoshRosen (Contributor Author)

Actually, yeah: in GenerateMutableProjection we use the same InputSchema for every expression.

JoshRosen (Contributor Author)

I think we should add an overload which takes a sequence of expressions and binds all of their references. We should then replace the call sites in the various projection operators.
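A sketch of what such an overload might look like (hypothetical signature, assuming `bindReference` accepts an `AttributeSeq` as in this PR's diff):

```scala
import org.apache.spark.sql.catalyst.expressions._

object BindReferencesExt {
  // Hypothetical overload: bind a whole sequence of expressions against one
  // schema, so the ExprId -> ordinal map is built once rather than per call.
  def bindReferences[A <: Expression](expressions: Seq[A], input: AttributeSeq): Seq[A] =
    expressions.map(e => BindReferences.bindReference(e, input))
}
```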

@SparkQA commented Jun 4, 2016

Test build #59975 has finished for PR 13505 at commit 6216e94.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
/**
* All the attributes that are used for this plan.
*/
- lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output)
+ lazy val allAttributes: AttributeSeq = children.flatMap(_.output)
JoshRosen (Contributor Author)

@ericl and I found another layer of polynomial looping: in `QueryPlan.cleanArgs` we take every expression in the query plan and bind its references against `allAttributes`, which can be huge. If we turn this into an `AttributeSeq` once and build the map inside of that wrapper, then we amortize that cost and remove this expensive loop.
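A sketch of that call-site pattern (`QueryPlanSketch` and `cleanArgsSketch` are illustrative stand-ins, not the PR's exact code):

```scala
import org.apache.spark.sql.catalyst.expressions._

abstract class QueryPlanSketch {
  def children: Seq[QueryPlanSketch]
  def output: Seq[Attribute]
  def expressions: Seq[Expression]

  // One wrapper, built once: the ExprId -> ordinal map inside AttributeSeq is
  // materialized lazily on first use and shared by every binding below.
  lazy val allAttributes: AttributeSeq = children.flatMap(_.output)

  // Each bindReference call now does O(1) lookups instead of rescanning a
  // potentially huge Seq[Attribute].
  def cleanArgsSketch: Seq[Expression] =
    expressions.map(e => BindReferences.bindReference(e, allAttributes))
}
```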

JoshRosen (Contributor Author)

We should probably construct the `AttributeSeq` outside of the loop in the various projection operators, too, although that doesn't appear to be as serious a bottleneck yet.

@JoshRosen (Contributor Author)

@rxin, @ericl has some new benchmarks which operate on even wider schemas and which uncovered this bottleneck. Adding the caching of the map here resulted in a huge scalability improvement. Maybe @ericl can chime in with some flame graph charts here.

@SparkQA commented Jun 4, 2016

Test build #59980 has finished for PR 13505 at commit 0b412b0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class AttributeSeq(val attrs: Seq[Attribute])

@ericl (Contributor) commented Jun 4, 2016

Here's a flame graph of bindReferences dominating the CPU for a 10k column query:
[flame graph image: `indexWhere` dominating the profile; original SVG attached to the PR]

@SparkQA commented Jun 4, 2016

Test build #59976 has finished for PR 13505 at commit 38e8a99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jun 4, 2016

hm probably shouldn't happen in this pr but i'm wondering if it'd make sense to generalize `AttributeSeq` and use it everywhere, rather than `Seq[Attribute]`.

private lazy val inputArr = attrs.toArray

private lazy val inputToOrdinal = {
val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
Member

Why is `* 2` necessary? I think the map holds at most `attrs.size` entries, since `map.put()` is called at most `attrs.size` times. Is `attrs.size` equal to `inputArr.length`?

JoshRosen (Contributor Author)

The goal was to avoid having to rehash the elements of the hash map once the number of inserted keys exceeded the default 0.75 load factor.
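(For context, the arithmetic: `java.util.HashMap` rehashes once its size exceeds capacity × load factor, so a capacity of `2 * n` with the default 0.75 factor gives a resize threshold of `1.5 * n`, which `n` insertions never reach.)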


Contributor

+1 on `withExpectedSize`
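Presumably this refers to Guava's `Maps.newHashMapWithExpectedSize`, which hides the same sizing arithmetic. A sketch of the map construction using it (illustrative, not necessarily the PR's final code; this would live inside the wrapper class next to `inputArr`):

```scala
import com.google.common.collect.Maps
import org.apache.spark.sql.catalyst.expressions.ExprId

private lazy val inputToOrdinal = {
  // Guava sizes the table so that `inputArr.length` insertions never rehash,
  // replacing the hand-rolled `* 2` over-allocation.
  val map = Maps.newHashMapWithExpectedSize[ExprId, Int](inputArr.length)
  var i = 0
  while (i < inputArr.length) {
    // Keep the first ordinal when an expression id appears more than once.
    if (!map.containsKey(inputArr(i).exprId)) map.put(inputArr(i).exprId, i)
    i += 1
  }
  map
}
```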

@JoshRosen (Contributor Author)

@rxin, I think that it might make sense to use `AttributeSeq` more widely. Right now there's an implicit conversion, so we can gradually and naively migrate APIs to accept `AttributeSeq`.
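A self-contained toy sketch of the migration mechanics (`Attr`, `AttrSeq`, and `ordinalIn` stand in for Catalyst's `Attribute`, the real `AttributeSeq`, and a migrated API):

```scala
object MigrationSketch {
  case class Attr(name: String, exprId: Long)

  // The implicit class doubles as an implicit conversion Seq[Attr] => AttrSeq,
  // so callers holding a plain Seq keep compiling while APIs migrate.
  implicit class AttrSeq(val attrs: Seq[Attr]) extends Serializable {
    private lazy val ordinalOf: Map[Long, Int] =
      attrs.zipWithIndex.reverseIterator.map { case (a, i) => a.exprId -> i }.toMap
    def indexOf(exprId: Long): Int = ordinalOf.getOrElse(exprId, -1)
  }

  // An API already migrated to accept the wrapper type...
  def ordinalIn(input: AttrSeq, exprId: Long): Int = input.indexOf(exprId)

  def main(args: Array[String]): Unit = {
    // ...called with a plain Seq: the compiler inserts the wrapper automatically.
    val schema = Seq(Attr("a", 1L), Attr("b", 2L))
    println(ordinalIn(schema, 2L)) // prints 1
  }
}
```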

@@ -26,13 +26,6 @@ object AttributeMap {
def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = {
new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
}

/** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to ordinal */
def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = apply(schema.zipWithIndex)
JoshRosen (Contributor Author)

This was vaguely related yet unused code that I stumbled across while looking for similar occurrences of this pattern, so I decided to remove it.

@JoshRosen (Contributor Author)

Alright, updated to address comments.

@SparkQA commented Jun 5, 2016

Test build #3066 has started for PR 13505 at commit 4efd3ee.

@rxin (Contributor) commented Jun 5, 2016

lgtm - I didn't look too closely though. Would be great for @ericl to look at this in detail.

@SparkQA commented Jun 5, 2016

Test build #59994 has finished for PR 13505 at commit 4efd3ee.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Returns the index of first attribute with a matching expression id, or -1 if no match exists.
*/
def getOrdinalWithExprId(exprId: ExprId): Int = {
Contributor

Would `indexOf` be more clear?

JoshRosen (Contributor Author)

I had this originally and then moved to this name in anticipation of a future change which would add more "get index with property" methods. But a lot of those methods aren't cacheable (e.g. `semanticEquals`), so I'll revert this back to my first name choice.

@ericl (Contributor) commented Jun 5, 2016

Lgtm with minor comments

@cloud-fan (Contributor)

LGTM

@SparkQA commented Jun 5, 2016

Test build #60011 has finished for PR 13505 at commit 5504b6c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@JoshRosen (Contributor Author)

Jenkins, retest this please.

@SparkQA commented Jun 6, 2016

Test build #60015 has finished for PR 13505 at commit 5504b6c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 6, 2016

Test build #60019 has finished for PR 13505 at commit 5504b6c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 6, 2016

Test build #60029 has finished for PR 13505 at commit 5e9c258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable

@JoshRosen (Contributor Author)

Fixed the tests by making `AttributeSeq` serializable. I'm going to merge this into master and branch-2.0.
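SparkQA's report above confirms the `extends Serializable` part of the fix; one plausible shape for the caching (the `@transient` detail is my assumption, not confirmed in this thread):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, ExprId}

// In Spark this lives in a package object, since implicit classes cannot be
// top-level. @transient keeps the caches out of the serialized task; they are
// rebuilt lazily on first use after deserialization on the executor.
implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable {
  @transient private lazy val inputArr: Array[Attribute] = attrs.toArray
  @transient private lazy val inputToOrdinal: Map[ExprId, Int] =
    attrs.zipWithIndex.reverseIterator.map { case (a, i) => a.exprId -> i }.toMap
}
```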

asfgit pushed a commit that referenced this pull request Jun 6, 2016
BindReferences contains an n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13505 from JoshRosen/bind-references-improvement.

(cherry picked from commit 0b8d694)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit closed this in 0b8d694 on Jun 6, 2016
JoshRosen deleted the bind-references-improvement branch on June 6, 2016 at 18:56