[SPARK-15764][SQL] Replace N^2 loop in BindReferences #13505
Conversation
@@ -84,17 +84,27 @@ object BindReferences extends Logging {
      expression: A,
      input: Seq[Attribute],
I wonder whether we can push the map construction up one level so that we can amortize its cost across multiple bindReference calls.
Actually, yeah: in GenerateMutableProjection we use the same InputSchema for every expression.
I think we should add an overload which takes a sequence of expressions and binds all of their references. We should then replace the call sites in the various projection operators.
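Something along these lines, perhaps (a hypothetical sketch of the overload; the name `bindReferences` and its exact shape are assumed here, not quoted from the PR):

```scala
// Hypothetical batch-binding overload: all expressions share one
// AttributeSeq, so its exprId -> ordinal map is built at most once
// instead of once per bound expression.
def bindReferences[A <: Expression](
    expressions: Seq[A],
    input: AttributeSeq): Seq[A] = {
  expressions.map(bindReference(_, input))
}
```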
Test build #59975 has finished for PR 13505 at commit
@@ -296,7 +296,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
   /**
    * All the attributes that are used for this plan.
    */
-  lazy val allAttributes: Seq[Attribute] = children.flatMap(_.output)
+  lazy val allAttributes: AttributeSeq = children.flatMap(_.output)
@ericl and I found another layer of polynomial looping: in QueryPlan.cleanArgs we take every expression in the query plan and bind its references against allAttributes, which can be huge. If we turn this into an AttributeSeq once and build the map inside of that wrapper then we amortize that cost and remove this expensive loop.
We should probably construct the AttributeSeq outside of the loop in the various projection operators, too, although that doesn't appear to be as serious a bottleneck yet.
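As a rough sketch of that hoisting (the names `inputSchema` and `expressions` are illustrative stand-ins for a projection operator's inputs, not quoted from the diff):

```scala
// Before: each call triggers the implicit Seq[Attribute] -> AttributeSeq
// conversion, so the exprId -> ordinal map is rebuilt for every expression.
val boundPerCall = expressions.map(e => BindReferences.bindReference(e, inputSchema))

// After: convert once, outside the loop, so every binding shares one cached map.
val attrs: AttributeSeq = inputSchema
val boundShared = expressions.map(e => BindReferences.bindReference(e, attrs))
```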
Test build #59980 has finished for PR 13505 at commit
Here's a flame graph of bindReferences dominating the CPU for a 10k column query: [flame graph image]
Test build #59976 has finished for PR 13505 at commit
hm probably shouldn't happen in this pr but i'm wondering if it'd make sense to generalize AttributeSeq and use it everywhere, rather than Seq[Attribute].
private lazy val inputArr = attrs.toArray

private lazy val inputToOrdinal = {
  val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
Why is `* 2` necessary? I think the number of map entries is at most `attrs.size`, since `map.put()` is called at most `attrs.size` times. Is `attrs.size` equal to `inputArr.length`?
The goal was to avoid having to rehash the elements of the hash map once the number of inserted keys exceeded the default 0.75 load factor.
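Concretely, with `java.util.HashMap`'s documented default load factor of 0.75:

```scala
// HashMap resizes once size > capacity * 0.75, so holding n entries without
// a rehash needs capacity > n / 0.75 ≈ 1.34 * n. An initial capacity of
// 2 * n clears that bound with room to spare.
val n = inputArr.length
val map = new java.util.HashMap[ExprId, Int](n * 2) // n puts, zero rehashes
```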
It's probably clearer to use Guava's newHashMapWithExpectedSize instead: https://google.github.io/guava/releases/snapshot/api/docs/com/google/common/collect/Maps.html#newHashMapWithExpectedSize(int)
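For reference, that would look roughly like this (a sketch, assuming Guava on the classpath, which Spark already has):

```scala
import com.google.common.collect.Maps

// Guava derives a capacity that fits `expectedSize` entries without a
// resize, which states the sizing intent more directly than `* 2`.
val map = Maps.newHashMapWithExpectedSize[ExprId, Int](inputArr.length)
```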
+1 on withExpectedSize
@rxin, I think that it might make sense to use AttributeSeq more widely. Right now there's an implicit conversion so we can gradually and naively migrate APIs to accept AttributeSeq.
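That migration path relies on a wrapper-plus-implicit-conversion shape roughly like the following (a simplified sketch; only `attrs` and the member names above come from the diff, the rest is assumed):

```scala
// Thin wrapper over Seq[Attribute] that caches lookup state. Existing call
// sites keep compiling: the implicit conversion turns any Seq[Attribute]
// into an AttributeSeq at the boundary, so APIs can switch one at a time.
class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable {
  private lazy val inputArr = attrs.toArray
  // (cached inputToOrdinal map and its lookup method as shown in the diff)
}

object AttributeSeq {
  implicit def fromSeq(attrs: Seq[Attribute]): AttributeSeq =
    new AttributeSeq(attrs)
}

// Usage: this still compiles unchanged after a signature switches to
// AttributeSeq, because the Seq[Attribute] is converted implicitly:
// BindReferences.bindReference(expr, child.output)
```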
@@ -26,13 +26,6 @@ object AttributeMap {
   def apply[A](kvs: Seq[(Attribute, A)]): AttributeMap[A] = {
     new AttributeMap(kvs.map(kv => (kv._1.exprId, kv)).toMap)
   }
-
-  /** Given a schema, constructs an [[AttributeMap]] from [[Attribute]] to ordinal */
-  def byIndex(schema: Seq[Attribute]): AttributeMap[Int] = apply(schema.zipWithIndex)
This was vaguely related yet unused code that I stumbled across while looking for similar occurrences of this pattern, so I decided to remove it.
Alright, updated to address comments.
Test build #3066 has started for PR 13505 at commit
lgtm - I didn't look too closely though. Would be great for @ericl to look at this in detail.
Test build #59994 has finished for PR 13505 at commit
/**
 * Returns the index of first attribute with a matching expression id, or -1 if no match exists.
 */
def getOrdinalWithExprId(exprId: ExprId): Int = {
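The diff truncates at the opening brace; a plausible body, assuming the `inputToOrdinal` map shown earlier (note that in Scala, `java.util.HashMap#get` on a missing key unboxes `null` to `0`, so the explicit guard matters):

```scala
def getOrdinalWithExprId(exprId: ExprId): Int = {
  // Guard with containsKey: a bare get() on a missing key would unbox
  // null to 0, which is indistinguishable from a valid first ordinal.
  if (inputToOrdinal.containsKey(exprId)) inputToOrdinal.get(exprId) else -1
}
```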
Would indexOf be more clear?
I had this originally and then moved to this name in anticipation of a future change that would add more "get index with property" methods, but a lot of those methods aren't cacheable (e.g. semanticEquals), so I'll revert to my original name choice.
Lgtm with minor comments
LGTM
Test build #60011 has finished for PR 13505 at commit
retest this please
Jenkins, retest this please.
Test build #60015 has finished for PR 13505 at commit
Test build #60019 has finished for PR 13505 at commit
Test build #60029 has finished for PR 13505 at commit
Fixed the tests by making
BindReferences contains an N^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).

Instead of performing a linear scan, we can convert the input into an array and build a hash map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.

Perf. benchmarks to follow. /cc ericl

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13505 from JoshRosen/bind-references-improvement.

(cherry picked from commit 0b8d694)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
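To make the change concrete, an illustrative before/after of the hot path (simplified; the helper names here are not quoted verbatim from the diff):

```scala
// Before: linear scan per attribute reference => O(n^2) across a wide schema.
def ordinalBefore(input: Seq[Attribute], a: AttributeReference): Int =
  input.indexWhere(_.exprId == a.exprId)

// After: one cached hash map inside AttributeSeq makes each lookup O(1).
def ordinalAfter(input: AttributeSeq, a: AttributeReference): Int =
  input.getOrdinalWithExprId(a.exprId)
```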