-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-8300] DataFrame hint for broadcast join. #6751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rxin - This is great. I'm a fan of the alternative syntax we chatted about: |
|
Test build #34641 has finished for PR 6751 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just have this be StatisticsHint and override the statistics? I'm afraid that we are going to forget to add cases in the future as we broadcast more type of joins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking about this more and actually maybe what we want to do is have this specific node, and a pattern that recognizes canBroadcast. This pattern can check either for a small enough size or this hint and we can use that anywhere we are planning a broadcast operator. The reasoning is that messing with statistics could have other consequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Add specific node is also doable, probably the node can be named like Hint in LogicalPlan, and could be more than broadcast, like uniq_key etc.
Any idea?
|
Regarding |
|
We should add isNotNull too. The isNotNull was from SchemaRDD. |
|
I'm confused now: both |
|
I meant adding isNotNull as a function, rather than a member method of Column. Basically most of those Column.function were inherited from SchemaRDD time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marmbrus is this what you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
|
Test build #35520 has finished for PR 6751 at commit
|
|
Merging to master! |
|
How can this hint be used with Spark-SQL ? Below example does not seem to work.
|
|
It can only be used from the dataframe API.
|
|
Using broadcast(...) results in "java.lang.AssertionError: assertion failed: No plan for BroadcastHint" as in https://issues.apache.org/jira/browse/SPARK-12275 so this solution is not working most of the time |
|
Isn't the hint available in SQL? |
|
Its available from spark 2.2.0.
On Tuesday, October 10, 2017, 1:46:27 PM PDT, Reynold Xin <notifications@github.com> wrote:
Isn't the hint available in SQL?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
Is there an example?
|
|
Let me know if this does not help
[SPARK-16475] Broadcast Hint for SQL Queries - ASF JIRA
|
|
| |
[SPARK-16475] Broadcast Hint for SQL Queries - ASF JIRA
|
|
|
On Tuesday, October 10, 2017, 6:25:57 PM PDT, fjh100456 <notifications@github.com> wrote:
Is there an example?
I use broadcast like the following, but it perform an error.Would you be so kind as to show me an example?
spark-sql> select a.* from tableA a left outer join broadcast(tableB) b on a.a=b.a;
17/10/11 09:06:40 INFO HiveMetaStore: 0: get_table : db=default tbl=tablea
17/10/11 09:06:40 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=tablea
Error in query: cannot resolve 'tableB' given input columns: []; line 1 pos 51;
'Project [ArrayBuffer(a).*]
+- 'Join LeftOuter, ('a.a = 'b.a)
:- SubqueryAlias a
: +- SubqueryAlias tablea
: +- HiveTableRelation default.tablea, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#94, b#95, c#96]
+- 'SubqueryAlias b
+- 'UnresolvedTableValuedFunction broadcast, ['tableB]
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
With |
Users can now do
left.join(broadcast(right), "joinKey")to give the query planner a hint that "right" DataFrame is small and should be broadcasted.