@rxin
Contributor

@rxin rxin commented Jun 10, 2015

Users can now do

left.join(broadcast(right), "joinKey")

to give the query planner a hint that the "right" DataFrame is small and should be broadcast.
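
A minimal sketch of how the hint might be used end to end. It assumes Spark 2.0 or later (a `SparkSession` rather than the `SQLContext` of 2015); the sample data, column names, and the object name `BroadcastJoinExample` are made up for illustration. Automatic broadcasting is switched off via `spark.sql.autoBroadcastJoinThreshold = -1` so that any broadcast join in the plan comes from the hint alone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical example object; names and data are illustrative only.
object BroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-hint")
      // Disable size-based automatic broadcasting so only the hint applies.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("joinKey", "leftValue")
    val right = Seq((1, "x"), (2, "y")).toDF("joinKey", "rightValue")

    // Mark `right` as small enough to broadcast, then join on the shared column.
    val joined = left.join(broadcast(right), "joinKey")

    joined.explain() // prints the physical plan so the join strategy can be inspected
    joined.show()

    spark.stop()
  }
}
```

Removing the `broadcast(...)` wrapper and comparing the two `explain()` outputs is one way to check whether the hint changed the chosen join strategy.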

@patmcdonough

@rxin - This is great. I'm a fan of the alternative syntax we chatted about:

left.join(right.broadcast, "joinKey")

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34641 has finished for PR 6751 at commit 3ad8de6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BroadcastHint(child: LogicalPlan) extends UnaryNode

Contributor

Could we just have this be StatisticsHint and override the statistics? I'm afraid that we are going to forget to add cases in the future as we broadcast more types of joins.

Contributor Author

that sounds good.

Contributor

I'm thinking about this more and actually maybe what we want to do is have this specific node, and a pattern that recognizes canBroadcast. This pattern can check either for a small enough size or this hint and we can use that anywhere we are planning a broadcast operator. The reasoning is that messing with statistics could have other consequences.

Contributor

+1
Adding a specific node is also doable. The node could probably be named something like Hint in LogicalPlan, and it could cover more than broadcast, such as uniq_key.

Any thoughts?
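
The design discussed above (a dedicated hint node plus a pattern that recognizes when a plan can be broadcast) could be sketched roughly as follows. This is a toy model, not Catalyst code: `LogicalPlan`, `Relation`, the threshold value, and `planJoin` are simplified stand-ins invented for illustration, and only the `BroadcastHint(child)` shape comes from the PR itself. The point is that the extractor accepts either an explicit hint or a small enough size, so statistics are never overridden.

```scala
// Toy model of the "hint node + CanBroadcast pattern" idea. All types here are
// hypothetical stand-ins, not Spark's real Catalyst classes.
object BroadcastPlanningSketch {
  sealed trait LogicalPlan { def sizeInBytes: Long }

  // A leaf relation with a known (estimated) size.
  case class Relation(name: String, sizeInBytes: Long) extends LogicalPlan

  // Hint node: marks its child as broadcastable without changing its statistics.
  case class BroadcastHint(child: LogicalPlan) extends LogicalPlan {
    def sizeInBytes: Long = child.sizeInBytes
  }

  // Assumed threshold, playing the role of spark.sql.autoBroadcastJoinThreshold.
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  // Extractor that matches when a plan is explicitly hinted or small enough.
  object CanBroadcast {
    def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
      case BroadcastHint(child)                             => Some(child)
      case p if p.sizeInBytes <= autoBroadcastJoinThreshold => Some(p)
      case _                                                => None
    }
  }

  // Any place that plans a broadcast operator can reuse the same pattern.
  // The returned strings are labels for illustration only.
  def planJoin(left: LogicalPlan, right: LogicalPlan): String = right match {
    case CanBroadcast(_) => "BroadcastHashJoin"
    case _               => "ShuffledHashJoin"
  }
}
```

Because every broadcast-capable operator would go through `CanBroadcast`, a new join type picks up the hint without anyone having to remember a separate case for it.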

@marmbrus
Contributor

Regarding broadcast(...) vs .broadcast, I don't really care too much. However, I think we need to come up with a convention. For example, I was recently confused that it's not isNotNull(column) but instead column.isNotNull.

@rxin
Contributor Author

rxin commented Jun 14, 2015

We should add isNotNull too.

The isNotNull was from SchemaRDD.

@marmbrus
Contributor

I'm confused now: both column.isNotNull and column.isNull work today. I was expecting them to be standalone functions rather than methods that Column defines. I'm mostly wondering how we decide what goes where.

@rxin
Contributor Author

rxin commented Jun 14, 2015

I meant adding isNotNull as a function, rather than a member method of Column.

Basically, most of those Column.function methods were inherited from the SchemaRDD days.

@rxin rxin force-pushed the broadcastjoin-hint branch from 3ad8de6 to 953eec2 Compare June 23, 2015 05:46
Contributor Author

@marmbrus is this what you had in mind?

Contributor

Yep!

@SparkQA
Copy link

SparkQA commented Jun 23, 2015

Test build #35520 has finished for PR 6751 at commit 953eec2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Merging to master!

@asfgit asfgit closed this in 6ceb169 Jun 23, 2015
@rxin rxin deleted the broadcastjoin-hint branch June 23, 2015 18:38
@sridharsubramanian62

How can this hint be used with Spark SQL?

The example below does not seem to work.

select a.col1,b.col2 from tableA a left outer join broadcast(tableB) on a.col1 = b.col1

@marmbrus
Contributor

marmbrus commented Oct 8, 2015

It can only be used from the dataframe API.

@RoiViber

Using broadcast(...) results in "java.lang.AssertionError: assertion failed: No plan for BroadcastHint", as in https://issues.apache.org/jira/browse/SPARK-12275, so this solution does not work most of the time.

@fjh100456
Contributor

@rxin @marmbrus
Is there now another way to broadcast a table with spark-sql, other than spark.sql.autoBroadcastJoinThreshold?
If not, would it be reasonable to broadcast a table through user configuration, such as spark.sql.autoBroadcastJoin.mytable=true to broadcast a table named mytable?

@rxin
Contributor Author

rxin commented Oct 10, 2017

Isn't the hint available in SQL?

@sridharsubramanian62

sridharsubramanian62 commented Oct 10, 2017 via email

@fjh100456
Contributor

Is there an example?
I use broadcast like the following, but it produces an error. Would you be so kind as to show me an example?

spark-sql> select a.* from tableA a left outer join broadcast(tableB) b on a.a=b.a;
17/10/11 09:06:40 INFO HiveMetaStore: 0: get_table : db=default tbl=tablea
17/10/11 09:06:40 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=tablea

Error in query: cannot resolve 'tableB' given input columns: []; line 1 pos 51;

'Project [ArrayBuffer(a).*]
+- 'Join LeftOuter, ('a.a = 'b.a)
   :- SubqueryAlias a
   :  +- SubqueryAlias tablea
   :     +- HiveTableRelation default.tablea, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#94, b#95, c#96]
   +- 'SubqueryAlias b
      +- 'UnresolvedTableValuedFunction broadcast, ['tableB]

@sridharsubramanian62

sridharsubramanian62 commented Oct 11, 2017 via email

@fjh100456
Contributor

With /*+ broadcast(table) */, it works well, thank you very much.
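
For reference, a sketch of where the comment-style hint goes in a full query, reusing the tableA/tableB join from the failing example above. It assumes a Spark version that supports SQL hints (2.2 or later, as far as I know); the temporary views, their contents, and the object name `BroadcastSqlHintExample` are invented for illustration. The hint sits directly after SELECT and names the table or alias to broadcast, instead of wrapping the table in a function call.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the views and data are illustrative only.
object BroadcastSqlHintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-sql-hint")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the tableA / tableB tables mentioned in the thread.
    Seq((1, "a1"), (2, "a2")).toDF("a", "b").createOrReplaceTempView("tableA")
    Seq((1, "b1")).toDF("a", "b").createOrReplaceTempView("tableB")

    // The hint names the alias `b`, i.e. the right side of the join.
    val result = spark.sql(
      """SELECT /*+ BROADCAST(b) */ a.*
        |FROM tableA a LEFT OUTER JOIN tableB b ON a.a = b.a""".stripMargin)

    result.explain() // prints the physical plan so the join strategy can be inspected
    result.show()

    spark.stop()
  }
}
```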
