-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17946][PYSPARK] Python crossJoin API similar to Scala #15493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #66981 has finished for PR 15493 at commit
|
Test build #66982 has finished for PR 15493 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -627,6 +627,25 @@ def alias(self, alias): | |||
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx) | |||
|
|||
@ignore_unicode_prefix | |||
@since(2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be 2.1?
@@ -627,6 +627,25 @@ def alias(self, alias): | |||
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx) | |||
|
|||
@ignore_unicode_prefix | |||
@since(2.1) | |||
def crossJoin(self, other): | |||
"""Returns the cartesian product with another :class:`DataFrame` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
LGTM pending Jenkins. |
Test build #66996 has finished for PR 15493 at commit
|
Merging in master. |
cc @felixcheung do we need some change for R? |
Test build #66998 has finished for PR 15493 at commit
|
@rxin In R, CrossJoin is the default when expr is empty (https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2304) |
The issue is that we want to prevent users from shooting themselves in the foot, i.e. we want to avoid accidental cross joins. The idea is unless the user explicitly says crossJoin, we should disallow crossjoins. |
That's a great point. Currently R is the same as Python in that when Problem is by default |
Why not just introduce a crossJoin function in R, similar to Python/Scala/Java? We don't want to change the default join type, because it is still valid to run an inner join by specifying a predicate later using the filter operator. |
cc @felixcheung We still need this. I'm going to create an upstream ticket. Can one of you take it? |
I will take that and add my note. |
Thanks! |
## What changes were proposed in this pull request? Add a crossJoin function to the DataFrame API similar to that in Scala. Joins with no condition (cartesian products) must be specified with the crossJoin API ## How was this patch tested? Added python tests to ensure that an AnalysisException if a cartesian product is specified without crossJoin(), and that cartesian products can execute if specified via crossJoin() (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Srinath Shankar <srinath@databricks.com> Closes apache#15493 from srinathshankar/crosspython.
## What changes were proposed in this pull request? Add a crossJoin function to the DataFrame API similar to that in Scala. Joins with no condition (cartesian products) must be specified with the crossJoin API ## How was this patch tested? Added python tests to ensure that an AnalysisException if a cartesian product is specified without crossJoin(), and that cartesian products can execute if specified via crossJoin() (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Srinath Shankar <srinath@databricks.com> Closes apache#15493 from srinathshankar/crosspython.
What changes were proposed in this pull request?
Add a crossJoin function to the DataFrame API similar to that in Scala. Joins with no condition (cartesian products) must be specified with the crossJoin API
How was this patch tested?
Added python tests to ensure that an AnalysisException if a cartesian product is specified without crossJoin(), and that cartesian products can execute if specified via crossJoin()
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.