[SPARK-17946][PYSPARK] Python crossJoin API similar to Scala #15493


Closed

srinathshankar wants to merge 4 commits

Conversation

srinathshankar (Contributor)

What changes were proposed in this pull request?

Add a crossJoin function to the DataFrame API, similar to the one in Scala. Joins with no condition (cartesian products) must be specified explicitly via the crossJoin API.

How was this patch tested?

Added Python tests to ensure that an AnalysisException is raised if a cartesian product is specified without crossJoin(), and that cartesian products can execute when specified via crossJoin().



@SparkQA

SparkQA commented Oct 14, 2016

Test build #66981 has finished for PR 15493 at commit 9b4d995.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66982 has finished for PR 15493 at commit 16c0842.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) left a comment

LGTM

@@ -627,6 +627,25 @@ def alias(self, alias):
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx)

@ignore_unicode_prefix
@since(2.0)
Member


shouldn't this be 2.1?

@@ -627,6 +627,25 @@ def alias(self, alias):
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx)

@ignore_unicode_prefix
@since(2.1)
def crossJoin(self, other):
"""Returns the cartesian product with another :class:`DataFrame`
Contributor


nit: add a period.

Contributor Author


done.

@rxin
Contributor

rxin commented Oct 14, 2016

LGTM pending Jenkins.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #66996 has finished for PR 15493 at commit a450857.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 15, 2016

Merging in master.

@rxin
Contributor

rxin commented Oct 15, 2016

cc @felixcheung do we need some change for R?

@asfgit asfgit closed this in 2d96d35 Oct 15, 2016
@SparkQA

SparkQA commented Oct 15, 2016

Test build #66998 has finished for PR 15493 at commit 8b60ef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

@rxin In R, a cross join is the default when the join expression is empty (https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2304).
I reviewed the code and documentation; I think it is sufficient.

@rxin
Contributor

rxin commented Oct 15, 2016

The issue is that we want to prevent users from shooting themselves in the foot, i.e. we want to avoid accidental cross joins. The idea is unless the user explicitly says crossJoin, we should disallow crossjoins.

@felixcheung (Member)

That's a great point. Currently R behaves the same as Python: when joinExpr is NULL (R) or on is None (Python), a cross join is assumed.

The problem is that joinExpr defaults to NULL (R) and on defaults to None (Python). One approach is to change both defaults so that, when omitted, they no longer produce a cross join.
Another approach is to additionally require a joinType (or how) argument, or add a new cross_join function; but this could be a bigger change.

@rxin
Contributor

rxin commented Oct 17, 2016

Why not just introduce a crossJoin function in R, similar to Python/Scala/Java?

We don't want to change the default join type, because it is still valid to run an inner join by specifying a predicate later using the filter operator.

@rxin
Contributor

rxin commented Oct 19, 2016

cc @felixcheung

We still need this. I'm going to create an upstream ticket. Can one of you take it?


@felixcheung (Member)

I will take that and add my note.

@rxin
Contributor

rxin commented Oct 20, 2016

Thanks!

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
Author: Srinath Shankar <srinath@databricks.com>

Closes apache#15493 from srinathshankar/crosspython.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Closes apache#15493 from srinathshankar/crosspython.