[SPARK-17946][PYSPARK] Python crossJoin API similar to Scala #15493


Closed

srinathshankar wants to merge 4 commits

Conversation

srinathshankar (Contributor)

What changes were proposed in this pull request?

Add a crossJoin function to the DataFrame API, similar to the one in Scala. Joins with no condition (cartesian products) must be specified explicitly via the crossJoin API.

How was this patch tested?

Added Python tests to ensure that an AnalysisException is raised if a cartesian product is specified without crossJoin(), and that cartesian products can execute when specified via crossJoin().



@SparkQA

SparkQA commented Oct 14, 2016

Test build #66981 has finished for PR 15493 at commit 9b4d995.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66982 has finished for PR 15493 at commit 16c0842.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal (Member) left a comment

LGTM

@@ -627,6 +627,25 @@ def alias(self, alias):
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx)

@ignore_unicode_prefix
@since(2.0)
Member


shouldn't this be 2.1?

@@ -627,6 +627,25 @@ def alias(self, alias):
return DataFrame(getattr(self._jdf, "as")(alias), self.sql_ctx)

@ignore_unicode_prefix
@since(2.1)
def crossJoin(self, other):
"""Returns the cartesian product with another :class:`DataFrame`
Contributor


nit: add a period.

Contributor Author


done.

@rxin
Contributor

rxin commented Oct 14, 2016

LGTM pending Jenkins.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #66996 has finished for PR 15493 at commit a450857.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 15, 2016

Merging in master.

@rxin
Contributor

rxin commented Oct 15, 2016

cc @felixcheung do we need some change for R?

@asfgit asfgit closed this in 2d96d35 Oct 15, 2016
@SparkQA

SparkQA commented Oct 15, 2016

Test build #66998 has finished for PR 15493 at commit 8b60ef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

@rxin In R, a cross join is the default when the join expression is empty (https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2304).
I reviewed the code and documentation; I think it is sufficient.

@rxin
Contributor

rxin commented Oct 15, 2016

The issue is that we want to prevent users from shooting themselves in the foot, i.e. we want to avoid accidental cross joins. The idea is unless the user explicitly says crossJoin, we should disallow crossjoins.

@felixcheung (Member)

That's a great point. Currently R behaves the same as Python: when joinExpr is NULL (R) or on is None (Python), a cross join is assumed.

The problem is that joinExpr defaults to NULL (R) and on defaults to None (Python). One approach is to change both defaults so that, when omitted, they no longer produce a cross join.
Another approach is to additionally require a joinType (or how) argument, or add a new cross_join function; but this could be a bigger change.

@rxin
Contributor

rxin commented Oct 17, 2016

Why not just introduce a crossJoin function in R, similar to Python/Scala/Java?

We don't want to change the default join type, because it is still valid to run an inner join by specifying a predicate later using the filter operator.

@rxin
Contributor

rxin commented Oct 19, 2016

cc @felixcheung

We still need this. I'm going to create an upstream ticket. Can one of you take it?


@felixcheung (Member)

I will take that and add my note.

@rxin
Contributor

rxin commented Oct 20, 2016

Thanks!

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
Author: Srinath Shankar <srinath@databricks.com>

Closes apache#15493 from srinathshankar/crosspython.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Closes apache#15493 from srinathshankar/crosspython.