[SPARK-25314][SQL] Fix Python UDF accessing attributes from both side of join in join conditions #22326
Conversation
// the new join conditions, if all conditions are unevaluable, we should
// change the join type to CrossJoin.
val newJoinType =
  if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) Cross else joinType
I think we should at least warn, or leave a note, that an unevaluable expression (or Python UDF) in the join condition will be ignored and the join turned into a cross join.
Makes sense, I'll leave a warning log here.
What about also checking spark.sql.crossJoin.enabled and allowing the transformation only in that case?
python/pyspark/sql/tests.py
Outdated
right = self.spark.createDataFrame([Row(b=1)])
f = udf(lambda a, b: a == b, BooleanType())
df = left.crossJoin(right).filter(f("a", "b"))
self.assertEqual(df.collect(), [Row(a=1, b=1)])
self.assertEqual(df.collect(), [Row(a=1, b=1)])
Looks duplicated
Yep, sorry for the mess here, a leftover line from another commit. I'll fix it soon. How do I cancel the test run?
python/pyspark/sql/tests.py
Outdated
left = self.spark.createDataFrame([Row(a=1)])
right = self.spark.createDataFrame([Row(b=1)])
f = udf(lambda a, b: a == b, BooleanType())
df = left.crossJoin(right).filter(f("a", "b"))
BTW, why do we explicitly test cross join here?
Ditto, the correct test is df = left.join(right, f("a", "b")).
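For reference, a self-contained version of that suggested test could look like the sketch below. This is not the exact code merged; it assumes a plain SparkSession and, with the final fix, spark.sql.crossJoin.enabled=true, since a UDF-only join condition gets planned as a cross join:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
# A UDF-only join condition becomes a cross join, so the
# cartesian-product check has to be disabled explicitly:
spark.conf.set("spark.sql.crossJoin.enabled", "true")

left = spark.createDataFrame([Row(a=1)])
right = spark.createDataFrame([Row(b=1)])
f = udf(lambda a, b: a == b, BooleanType())
# The Python UDF sits in the join condition itself and references
# attributes from both sides of the join.
df = left.join(right, f("a", "b"))
assert df.collect() == [Row(a=1, b=1)]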
Test build #95653 has finished for PR 22326 at commit
@@ -97,6 +100,17 @@ class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
    }
    assert(qualifiedPlanNodes.size == 1)
  }

  test("Python UDF refers to the attributes from more than one child in join condition") {
I would add a JIRA prefix here - this change sounds more like fixing a particular problem.
Got it, will add in this commit.
// the new join conditions, if all conditions are unevaluable, we should
// change the join type to CrossJoin.
val newJoinType =
  if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) {
I think this should be done only in this case: #22326 (comment)
Thanks for the reminder, crossJoinEnabled should be checked here.
if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) {
  logWarning(s"The whole commonJoinCondition:$commonJoinCondition of the join " +
    s"plan:\n $j is unevaluable, it will be ignored and the join plan will be " +
    s"turned to cross join.")
The s interpolator should be removed (there is no interpolation in that string).
Thanks, done in a86a7d5.
val newJoinType =
  if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) {
    logWarning(s"The whole commonJoinCondition:$commonJoinCondition of the join " +
      s"plan:\n $j is unevaluable, it will be ignored and the join plan will be " +
not sure that inlining the plan here makes this warning very readable...
The log will look like this:
16:13:35.218 WARN org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin: The whole commonJoinCondition:List((dummyUDF(a#5, b#6) = dummyUDF(d#15, c#14))) of the join plan:
Join Inner, (dummyUDF(a#5, b#6) = dummyUDF(d#15, c#14))
:- LocalRelation [a#5, b#6]
+- LocalRelation [c#14, d#15]
is unevaluable, it will be ignored and the join plan will be turned to cross join.
yes, and if the plan is big, then this would become quite unreadable IMHO. I think it would be better to refactor the message and put the plan at the end.
@xuanyuanking @mgaido91 In the above example, the UDFs refer to attributes from distinct legs of the join. Can we not plan this better than a cross join in this case? I am wondering why we can't do:
Join Inner, leftAlias1 = rightAlias1
:- Project dummyUDF(a, b) as leftAlias1
:  +- LocalRelation(a, b)
+- Project dummyUDF(c, d) as rightAlias1
   +- LocalRelation(c, d)
Perhaps I am missing something...
@dilipbiswal I haven't checked the particular plan posted in that comment; for that one I think you are right, we could handle it as you suggested. But I was checking the case in the UT and in the description of this PR, i.e. when the input for the Python UDF contains attributes from both sides. In that case I don't have a better suggestion.
@mgaido91 Thanks. Marco, do you know if there are instances when we pick a cross join implicitly? It wouldn't perform very well, right? I am wondering if we should error out or pick a bad plan. I guess, like you, I am not sure what's the right thing to do here.
One other thing, Marco: for join types other than inner and left semi, we still have the same issue, no?
@dilipbiswal Thanks for your detailed check, I should have made the case more typical. The case we want to solve here is a UDF accessing attributes from both sides; I'll change the case to dummyPythonUDF(col("a"), col("c")) === dummyPythonUDF(col("d"), col("c")) in the next commit.
yes, and if the plan is big, then this would become quite unreadable IMHO. I think it would be better to refactor the message and put the plan at the end.
@mgaido91 Thanks for your advice, will do the refactor in the next commit.
@dilipbiswal there are cases when "trivial conditions" are removed from a join, so we turn an inner join into a cross one, for instance. The performance would be awful, you're right. The point is that I am not sure there is a better way to achieve this: since we have no clue what the UDF does, we need to compare all the rows from both sides, i.e. we need to perform a cartesian product.
Wondering if we should error out or pick a bad plan
This is, indeed, arguable. I think that letting the user choose is a good idea. If the user runs the query and gets an AnalysisException because he/she is trying to perform a cartesian product, he/she can decide: OK, I am doing a wrong thing, let's change it. Or he/she can say: well, one of my two tables involved contains 10 rows, not a big deal, I want to perform it nonetheless; let's set spark.sql.crossJoin.enabled=true and run it.
for join types other than inner and left semi, we still have the same issue, no?
I think the current PR properly handles only the inner case; for the left semi case this PR returns an incorrect result IIUC. This needs to be fixed as well.
This is, indeed, arguable. I think that letting the user choose is a good idea. If the user runs the query and gets an AnalysisException because he/she is trying to perform a cartesian product, he/she can decide: OK, I am doing a wrong thing, let's change it. Or he/she can say: well, one of my two tables involved contains 10 rows, not a big deal, I want to perform it nonetheless; let's set spark.sql.crossJoin.enabled=true and run it.
Sounds reasonable.
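In PySpark, the opt-in flow being described would look roughly like this (a sketch reusing the left/right/f definitions from the tests above, not code from this PR):

# With the default spark.sql.crossJoin.enabled=false, a UDF-only join
# condition fails with an AnalysisException about a cartesian product:
#   left.join(right, f("a", "b"))
# If one side is known to be tiny, the user can opt in explicitly:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = left.join(right, f("a", "b"))  # now planned as a cross join plus a filter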
s"plan:\n $j is unevaluable, it will be ignored and the join plan will be " + | ||
s"turned to cross join.") | ||
Cross | ||
} else joinType |
} else {
joinType
}
Thanks, done in a86a7d5.
Test build #95662 has finished for PR 22326 at commit
val join = Join(newLeft, newRight, newJoinType, newJoinCond)
if (others.nonEmpty) {
  Filter(others.reduceLeft(And), join)
as pointed out by @dilipbiswal, this is correct only in the case of InnerJoin
Thanks, no need to add an extra Filter in the LeftSemi case.
// [[CheckCartesianProducts]], we throw firstly here for better readable
// information.
throw new AnalysisException("Detected the whole commonJoinCondition:" +
  "$commonJoinCondition of the join plan is unevaluable, we need to cast the" +
missing s
Thanks, done in 82e50d5.
throw new AnalysisException("Detected the whole commonJoinCondition:" +
  "$commonJoinCondition of the join plan is unevaluable, we need to cast the" +
  " join to cross join by setting the configuration variable" +
  " spark.sql.crossJoin.enabled = true.")
What about using the conf val in SQLConf?
Makes sense, also changed this in CheckCartesianProducts. Done in 82e50d5.
if (others.nonEmpty && joinType.isInstanceOf[InnerLike]) {
  Filter(others.reduceLeft(And), join)
} else {
  join
This means that when we have a SemiJoin we are dropping the condition without doing anything, which is wrong. All this logic can be applied only to the Inner case; in the other cases this fix is wrong. Moreover, please add a UT enforcing correctness in the LeftSemi join case, so we can be sure that a wrong fix doesn't go in. Thanks.
It means that in the left_semi join the output of the Join operator should contain only the attributes from the left side, so attributes from the right side should not be referenced after the join. Therefore the plan should be invalid. I am a bit surprised that it works, it would be great to understand why. Thanks.
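Concretely, in PySpark (a sketch reusing the left/right/f definitions above, with spark.sql.crossJoin.enabled=true as before):

# A left semi join keeps only the left side's schema:
df = left.join(right, f("a", "b"), "left_semi")
assert df.columns == ["a"]
# So a Filter referencing b placed above this join would point at an
# attribute that no longer exists in the join's output.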
Thanks @mgaido91 and @dilipbiswal!
I fixed this in 63fbcce. The main problem is the semi join with both deterministic and non-deterministic conditions: a filter after the semi join will fail. I also added more tests on both the Python and Scala side, covering semi join, inner join and the complex scenario described below.
Considering left semi made the strategy difficult to read, so in 63fbcce I split the logic for semi join and inner join.
I am a bit surprised that it works, it would be great to understand why. Thanks.
Sorry for the bad test; it was too special and the result was only right by accident. The original implementation made every semi join return [] in PySpark.
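A minimal regression check for that behavior might look like this (a sketch under the same assumptions as the snippets above):

df = left.join(right, f("a", "b"), "left_semi")
# The original implementation returned [] for every semi join in PySpark;
# the fixed rewrite (inner join + filter + project) keeps the matching row.
assert df.collect() == [Row(a=1)]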
Test build #95676 has finished for PR 22326 at commit
Test build #95678 has finished for PR 22326 at commit
Test build #95716 has finished for PR 22326 at commit
Test build #95718 has finished for PR 22326 at commit
Gentle ping @mgaido91 @HyukjinKwon @dilipbiswal, many thanks for the advice, please have a look when you have time.
  Cross
} else {
  // if the crossJoinEnabled is false, an AnalysisException will throw by
  // [[CheckCartesianProducts]], we throw firstly here for better readable
Tiny nit: we don't necessarily need [[..]] in inlined comments. We can just leave it as is, or use `...` if you feel you should. Feel free to address this along with other comments.
Thanks, done in 87440b0. I'll also pay attention in future work.
}
case _: InnerLike =>
  // push down the single side only join filter for both sides sub queries
  val newLeft = leftJoinConditions.
Can we deduplicate the code here?
No problem, done in 87440b0.
  }
}

test("join condition pushdown: deterministic and non-deterministic in left semi join") {
I didn't add SPARK-25314 because it may be a supplement to test("join condition pushdown: deterministic and non-deterministic").
 * Generate new left and right child of join by pushing down the side only join filter,
 * split commonJoinCondition based on the expression can be evaluated within join or not.
 *
 * @return (newLeftChild, newRightChild, newJoinCondition, conditionCannotEvaluateWithinJoin)
nit: this is not very useful, we can see that these are the names returned...
Got it, see the demo here: https://github.com/apache/spark/pull/22326/files#diff-a636a87d8843eeccca90140be91d4fafR1140. Will remove it in the next commit.
Join(newLeft, newRight, joinType, newJoinCond)
val join = Join(newLeft, newRight, newJoinType, newJoinCond)
if (others.nonEmpty) {
  Project(newLeft.output.map(_.toAttribute), Filter(others.reduceLeft(And), join))
Before the patch we were not doing this projection, and I am not sure why.
cc @hvanhovell @davies who I see worked on this previously.
Could I try to answer this? In this scenario the projection is only used for a left semi join turned into a cross join, to ensure the output contains only left-side attributes.
val (newLeft, newRight, newJoinCond, others) = getNewChildAndSplitCondForJoin(
  j, leftJoinConditions, rightJoinConditions, commonJoinCondition)
// only need to add cross join when whole commonJoinCondition are unevaluable
val newJoinType = if (commonJoinCondition.nonEmpty && newJoinCond.isEmpty) {
Mmh... why do we have this check here, while later for the filter we check others.nonEmpty? Shouldn't they be the same?
Thanks, after detailed checking I changed this to others.nonEmpty; the original check was an unnecessary worry about commonJoinCondition containing both unevaluable and evaluable conditions. I also added a test in the next commit to ensure this.
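A sketch of such a mixed-condition test (reusing left/right/f from above; the column equality is evaluable inside the join, so only the UDF part is pulled out and no cross join is produced):

df = left.join(right, f("a", "b") & (left["a"] == right["b"]))
assert df.collect() == [Row(a=1, b=1)]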
Test build #95790 has finished for PR 22326 at commit
Test build #95789 has finished for PR 22326 at commit
nit: attibutes in the title probably mean attributes
@holdenk Thanks, sorry for the typo.
Test build #95830 has finished for PR 22326 at commit
}

/**
 * Generate new join by pushing down the side only join filter, split commonJoinCondition
nit: filters
@@ -1153,12 +1154,35 @@ class FilterPushdownSuite extends PlanTest {
      "x.a".attr === Rand(10) && "y.b".attr === 5))
    val correctAnswer =
      x.where("x.a".attr === 5).join(y.where("y.a".attr === 5 && "y.b".attr === 5),
        condition = Some("x.a".attr === Rand(10)))
        joinType = Cross).where("x.a".attr === Rand(10))
this is not a change we want, right?
Yes, I changed this to make the test pass. The original thought was that nondeterministic expressions in join conditions are not supported yet, so it should not be a big problem:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Line 105 in 0736e72
// Non-deterministic expressions are not allowed as join conditions.
spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
Lines 1158 to 1159 in 0736e72
// CheckAnalysis will ensure nondeterministic expressions not appear in join condition.
// TODO support nondeterministic expressions in join condition.
But now I think I should be more careful about this and limit the cross join change to the PythonUDF case only. WDYT @mgaido91? Thanks.
From the code in canEvaluateWithinJoin we get the scope relation: (CannotEvaluateWithinJoin = nonDeterministic + Unevaluable) > Unevaluable > PythonUDF.
So for safety maybe I should limit the change scope to the smallest, PythonUDF only. I need some advice from you, thanks :)
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
Lines 104 to 120 in 0736e72

protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr match {
  // Non-deterministic expressions are not allowed as join conditions.
  case e if !e.deterministic => false
  case _: ListQuery | _: Exists =>
    // A ListQuery defines the query which we want to search in an IN subquery expression.
    // Currently the only way to evaluate an IN subquery is to convert it to a
    // LeftSemi/LeftAnti/ExistenceJoin by `RewritePredicateSubquery` rule.
    // It cannot be evaluated as part of a Join operator.
    // An Exists shouldn't be push into a Join operator too.
    false
  case e: SubqueryExpression =>
    // non-correlated subquery will be replaced as literal
    e.children.isEmpty
  case a: AttributeReference => true
  case e: Unevaluable => false
  case e => e.children.forall(canEvaluateWithinJoin)
}
cc @cloud-fan @gatorsmile @hvanhovell for advice on this. It may well be OK, as it enables supporting a case which was not supported before. But I am not sure about the added value, as performing a cross join is often practically impossible.
Thanks @mgaido91 for the detailed review and advice. I would choose to limit the change scope to PythonUDF only, or at least to Unevaluable only. Waiting for others' advice.
IIUC, you are pulling the join condition with the Python UDF out and creating a filter above the join. Then the join becomes a cross join, which usually runs very slowly. I think we should keep the cross join check for this case.
@cloud-fan Thanks for your comment.
Yes, that's right.
Yes, as Marco suggested, the current behavior is to control the cross join via spark.sql.crossJoin.enabled.
@@ -995,7 +995,8 @@ class Dataset[T] private[sql](
  // After the cloning, left and right side will have distinct expression ids.
  val plan = withPlan(
    Join(logicalPlan, right.logicalPlan, JoinType(joinType), Some(joinExprs.expr)))
    .queryExecution.analyzed.asInstanceOf[Join]
    .queryExecution.analyzed
  val joinPlan = plan.collectFirst { case j: Join => j }.get
For reviewers: we need this change because the rule HandlePythonUDFInJoinCondition breaks the assumption that the analyzed join plan will always be a Join node. After adding the rule handling Python UDFs, a Filter or Project node may sit on top of the Join.
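The resulting plan shape can also be inspected from PySpark (a sketch reusing the frames above; the exact plan text varies by version):

df = left.join(right, f("a", "b"))
df.explain(True)
# The analyzed plan now shows the Python UDF in a Filter above a Cross join
# instead of inside the Join condition, which is why collectFirst is needed.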
Test build #96617 has finished for PR 22326 at commit
Test build #96621 has finished for PR 22326 at commit
/**
 * PythonUDF in join condition can not be evaluated, this rule will detect the PythonUDF
 * and pull them out from join condition. For python udf accessing attributes from only one side,
 * they would be pushed down by operation push down rules. If not(e.g. user disables filter push
nits:
- they are
- missing space before (
Thanks, done in d2739af.
s" $joinType is not supported.") | ||
} | ||
// If condition expression contains python udf, it will be moved out from | ||
// the new join conditions. If join condition has python udf only, it will be turned |
I think we don't need the second sentence here, i.e. the one starting with "If join condition ...".
Makes sense, it duplicates the log message. Done in d2739af.
// the new join conditions. If join condition has python udf only, it will be turned
// to cross join and the crossJoinEnable will be checked in CheckCartesianProducts.
val (udf, rest) =
  condition.map(splitConjunctivePredicates).get.partition(hasPythonUDF)
nit: splitConjunctivePredicates(condition.get).partition(...) seems clearer to me
Thanks, done in d2739af.
val newCondition = if (rest.isEmpty) {
  logWarning(s"The join condition:$condition of the join plan contains " +
    "PythonUDF only, it will be moved out and the join plan will be turned to cross " +
    s"join. This plan shows below:\n $j")
can we at least remove the whole plan from the warning? Plans can be pretty big...
Got it, done in d2739af.
case LeftSemi =>
  Project(
    j.left.output.map(_.toAttribute),
    Filter(udf.reduceLeft(And), newJoin.copy(joinType = Inner)))
nit: indentation
Thanks, done in d2739af.
Just one comment; other than that, my only concern is that we are introducing a lot of end-to-end tests while we have no test targeting only the newly introduced optimizer rule. So I'd prefer having one or two end-to-end tests and creating a new suite testing only the rule and the plan transformation, both for lower testing time and for finer grained tests checking that the output plan is indeed the expected one (not only checking the result of the query).
Apart from this, the change looks fine to me.
}
assert(errMsg.getMessage.startsWith("Detected implicit cartesian product"))
// Test with spark.sql.crossJoin.enabled=true
spark.conf.set("spark.sql.crossJoin.enabled", "true")
please use withSQLConf
Thanks, done in 7f66954.
So I'd prefer having one or two end-to-end tests and creating a new suite testing only the rule and the plan transformation, both for lower testing time and for finer grained tests checking that the output plan is indeed the expected one (not only checking the result of the query).
Makes sense, will add a plan test for this rule.
Test build #96630 has finished for PR 22326 at commit
Test build #96631 has finished for PR 22326 at commit
Test build #96635 has finished for PR 22326 at commit
python/pyspark/sql/tests.py
Outdated
with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
    self.assertEqual(df.collect(), [Row(a=1, a1=1, a2=1)])

def test_udf_and_filter_in_join_condition(self):
This test (and the corresponding one for left semi join) is not very useful: the filter in the join condition will be pushed down, so this test is basically the same as test_udf_in_join_condition (see the sketch below).
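For example (a sketch reusing the frames above, not code from this PR), these two formulations converge to essentially the same plan once the deterministic predicate is pushed down to the left side:

df1 = left.join(right, f("a", "b") & (left["a"] == 1))
df2 = left.where(left["a"] == 1).join(right, f("a", "b"))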
Makes sense, they were just for checking during implementation; deleted both in 2b6977d.
@@ -100,6 +105,29 @@ class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
    }
    assert(qualifiedPlanNodes.size == 1)
  }

  test("SPARK-25314: Python UDF refers to the attributes from more than one child " +
This is still an end-to-end test, I don't think we need it
Got it, I used this for mocking the Python UDF in the IDE. Will do this in a follow-up PR with a new test suite in org.apache.spark.sql.catalyst.optimizer; reverted in 2b6977d.
LGTM except some unnecessary end-to-end tests. +1 for @mgaido91's idea about unit tests, something like the test suites under org.apache.spark.sql.catalyst.optimizer.
Test build #96659 has finished for PR 22326 at commit
thanks, merging to master/2.4!
… of join in join conditions

## What changes were proposed in this pull request?
Thanks for bahchis reporting this. It is more like a follow up work for #16581, this PR fix the scenario of Python UDF accessing attributes from both side of join in join condition.

## How was this patch tested?
Add regression tests in PySpark and `BatchEvalPythonExecSuite`.

Closes #22326 from xuanyuanking/SPARK-25314.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2a8cbfd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks everyone for your review and advice.
late LGTM
…dition

#22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake.

a new test

Closes #23153 from cloud-fan/join.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit affe809)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
As comment in apache#22326 (comment), we test the new added optimizer rule by end-to-end test in python side, need to add suites under `org.apache.spark.sql.catalyst.optimizer` like other optimizer rules.

## How was this patch tested?
new added UT

Closes apache#22955 from xuanyuanking/SPARK-25949.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Thanks to @bahchis for reporting this. It is a follow-up to #16581; this PR fixes the scenario of a Python UDF accessing attributes from both sides of a join in the join condition.
How was this patch tested?
Added regression tests in PySpark and BatchEvalPythonExecSuite.