Skip to content

[SPARK-20315] [SQL] Set ScalaUDF's deterministic to true #17626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

gatorsmile
Copy link
Member

What changes were proposed in this pull request?

ScalaUDF is always assumed to deterministic, based on the previous discussion in #13087. However, the current master still sets it based on the children's deterministic values.

This PR is to correct it by always setting it to true, even if the children's deterministic values are not true.

How was this patch tested?

Added a test case.

@gatorsmile
Copy link
Member Author

cc @cloud-fan @dongjoon-hyun

@@ -45,6 +45,9 @@ case class ScalaUDF(
udfName: Option[String] = None)
extends Expression with ImplicitCastInputTypes with NonSQLExpression {

// the user-defined functions must be deterministic.
final override def deterministic: Boolean = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, although we decided that this must be true, but should we override this explicitly? Is this for allowing further optimizations like predicate pushdown?

@SparkQA
Copy link

SparkQA commented Apr 13, 2017

Test build #75752 has finished for PR 17626 at commit 4e59b67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 13, 2017

Test build #75753 has finished for PR 17626 at commit 03303a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Apr 13, 2017

Hmm, I think it is a bit different between the deterministic assumption on UDF functions and the deterministic of ScalaUDF.

Even your UDF functions are deterministic, if the input expressions are not deterministic, the result of this ScalaUDF isn't deterministic.

@gatorsmile
Copy link
Member Author

Even if the input expressions are not deterministic, the output could be still deterministic.

If we already make an assumption that ScalaUDF is deterministic, we should make it behave consistent.

@viirya
Copy link
Member

viirya commented Apr 13, 2017

udf(x, y) = x + y looks like a deterministic UDF function. Is this udf(rand(), rand()) deterministic?

@gatorsmile
Copy link
Member Author

A simple example, udf(x, y) = 1 is deterministic no matter whether x or y is deterministic or not.

@gatorsmile
Copy link
Member Author

The deterministic of our Scala udaf has exactly the same logics and Hive UDF/UDAF... have the same logics too

@viirya
Copy link
Member

viirya commented Apr 13, 2017

udf(x, y) = 1 is deterministic no matter whether x or y is deterministic or not, is because x, y are not used, in other words they don't affect the result of the udf.

The result of udf(x, y) = x + y is deterministic or not, depending on whether x and y are deterministic too. The result of udf(rand(), rand()) is not deterministic, for example.

I think ScalaUDAF is a bit different to ScalaUDF, because we don't push down an aggregate function. But we will push down a ScalaUDF if it is deterministic. Pushing down it might cause unexpected results, if you don't care if its inputs are deterministic or not.

@gatorsmile
Copy link
Member Author

gatorsmile commented Apr 13, 2017

After I rethinking about it, it really depends on how we define deterministic.

I checked the definition of Hive UDFType.

  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;

The deterministic value of an expression should always consider the children's deterministic values. All the other existing UDFs and UDAFs in Spark are not correct. Thus, I will close it and fix them in a separate PR.

@gatorsmile gatorsmile closed this Apr 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants