-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-20315] [SQL] Set ScalaUDF's deterministic to true #17626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -45,6 +45,9 @@ case class ScalaUDF( | |||
udfName: Option[String] = None) | |||
extends Expression with ImplicitCastInputTypes with NonSQLExpression { | |||
|
|||
// the user-defined functions must be deterministic. | |||
final override def deterministic: Boolean = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, although we decided that this must be true
, but should we override this explicitly? Is this for allowing further optimizations like predicate pushdown?
Test build #75752 has finished for PR 17626 at commit
|
Test build #75753 has finished for PR 17626 at commit
|
Hmm, I think it is a bit different between the deterministic assumption on UDF functions and the Even your UDF functions are deterministic, if the input expressions are not deterministic, the result of this ScalaUDF isn't deterministic. |
Even if the input expressions are not deterministic, the output could be still deterministic. If we already make an assumption that ScalaUDF is deterministic, we should make it behave consistent. |
udf(x, y) = x + y looks like a deterministic UDF function. Is this udf(rand(), rand()) deterministic? |
A simple example, |
The deterministic of our Scala udaf has exactly the same logics and Hive UDF/UDAF... have the same logics too |
The result of I think |
After I rethinking about it, it really depends on how we define I checked the definition of Hive
The deterministic value of an expression should always consider the children's deterministic values. All the other existing UDFs and UDAFs in Spark are not correct. Thus, I will close it and fix them in a separate PR. |
What changes were proposed in this pull request?
ScalaUDF is always assumed to deterministic, based on the previous discussion in #13087. However, the current master still sets it based on the children's deterministic values.
This PR is to correct it by always setting it to true, even if the children's deterministic values are not true.
How was this patch tested?
Added a test case.