[SPARK-48845][SQL] GenericUDF catch exceptions from children #47268
Conversation
Force-pushed branch …xception_from_child_func from f097982 to eb2b0a2
cc @cloud-fan @yaooqinn @panbingkun FYI
Can we explain the execution path before Spark 3.5? Is this PR trying to follow the previous execution path?
For Spark 3.3, the
Is it possible to combine the code generation and non-code generation code?
AFAIK, it's hard to set a lazy function on DeferredObject. I have tried with lambda functions and inner classes. Any better ideas?
```
@@ -131,6 +131,13 @@ class HiveGenericUDFEvaluator(
  def setArg(index: Int, arg: Any): Unit =
    deferredObjects(index).asInstanceOf[DeferredObjectAdapter].set(arg)

  def setFuncArg(index: Int, arg: () => Any): Unit =
```
It seems that `setFuncArg` and `setException` can be merged?
It is a little ugly that we use an inner class in codegen. If we use `setFuncArg`, the codegen for the exception path would be the following code.
s"""
|try {
| ${eval.code}
| if (${eval.isNull}) {
| $refEvaluator.setArg($i, null);
| } else {
| $refEvaluator.setArg($i, ${eval.value});
| }
|} catch (Exception exp) {
| final exception = exp;
| $refEvaluator.setFundArg($i, new scala.Function0<Object>() {
| public Object apply() {
| throw exception;
| }
| });
|}
|""".stripMargin
```
|  $refEvaluator.setArg($i, null);
|} else {
|  $refEvaluator.setArg($i, ${eval.value});
|}
```
- It seems that `catch (...)` is only related to `${eval.code}`
- Why is it `Exception` instead of `Throwable`?
> It seems that catch (...) is only related to ${eval.code}

Yes, we should catch the exceptions from the children's execution.

> Why is it Exception instead of Throwable?

`Exception` catches most throwable problems. It's OK to change it to `Throwable`.
Will it cover
Yes, it covers the
LGTM from my side.
```
|} else {
|  $refEvaluator.setArg($i, ${eval.value});
|try {
|  ${eval.code}
```
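For readability, here is one plausible shape of the full generated per-argument block these fragments come from, written with the `setException` helper discussed above. This is a reconstruction for illustration, not the exact merged code:

```scala
// Reconstruction of the per-argument codegen block (assumed, not copied from the PR):
s"""
   |try {
   |  ${eval.code}
   |  if (${eval.isNull}) {
   |    $refEvaluator.setArg($i, null);
   |  } else {
   |    $refEvaluator.setArg($i, ${eval.value});
   |  }
   |} catch (Throwable t) {
   |  // Defer the error instead of failing the whole query here; it is
   |  // re-thrown only when the GenericUDF reads this argument.
   |  $refEvaluator.setException($i, t);
   |}
   |""".stripMargin
```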
This is slightly different from the interpreted path. In codegen we still eagerly execute the function inputs, but delay throwing the errors. I understand that it's hard to be truly lazy in codegen; can we update the interpreted path instead to make it the same as codegen?
Sure, and done. I have updated the code for non-codegen mode and the description for this PR.
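For context, a rough sketch (not the actual merged code) of how the interpreted path can mirror the codegen behavior: evaluate each child eagerly and, if it throws, hand the error to the evaluator so it is only re-thrown when the UDF reads that argument. Helper names such as `setException` follow this thread and may differ in the merged change:

```scala
// Hypothetical sketch of HiveGenericUDF.eval in the interpreted path.
override def eval(input: InternalRow): Any = {
  var i = 0
  while (i < children.length) {
    try {
      // Eager evaluation of the child, same as the codegen path.
      evaluator.setArg(i, children(i).eval(input))
    } catch {
      case t: Throwable =>
        // Defer the error: it is re-thrown from DeferredObject.get(), so a
        // GenericUDF that catches exceptions can still return null.
        evaluator.setException(i, t)
    }
    i += 1
  }
  evaluator.evaluate()
}
```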
test("SPARK-48845: GenericUDF catch exceptions from child UDFs") { | ||
withTable("test_catch_exception") { | ||
Seq("9", "9-1").toDF("a").write.saveAsTable("test_catch_exception") | ||
sql("create temporary function udf_exception as " + |
Let's wrap it with `withUserDefinedFunction` to clean up the functions at the end.
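For illustration, the wrapped test might look roughly like this. The UDF class names come from the files in this PR, but the assertions and exact values below are assumptions, not the merged test:

```scala
// Sketch only -- expected rows and the UDF behaviour for "9" are assumed.
test("SPARK-48845: GenericUDF catch exceptions from child UDFs") {
  withUserDefinedFunction("udf_exception" -> true, "udf_catch_exception" -> true) {
    withTable("test_catch_exception") {
      Seq("9", "9-1").toDF("a").write.saveAsTable("test_catch_exception")
      sql("create temporary function udf_exception as " +
        "'org.apache.spark.sql.hive.execution.UDFThrowException'")
      sql("create temporary function udf_catch_exception as " +
        "'org.apache.spark.sql.hive.execution.UDFCatchException'")
      // udf_catch_exception is expected to swallow the failure from
      // udf_exception and yield null instead of failing the query.
      checkAnswer(
        sql("select udf_catch_exception(udf_exception(a)) from test_catch_exception"),
        Seq(Row("9"), Row(null)))
    }
  }
}
```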
sql/hive/src/test/java/org/apache/spark/sql/hive/execution/UDFCatchException.java
sql/hive/src/test/java/org/apache/spark/sql/hive/execution/UDFException.java
Regarding the GA failure, I guess you need to rebase onto master. 😄
LGTM
@panbingkun any other comments?
No, go ahead.
sql/hive/src/test/java/org/apache/spark/sql/hive/execution/UDFThrowException.java
Oh, my bad @LuciferYang
### What changes were proposed in this pull request?
This PR fixes the semantic issues with GenericUDF since 3.5.0. The problem arose from DeferredObject passing a value instead of a function, which prevented users from catching exceptions thrown by child expressions inside a GenericUDF, resulting in a semantic difference.

Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch that exception and return a null value. Currently, however, any exception thrown by udf_exception causes the query to fail.
```
select udf_catch_exception(udf_exception(col1)) from table
```

### Why are the changes needed?
Before Spark 3.5, the GenericUDF's DeferredObject was lazy and the children were evaluated inside `function.evaluate(deferredObjects)`. Now, we run the children's code first; if an exception is thrown, it is deferred and re-thrown lazily from the GenericUDF's DeferredObject.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Newly added UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47268 from jackylee-ch/generic_udf_catch_exception_from_child_func.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 236d957)
Signed-off-by: Kent Yao <yao@apache.org>
Thank you all. Merged to master and branch-3.5 for the coming 3.5.2.
Thanks for your review. @LuciferYang @yaooqinn @cloud-fan @panbingkun |
What changes were proposed in this pull request?
This PR fixes the semantic issues with GenericUDF since 3.5.0. The problem arose from DeferredObject passing a value instead of a function, which prevented users from catching exceptions thrown by child expressions inside a GenericUDF, resulting in a semantic difference.
Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch that exception and return a null value. Currently, however, any exception thrown by udf_exception causes the query to fail, as shown in the query below.
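The example query, as given in the merged commit message:
```
select udf_catch_exception(udf_exception(col1)) from table
```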
Why are the changes needed?
Before Spark 3.5, the GenericUDF's DeferredObject was lazy and the children were evaluated inside `function.evaluate(deferredObjects)`. Now, we run the children's code first; if an exception is thrown, it is deferred and re-thrown lazily from the GenericUDF's DeferredObject.
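A minimal, self-contained sketch of this deferred-exception pattern (plain Scala, not Spark internals; class and method names are illustrative): the child value is computed eagerly, but a failure is only re-thrown when the argument is actually read, so a catching parent can recover and return null.

```scala
// Self-contained illustration of the pattern; not Spark code.
object DeferredExceptionDemo {
  // Stand-in for Hive's DeferredObject: holds either a value or a thunk that
  // re-throws the child's exception on access.
  final class DeferredArg {
    private var thunk: () => Any = () => null
    def set(value: Any): Unit = thunk = () => value
    def setException(t: Throwable): Unit = thunk = () => throw t
    def get(): Any = thunk()
  }

  def main(args: Array[String]): Unit = {
    val arg = new DeferredArg
    // Eager child evaluation: "9-1".toInt throws NumberFormatException,
    // but the error is captured instead of propagating immediately.
    try arg.set("9-1".toInt) catch { case t: Throwable => arg.setException(t) }

    // A "catching" parent (like udf_catch_exception) can swallow the deferred
    // error and return null, restoring the pre-3.5 semantics.
    val result = try arg.get() catch { case _: Throwable => null }
    println(result) // prints: null
  }
}
```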
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Newly added UT.
Was this patch authored or co-authored using generative AI tooling?
No.