-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend argument types for udf return_type_from_exprs
#9522
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
return_type_from_exprs
return_type_from_exprs
return_type_from_exprs
) -> Result<Arc<dyn PhysicalExpr>> { | ||
let input_expr_types = input_phy_exprs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check the signature for udf too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea was at this point the type checking had already been done, so there is no reason to check the types again.
However, I looked at how non udf functions work, and they also seem to check the data types again, so the change in this PR is consistent:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean they are checked in logical expr once, so we dont need to double check them in physical expr?
If so, we can remove both buitlin & udf in physical expr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to check signature if someone create physcial expr directly (might be possible?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me merge this first and move on #9504
@@ -655,6 +655,7 @@ impl ScalarUDFImpl for TakeUDF { | |||
&self, | |||
arg_exprs: &[Expr], | |||
schema: &dyn ExprSchema, | |||
_arg_data_types: &[DataType], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems unused 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's right for usage, for return_type_from_exprs
user would calculate the return_type form Expr
and Schema
. And for a user defined return_type_from_exprs
, it would not re-calculate the arg_data_types
from args
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _arg_data_types
is not provided, we calculate them from args
and schema
. They are mutual exclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(note this is a test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @jayzhan211
I previously did some work on return_type_from_exprs
so I want to help review.😄
I think the issue is because here already got the arg_data_types and later in return_type_from_exprs
it calculated again? So this PR would not re-calculate the data_types. https://github.com/apache/arrow-datafusion/blob/eebdbe8be61ba1944b7a7e14d3edeed0259a7d74/datafusion/expr/src/expr_schema.rs#L119-L122
Maybe in the future after we don't need BuiltIn Scalar Function we can remove these lines(for it's only used for BuiltIn) and it could also fix the issue? I'm not sure what's the real reason. 🤔 But currently I think this PR resolve it.
datafusion/expr/src/udf.rs
Outdated
.map(|arg| arg.get_type(schema)) | ||
.collect::<Result<Vec<_>>>()?; | ||
self.return_type(&arg_types) | ||
if arg_types.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if arg_types
is empty, would here also get the empty arg_types
since this param is calculated in the same way before? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if arg_types
is not provided, we can get them from args
and schema
, and it should not be empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will arg_types
not be provided?
I think in the current PR, we will pass always pass arg_types
to the default implementation of return_type_from_exprs
, thus it may be alway non-empty?
By default, they should get data types with these lines, so we still need them even all the functions are udf-based. let arg_data_types = args
.iter()
.map(|e| e.get_type(schema))
.collect::<Result<Vec<_>>>()?; If the udf functions need user-defined return type, they can implement their own |
Actually, I think we don't even need |
Yeah, I understand that. What I mean is the default implementation for Expr::ScalarFunction(ScalarFunction { func_def, args }) => {
match func_def {
ScalarFunctionDefinition::UDF(fun) => {
// Maybe we can move the check into return_type_from_exprs?
// verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
// data_types(&arg_data_types, fun.signature()).map_err(|_| {
// plan_datafusion_err!(
// "{}",
// utils::generate_signature_error_msg(
// fun.name(),
// fun.signature().clone(),
// &arg_data_types,
// )
// )
// })?;
// perform additional function arguments validation (due to limited
// expressiveness of `TypeSignature`), then infer return type
Ok(fun.return_type_from_exprs(args, schema)?)
}
ScalarFunctionDefinition::Name(_) => {
internal_err!("Function `Expr` with name should be resolved.")
}
}
}
|
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
) -> Result<DataType> { | ||
let arg_types = args |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems types are always pre-computed, we don't need to check emptiness
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Note that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- this API change makes sense to me. The writeup and discussion of the rationale also makes lots of sense and was easy to read
The only thing I am not sure about is the need to re-check types in pub fn create_physical_expr(
-- I left a comment about that lower down. However, I think the change is ok.
) -> Result<Arc<dyn PhysicalExpr>> { | ||
let input_expr_types = input_phy_exprs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea was at this point the type checking had already been done, so there is no reason to check the types again.
However, I looked at how non udf functions work, and they also seem to check the data types again, so the change in this PR is consistent:
schema: &dyn ExprSchema, | ||
_args: &[Expr], | ||
_schema: &dyn ExprSchema, | ||
arg_types: &[DataType], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the documentation of this API, and I think in general it is clear and no changes are needed
@@ -655,6 +655,7 @@ impl ScalarUDFImpl for TakeUDF { | |||
&self, | |||
arg_exprs: &[Expr], | |||
schema: &dyn ExprSchema, | |||
_arg_data_types: &[DataType], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(note this is a test)
Technically this is an API change, but since https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/36.0.0.md marking it as api-change though just to be sure |
Which issue does this PR close?
Closes #9518
Ends with approach 2.
Follow up change from #8985
Rationale for this change
To avoid recomputing argument types
What changes are included in this PR?
Extend another field for udf return type
Are these changes tested?
Only existing test
Are there any user-facing changes?
Relatively new function , acceptable breaking change