Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend argument types for udf return_type_from_exprs #9522

Merged
merged 3 commits into from
Mar 10, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Mar 10, 2024

Which issue does this PR close?

Closes #9518

Ends with approach 2.

Follow up change from #8985

Rationale for this change

To avoid recomputing argument types

What changes are included in this PR?

Extend another field for udf return type

Are these changes tested?

Only existing test

Are there any user-facing changes?

Relatively new function , acceptable breaking change

    pub fn return_type_from_exprs(
        &self,
        args: &[Expr],
        schema: &dyn ExprSchema,
        arg_types: &[DataType],
    ) -> Result<DataType> {

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Mar 10, 2024
@jayzhan211 jayzhan211 changed the title Extend argument types for udf return type function Extend argument types for udf return_type_from_exprs Mar 10, 2024
@jayzhan211 jayzhan211 changed the title Extend argument types for udf return_type_from_exprs Extend argument types for udf return_type_from_exprs Mar 10, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review March 10, 2024 02:48
) -> Result<Arc<dyn PhysicalExpr>> {
let input_expr_types = input_phy_exprs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check the signature for udf too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea was at this point the type checking had already been done, so there is no reason to check the types again.

However, I looked at how non udf functions work, and they also seem to check the data types again, so the change in this PR is consistent:

https://github.com/apache/arrow-datafusion/blob/6710e6db38d3f5292e6d9506c70c153c2e76f339/datafusion/physical-expr/src/functions.rs#L62-L84

Copy link
Contributor Author

@jayzhan211 jayzhan211 Mar 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean they are checked in logical expr once, so we dont need to double check them in physical expr?
If so, we can remove both buitlin & udf in physical expr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to check signature if someone create physcial expr directly (might be possible?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me merge this first and move on #9504

@@ -655,6 +655,7 @@ impl ScalarUDFImpl for TakeUDF {
&self,
arg_exprs: &[Expr],
schema: &dyn ExprSchema,
_arg_data_types: &[DataType],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems unused 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's right for usage, for return_type_from_exprs user would calculate the return_type form Expr and Schema. And for a user defined return_type_from_exprs, it would not re-calculate the arg_data_types from args.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if _arg_data_types is not provided, we calculate them from args and schema. They are mutual exclusive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note this is a test)

Copy link
Contributor

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @jayzhan211
I previously did some work on return_type_from_exprs so I want to help review.😄
I think the issue is because here already got the arg_data_types and later in return_type_from_exprs it calculated again? So this PR would not re-calculate the data_types. https://github.com/apache/arrow-datafusion/blob/eebdbe8be61ba1944b7a7e14d3edeed0259a7d74/datafusion/expr/src/expr_schema.rs#L119-L122
Maybe in the future after we don't need BuiltIn Scalar Function we can remove these lines(for it's only used for BuiltIn) and it could also fix the issue? I'm not sure what's the real reason. 🤔 But currently I think this PR resolve it.

.map(|arg| arg.get_type(schema))
.collect::<Result<Vec<_>>>()?;
self.return_type(&arg_types)
if arg_types.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if arg_types is empty, would here also get the empty arg_types since this param is calculated in the same way before? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if arg_types is not provided, we can get them from args and schema, and it should not be empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will arg_types not be provided?
I think in the current PR, we will pass always pass arg_types to the default implementation of return_type_from_exprs, thus it may be alway non-empty?

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Mar 10, 2024

Maybe in the future after we don't need BuiltIn Scalar Function we can remove these lines(for it's only used for BuiltIn) and it could also fix the issue? I'm not sure what's the real reason. 🤔 But currently I think this PR resolve it.

By default, they should get data types with these lines, so we still need them even all the functions are udf-based.

let arg_data_types = args 
     .iter() 
     .map(|e| e.get_type(schema)) 
     .collect::<Result<Vec<_>>>()?; 

If the udf functions need user-defined return type, they can implement their own return_type.

@jayzhan211
Copy link
Contributor Author

Actually, I think we don't even need return_type_from_exprs, we should combine it with return_type with args, schema, arg_types.

@yyy1000
Copy link
Contributor

yyy1000 commented Mar 10, 2024

By default, they should get data types with these lines, so we still need them even all the functions are udf-based.
If the udf functions need user-defined return type, they can implement their own return_type.

Yeah, I understand that. What I mean is the default implementation for return_type_from_exprs also get data types with these lines. Maybe we can delete them (if all the functions are udf-based) and the return_type_from_exprs would keep the same as before, so the args_type are only computed once?

Expr::ScalarFunction(ScalarFunction { func_def, args }) => {
               match func_def {
                   ScalarFunctionDefinition::UDF(fun) => {
                       // Maybe we can move the check into return_type_from_exprs?
                       // verify that function is invoked with correct number and type of arguments as defined in `TypeSignature`
                       // data_types(&arg_data_types, fun.signature()).map_err(|_| {
                          //  plan_datafusion_err!(
                          //     "{}",
                           //   utils::generate_signature_error_msg(
                              //     fun.name(),
                              //     fun.signature().clone(),
                             //      &arg_data_types,
                           //    )
                         //  )
                      //  })?;


                       // perform additional function arguments validation (due to limited
                       // expressiveness of `TypeSignature`), then infer return type
                       Ok(fun.return_type_from_exprs(args, schema)?)
                   }
                   ScalarFunctionDefinition::Name(_) => {
                       internal_err!("Function `Expr` with name should be resolved.")
                   }
               }
           }

Actually, I think we don't even need return_type_from_exprs, we should combine it with return_type with args, schema, arg_types.
That sounds good. :) I'd like to see what it will be like

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
) -> Result<DataType> {
let arg_types = args
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems types are always pre-computed, we don't need to check emptiness

@jayzhan211 jayzhan211 marked this pull request as draft March 10, 2024 05:50
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Mar 10, 2024

Yeah, I understand that. What I mean is the default implementation for return_type_from_exprs also get data types with these lines. Maybe we can delete them (if all the functions are udf-based) and the return_type_from_exprs would keep the same as before, so the args_type are only computed once?

Note that return_type_from_exprs is used in physical-expr too. args types are computed differently than the logical one, so we may need to move the get_type computation out of return_type_from_exprs

@jayzhan211 jayzhan211 marked this pull request as ready for review March 10, 2024 06:54
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this API change makes sense to me. The writeup and discussion of the rationale also makes lots of sense and was easy to read

The only thing I am not sure about is the need to re-check types in pub fn create_physical_expr( -- I left a comment about that lower down. However, I think the change is ok.

) -> Result<Arc<dyn PhysicalExpr>> {
let input_expr_types = input_phy_exprs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea was at this point the type checking had already been done, so there is no reason to check the types again.

However, I looked at how non udf functions work, and they also seem to check the data types again, so the change in this PR is consistent:

https://github.com/apache/arrow-datafusion/blob/6710e6db38d3f5292e6d9506c70c153c2e76f339/datafusion/physical-expr/src/functions.rs#L62-L84

schema: &dyn ExprSchema,
_args: &[Expr],
_schema: &dyn ExprSchema,
arg_types: &[DataType],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the documentation of this API, and I think in general it is clear and no changes are needed

@@ -655,6 +655,7 @@ impl ScalarUDFImpl for TakeUDF {
&self,
arg_exprs: &[Expr],
schema: &dyn ExprSchema,
_arg_data_types: &[DataType],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note this is a test)

@alamb alamb added the api change Changes the API exposed to users of the crate label Mar 10, 2024
@alamb
Copy link
Contributor

alamb commented Mar 10, 2024

Technically this is an API change, but since return_type_from_exprs doesn't seem to have been released yet, I think this doesn't actually change any released APIs.

https://github.com/apache/arrow-datafusion/blob/main/dev/changelog/36.0.0.md

marking it as api-change though just to be sure

@jayzhan211 jayzhan211 merged commit 31fcd72 into apache:main Mar 10, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache the arguments type to avoid recomputing for ScalarUDF::return_type_from_exprs
4 participants