Add a ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr#9304
Add a ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr#9304alamb merged 30 commits intoapache:mainfrom
ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr#9304Conversation
3242780 to
bba4940
Compare
| } | ||
| } | ||
|
|
||
| fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> { |
There was a problem hiding this comment.
port from operator to function, schema is removed since not used
There was a problem hiding this comment.
I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?
logical_plan after simplify_expressions SAME TEXT AS ABOVE
That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113
There was a problem hiding this comment.
No special reason. I forgot this analyzer exists. Let me take a look
alamb
left a comment
There was a problem hiding this comment.
Thanks @jayzhan211 -- this is looking really nice and close. Thank you so much for working on these items and I again apologize for the delay in review.
I left some comments -- let me know what you think.
| } | ||
| // Wrap with Expr::Cast() to Int64 | ||
| fn simplify(&self, args: &[Expr]) -> Result<Simplified> { | ||
| let dfs = DFSchema::new_with_metadata( |
There was a problem hiding this comment.
This is a great example, thank you @jayzhan211
It seems to me that the UDF can't always know what the input schema would be, and thus the schema should be passed in.
Something like this
fn simplify(&self, args: &[Expr], schema: &DFSchema) -> Result<Simplified> {
...
}What do you think?
There was a problem hiding this comment.
SimpliyContext has DFSchemaRef, so I change the type to DFSchemaRef, I think it is also fine
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_user_defined_functions_cast_to_i64() -> Result<()> { |
There was a problem hiding this comment.
Thank you so much for this test / example -- it makes seeing how the API would work really clear. 👏
datafusion/expr/src/udf.rs
Outdated
| fn simplify(&self, _args: &[Expr]) -> Result<Simplified> { | ||
| Ok(Simplified::Original) | ||
| } |
There was a problem hiding this comment.
Is it possible to make this take self (and avoid a copy)? Something like
| fn simplify(&self, _args: &[Expr]) -> Result<Simplified> { | |
| Ok(Simplified::Original) | |
| } | |
| fn simplify(self, _args: &[Expr]) -> Result<Simplified> { | |
| Ok(Simplified::Original(self)) | |
| } | |
| ```? |
There was a problem hiding this comment.
It seems quite hard.
There was a problem hiding this comment.
If I +Sized for UDF trait, it can't be made into an object.
| } | ||
| } | ||
|
|
||
| fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> { |
There was a problem hiding this comment.
I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?
logical_plan after simplify_expressions SAME TEXT AS ABOVE
That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113
|
Thanks @jayzhan211 -- I have some more thoughts on this PR and I hope to share them soon. However it may not be until tomorrow as I have a bunch of other things going on right now so I may run out of time |
|
I ran out of time (again) today, but I will put this at the top of my list for tomorrow |
datafusion/expr/src/udf.rs
Outdated
| // Do the function rewrite. | ||
| // 'args': The arguments of the function | ||
| // 'schema': The schema of the function | ||
| fn simplify(&self, _args: &[Expr], _schema: DFSchemaRef) -> Result<Simplified> { |
There was a problem hiding this comment.
I played around with this PR today.
I think the core challenge is that the Simplification code uses SimplifyContext which purposely doesn't have a reference to DFSchema. This means the simplifier can't call ScalarUDFImpl::simplify as written.
I tried several options, and here is my suggestion:
- Move SimplifyContext into datafusion_expr
- Change the signature to
fn simplify(&self, _args: &[Expr], info: &dyn SimplifyContext) -> Result<Simplified>
Then I think calling ScalarUDFImpl::simplify will end up being straightforward to call
The one challenge I found when I tried to do this locally is that SImplifyInfo currently depends on ExecutionPropss which is in datafusion-physical-expr -- we probably have to move that structure into DataFusion common or datafusion expr as well
There was a problem hiding this comment.
I guess what you are suggesting is something like
|
I view this PR as strategically important (as it is needed to unlock the migration of functions out of the core) I think figuring out how to get |
| { | ||
| let schema = self | ||
| .info | ||
| .schema() |
There was a problem hiding this comment.
I think we can get the DFSchema from SimplifyContext here.
Thank you @alamb , I don't mind that. |
| out_expr.rewrite(self)? | ||
| } | ||
|
|
||
| Expr::ScalarFunction(ScalarFunction { |
I was thinking we should stay true to the design intent on If we were going to require DFSChema I think we could simply get rid of /// The information necessary to apply algebraic simplification to an
/// [Expr]. See [SimplifyContext] for one concrete implementation.
///
/// This trait exists so that other systems can plug schema
/// information in without having to create `DFSchema` objects. If you
/// have a [`DFSchemaRef`] you can use [`SimplifyContext`]
pub trait SimplifyInfo { |
ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr
|
LGTM |
|
I plan to resolve the conflicts with this PR later today and merge this PR unless there are any additional comments. |
| // Casting should be done in `simplify`, so we just return the first argument | ||
| fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
| assert_eq!(args.len(), 1); | ||
| Ok(args.first().unwrap().clone()) |
There was a problem hiding this comment.
Would it make sense to add ExprSimplifyResult::Replace(Expr) to cover this case, and eliminate UDF call when expression is simplified?
There was a problem hiding this comment.
To put some context to my comment, let's say if we define function f(INT, INT) = $1 + $2 we can eliminate UDF call with Alias($1 + $2, "f(a,b)") and get UDF free plan, which would be easier to distribute across ballista cluster
There was a problem hiding this comment.
I think it is an excellent idea -- I did so in fea82cb
| // Casting should be done in `simplify`, so we just return the first argument | ||
| fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
| assert_eq!(args.len(), 1); | ||
| Ok(args.first().unwrap().clone()) |
There was a problem hiding this comment.
I think it is an excellent idea -- I did so in fea82cb
... `DROP FUNCTION` will look for function name in all available registries (udf, udaf, udwf). `remove` may be necessary if UDaF and UDwF do not get `simplify` method from apache#9304.
| Ok(ExprSimplifyResult::Simplified(new_expr)) | ||
| } | ||
|
|
||
| // Casting should be done in `simplify`, so we just return the first argument |
There was a problem hiding this comment.
comment is a bit outdated
|
Here we go 🚀 Thank you again @jayzhan211 for all the help getting this one tee'd up and ready to merge. I think this will be a very powerful (but simple) API |
* Add plugable function factory * cover `DROP FUNCTION` as well ... ... partially, as `SessionState` does not expose unregister_udf at the moment. * update documentation * fix doc test * Address PR comments (code organization) * Address PR comments (factory interface) * fix test after rebase * `remove`'s gone from the trait ... ... `DROP FUNCTION` will look for function name in all available registries (udf, udaf, udwf). `remove` may be necessary if UDaF and UDwF do not get `simplify` method from #9304. * Rename FunctionDefinition and export it ... FunctionDefinition already exists, DefinitionStatement makes more sense. * Update datafusion/expr/src/logical_plan/ddl.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/execution/context/mod.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/tests/user_defined/user_defined_scalar_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/expr/src/logical_plan/ddl.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * resolve part of follow up comments * Qualified functions are not supported anymore * update docs and todos * fix clippy * address additional comments * Add sqllogicteset for CREATE/DROP function * Add coverage for DROP FUNCTION IF EXISTS * fix multiline error * revert dialect back to generic in test ... ... as `create function` gets support in latest sqlparser. * fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
| }) => match udf.simplify(args, info)? { | ||
| ExprSimplifyResult::Original(args) => { | ||
| Transformed::yes(Expr::ScalarFunction(ScalarFunction { | ||
| func_def: ScalarFunctionDefinition::UDF(udf), | ||
| args, | ||
| })) | ||
| } | ||
| ExprSimplifyResult::Simplified(expr) => Transformed::no(expr), | ||
| }, |
There was a problem hiding this comment.
Sorry, I'm a bit late to the party on this one but I keep rereading this and I can't help but think the logic is inverted - Original means unchanged which should result in Transformed::no, shouldn't it? And Simplified mean changed which should result in Transformed::yes?
I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔
Or am I completely missing something (very likely)?
There was a problem hiding this comment.
I think you are correct that the logic is inverted. I wil make a PR to correct it.
I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔
It is a good question about why no tests fail if Transformed::no
I think it is because the expr simplifer code doens't do anything different if the expression was transformed or not: https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L188-L189
It just looks at data:
https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/common/src/tree_node.rs#L456

Which issue does this PR close?
Closes #9289 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?