Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for ScalarUDFImpl::invoke_with_return_type where the invoke is passed the return type created for the udf instance #13290

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

joseph-isaacs
Copy link

@joseph-isaacs joseph-isaacs commented Nov 7, 2024

Which issue does this PR close?

Closes #12819.

Rationale for this change

There is currently no way to determine the schema-defined nullability of the arrays from args: &[ColumnarValue],from the invoke_batch function. The array nullability Array::is_nullable() (which is available) only checks the array null count.

If the return type (from ScalarUDFImpl::return_type) is passed to the newly defined invoke method then this datatype can be inspected and any nullability information can be inferred.

What changes are included in this PR?

Define a new trait function invoke_batch_with_return_type which defaults to calling to invoke_batch.

Example usage can be seen here.

Are these changes tested?

Are there any user-facing changes?

There are no breaking changes, but a new optional method is added.

…invoke is passed the return type created for the udf instance
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Nov 7, 2024
@joseph-isaacs joseph-isaacs marked this pull request as ready for review November 7, 2024 15:27
&self,
args: &[ColumnarValue],
number_rows: usize,
return_type: &DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add InvokeUDFArg struct that includes these to avoid breaking change in the future?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should roll all the values into this struct or just number_rows and return_type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including args

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the new function name, if we roll these in together. Since this subsumes invoke_batch, would we then want to deprecate that too?

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer single invoke function, but that will be a large change so not required to roll them together.

Copy link
Author

@joseph-isaacs joseph-isaacs Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name than

 pub fn invoke_batch_with_return_type(
        &self,
        args: InvokeArgs
    ) -> Result<ColumnarValue> {

is needed, maybe invoke_with_args(), I don't like it that much, but not sure on the name.

 pub fn invoke_with_args(
        &self,
        args: InvokeArgs
    ) -> Result<ColumnarValue> {

@findepi
Copy link
Member

findepi commented Nov 11, 2024

In an ideal worlds, UDF would have single invoke method. This is why #13238 / #13064 / #13345 consolidation takes place.
Now, we add new invoke variant. Would we want all implementations to consolidate on this new variant, or would it feel awkward when doing so?
Passing back the return_type: &DataType might be sometimes useful, but generally feels redundant.

In #12819 (comment) i am attempting to clarify why simplify isn't useful for the use-case.

If not simplify, what if we had something like bind_return_type method that's guaranteed to be invoked before invoke / invoke_batch and which may return new UDF instance (if state changes) or nothing (per default impl). This way we would provide exactly and directly what's needed for the use-case, without changing invoke itself.

@joseph-isaacs
Copy link
Author

I would also be happy to scope out another prepare/bind method on a UDF. But It seems like we need a consensus on the preferred solution @findepi @alamb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Access children DataType or return-type in ScalarUDFImpl::invoke
3 participants