
Access children DataType or return-type in ScalarUDFImpl::invoke #12819

Open
joseph-isaacs opened this issue Oct 8, 2024 · 29 comments · May be fixed by #13290
Labels
enhancement New feature or request

Comments

@joseph-isaacs

Is your feature request related to a problem or challenge?

I am trying to create a scalar UDF, pack, which operates on struct arrays. It packs many arrays into a struct array, each with a distinct field name:

pack(("a", arr1), ("b", arr2), ...) -> struct([("a", arr1.data_type), ("b", arr2.data_type), ...])

This has a data type dependent on the input types and nullability. In the method ScalarUDFImpl::invoke I want to return a struct array with each field having the data type and nullability of the corresponding input; however, the invoke function only gives the data type of each array, not the nullability of the record batch or intermediate children expressions.

I have returned this type information from return_type_from_exprs; I just need to access it in the stateless ScalarUDFImpl.

Describe the solution you'd like

I would like to add a new ScalarUDFImpl::invoke_with_data_type (or invoke_with_return_type) method which is given both the evaluated children arrays (as before) and either the previously computed return type (from return_type_from_exprs) or the arguments already passed to return_type_from_exprs, which invoke could re-evaluate. I am open to either; I guess the former seems more performant.
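A minimal sketch of what the proposed method could look like. All type names here are simplified stand-ins, not the real DataFusion types (the actual trait works with ColumnarValue and DataType); the point is that a default implementation delegating to invoke keeps the change non-breaking:

```rust
// Hypothetical sketch of the proposed addition (stand-in types, not the
// real DataFusion API). The default method delegates to `invoke`, so
// existing implementations keep compiling unchanged.
#[derive(Debug, Clone, PartialEq)]
pub enum StandInType {
    Int64 { nullable: bool },
    Struct(Vec<(String, StandInType)>),
}

pub trait ScalarUdfSketch {
    fn invoke(&self, args: &[i64]) -> Result<i64, String>;

    // Proposed: also hand the UDF its previously computed return type.
    fn invoke_with_return_type(
        &self,
        args: &[i64],
        _return_type: &StandInType,
    ) -> Result<i64, String> {
        // Non-breaking default: ignore the extra info and delegate.
        self.invoke(args)
    }
}

pub struct Sum;
impl ScalarUdfSketch for Sum {
    fn invoke(&self, args: &[i64]) -> Result<i64, String> {
        Ok(args.iter().sum())
    }
}
```

A UDF like pack would override invoke_with_return_type and read field nullability from the passed-in type; all other UDFs would be untouched.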

Describe alternatives you've considered

No response

Additional context

I believe this would be a small, non-breaking change that I am happy to contribute.

Any ideas?

@joseph-isaacs joseph-isaacs added the enhancement New feature or request label Oct 8, 2024
@joseph-isaacs joseph-isaacs changed the title Access children DataType in ScalarUDFImpl::invoke Access children DataType or return-type in ScalarUDFImpl::invoke Oct 8, 2024
@findepi
Member

findepi commented Oct 9, 2024

You may want to implement ScalarUDFImpl::simplify, which is given nullability info.
From there you'd return a new ScalarUDFImpl instance with the nullability information stored in a field.
When the information is already stored, a subsequent simplify call (if any) should return the original expression.
Let me know if it works for your case.

At some point we could probably rename simplify to specialize cc @alamb
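If I read the suggestion right, it could be sketched roughly like this (plain-Rust stand-in, not the real ScalarUDFImpl::simplify signature): the first simplify call, running where schema nullability is visible, bakes that info into a new UDF instance, and repeated calls are no-ops:

```rust
// Rough stand-in for the suggested pattern (not the real DataFusion
// `simplify` signature). The first call stores the input nullability in
// a new UDF instance; once stored, later calls return the UDF unchanged.
#[derive(Clone, Debug, PartialEq)]
pub struct PackUdfSketch {
    pub field_nullability: Option<Vec<bool>>,
}

impl PackUdfSketch {
    pub fn new() -> Self {
        Self { field_nullability: None }
    }

    // `simplify` runs where the schema (and thus nullability) is in scope.
    pub fn simplify(&self, input_nullability: &[bool]) -> Self {
        match &self.field_nullability {
            // Already specialized: return the original expression (no-op).
            Some(_) => self.clone(),
            None => Self {
                field_nullability: Some(input_nullability.to_vec()),
            },
        }
    }
}
```

invoke would then read the stored field_nullability instead of guessing; the objection raised later in the thread is that this only works if simplify is guaranteed to run.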

@alamb
Contributor

alamb commented Oct 9, 2024

The suggestion for using simplify is a good one 👍

@joseph-isaacs
Author

joseph-isaacs commented Oct 9, 2024

Hey, thanks for your ideas.

This would mean that the ExprSimplifier would change the return type of the pack UDF.

Say for instance the return type (before simplification) is ("a", int64 nullable), ...; after it would be ("a", int64 non-nullable), ....

This would mean that simplification is required to run for the UDF to behave correctly, since the non-specialized invoke doesn't know whether a specific field is nullable, and would therefore either return a RecordBatch whose schema differs from the return type, or always return each field as nullable (not ideal).

@joseph-isaacs
Author

This would be the (non-breaking) change I am after

@joseph-isaacs
Author

@alamb or @findepi. It would be great to get a response here. Thanks

@alamb
Contributor

alamb commented Oct 22, 2024

I just read this again (quickly)

I wonder if you could (re) compute the return type from the input types using the existing return_type method:

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#tymethod.return_type

@joseph-isaacs
Author

Thanks for your reply.

What do you think about the above proposed changes?

When, in the creation/execution path, do you mean to re-compute the return_type? At invoke time?

I need this type info when calling invoke (invoke is given a [ColumnarValue]), which is missing the necessary type information to reconstruct the type, and I cannot call return_type, which takes [DataType], since the [DataType]s are not around at invoke time.

@alamb
Contributor

alamb commented Oct 22, 2024

When, in the creation/execution path, do you mean to re-compute the return_type? At invoke time?

Yes

invoke is given a [ColumnarValue], which is missing the necessary type information to reconstruct the type

I think you can get it from: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html#method.data_type

@alamb
Contributor

alamb commented Oct 22, 2024

What do you think about the above proposed changes?

I think they are unnecessary if we can achieve the same effect without changing the API.

@joseph-isaacs
Author

I am sorry, my previous response was not quite what I was asking about.

I think your proposed solution is:

fn invoke(&self, args: &[ColumnarValue]) -> DFResult<ColumnarValue> {
  // Recover the input types from the ColumnarValues, then re-derive the
  // return type via the existing `return_type` method.
  let arg_types: Vec<DataType> = args.iter().map(|cv| cv.data_type()).collect();
  let return_type = self.return_type(&arg_types)?;
  ...
}

I was hoping to use the result of return_type_from_exprs(self, &[Expr], &dyn ExprSchema, &[DataType]) (in invoke), since there I call expr.data_type_and_nullable(schema)? on the inputs to get the correct nullability.

@alamb
Contributor

alamb commented Oct 24, 2024

I think your proposed solution is?

Yes that is what I was thinking

I was hoping to use the result of return_type_from_exprs(self, &[Expr], &dyn ExprSchema, &[DataType]) (in invoke) since I call expr.data_type_and_nullable(schema)? on the inputs to get the correct nullability.

Ah, I see -- the nullability isn't available from the DataTypes.

🤔

@joseph-isaacs
Author

joseph-isaacs commented Oct 25, 2024

Unfortunately not. Do you have any other ideas?

@alamb
Contributor

alamb commented Oct 26, 2024

🤔 It seems like the more general thing to do would be to change return-type to return a Field rather than a DataType -- maybe we could add something backwards compatible

@findepi
Member

findepi commented Oct 26, 2024

Field has a name attribute, which a function's top-level return value doesn't have (a function result is unnamed).
Top-level nullability is already reported by ScalarUDFImpl::is_nullable.
Complex types (lists, structs) already involve Fields, and those fields can be marked nullable or non-null.

@joseph-isaacs
Author

joseph-isaacs commented Oct 28, 2024

It would be great to figure out a way to pass (or return) this nullability info to invoke at execution time. Do you think any of the suggestions in the initial post would be a good idea?

It seems like nullability wasn't fully considered for functions, since is_nullable and return_type_from_exprs were added later; it would be great to have this info throughout the trait.

@alamb
Contributor

alamb commented Oct 29, 2024

It would be great to figure out a way to pass (or return) this nullability info to invoke at execution time. Do you think any of the suggestions in the initial post would be a good idea?

I worry about making too many different combinations of invoke / invoke_no_args / invoke_nullable, etc

It seems like nullability wasn't fully considered for functions, since is_nullable and return_type_from_exprs were added later; it would be great to have this info throughout the trait.

I agree nullability was not fully considered and having a proper nullability story would be good.

I think one thing that happened is that much of DataFusion's code assumes the input can be nullable and then switches to special-cased no-null code at runtime based on Array::null_count rather than relying on the schema.

That made it less important to ensure the schema's nullability aligned, so there are a bunch of places that are fast and loose with it.
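The runtime pattern described here can be illustrated with a self-contained stand-in (real DataFusion code does this against arrow's Array::null_count, not a Vec of Options):

```rust
// Illustration of branching on the actual null count at runtime instead of
// trusting the schema's declared nullability (stand-in types; real code
// checks arrow's `Array::null_count`).
pub fn sum_with_fast_path(values: &[Option<i64>]) -> i64 {
    let null_count = values.iter().filter(|v| v.is_none()).count();
    if null_count == 0 {
        // Fast path: no nulls actually present, skip per-element checks.
        values.iter().map(|v| v.unwrap()).sum()
    } else {
        // Slow path: skip nulls element by element.
        values.iter().flatten().sum()
    }
}
```

Because correctness comes from the data itself, a schema field marked nullable over all-valid data still takes the fast path, which is why mismatched schema nullability rarely surfaces as a bug.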

@joseph-isaacs
Author

Any chance we can make a decision as to whether this will be supported or not?

@alamb
Contributor

alamb commented Nov 5, 2024

I think there is a compelling case to be made if there is no way to implement a feature with the current API.

So what I would recommend is

  1. Create a PR with an example showing what you are trying to do and any API change you want to propose

Then we can perhaps have a more specific conversation about the specific APIs and how else you might be able to solve your challenge.

@joseph-isaacs
Author

joseph-isaacs commented Nov 6, 2024

@alamb
Contributor

alamb commented Nov 6, 2024

Cool, https://github.com/spiraldb/datafusion/pull/1/files

Thanks!

Is there any way you can add an example of the pack UDF that shows how this API would be used?

@joseph-isaacs
Author

joseph-isaacs commented Nov 7, 2024

Here is an example of the usage #13289, with the API updated to invoke_batch.

@joseph-isaacs
Author

Here is the draft PR #13290

@findepi
Member

findepi commented Nov 11, 2024

This would mean that the ExprSimplifier would change the return type of the pack UDF.

Say for instance the return type (before simplification) is ("a", int64 nullable), ...; after it would be ("a", int64 non-nullable), ....

I don't yet see that it would.

What is this analysis missing?

@gatesn

gatesn commented Nov 11, 2024

I think the objection is that, by all understanding, simplify is an optional optimization step. The result of an expression shouldn't change based on whether it was run.

In your proposal, simplification is required for a correct result to be produced.

@findepi
Member

findepi commented Nov 11, 2024

That's a good point, @gatesn!

What if we had a similar (same?) operation, but with different guarantees (i.e. a mandatory step)? It could be called "prepare", for example. Would that help?

BTW, I would expect simplify() to be guaranteed. For example, if simplify replaces my (hypothetical) my_like_udf with e.g. regexp_like, I shouldn't need to provide an implementation of my_like_udf. If it's not supposed to be called, it would be dead code, a useless impl burden. Does that make sense?

@gatesn

gatesn commented Nov 11, 2024

Yes that makes sense.

Perhaps some of the conflict comes from the fact that we're using DataFusion's expression library without the LogicalPlan or any of the relational algebra. It is mostly working OK, though we have found a few structs that could do with being made public! Is this use case intended to be supported by DataFusion? Or should we be wrapping everything up in a LogicalPlan for execution?

I mention this because, if we are expected to go via LogicalPlans for execution, then I would think simplification is far more likely to run than if we construct and evaluate the expressions ourselves.

As a side note, all this annoyance ultimately comes from nullability living inside a Field instead of inside a DataType. But this is an Arrow decision we have to live with 🤷 (or do we...?)

I'm not quite sure why adding a new prepare function would be less disruptive to the API than passing the return_type into invoke? Perhaps making it invoke(args: InvokeArgs) would protect against future unforeseen API breaks?
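For illustration, the args-struct pattern being suggested might look like this (all names here are hypothetical stand-ins, not DataFusion's actual types): bundling parameters into a struct means a later addition, say nullability info, extends the struct rather than breaking every implementor's method signature.

```rust
// Hypothetical sketch of the `invoke(args: InvokeArgs)` idea (stand-in
// types). New fields can be added to the struct later without changing
// the method signature that every UDF implements.
pub struct InvokeArgsSketch<'a> {
    /// Stand-in for `&[ColumnarValue]` (the evaluated arguments).
    pub args: &'a [i64],
    /// Stand-in for the precomputed return `DataType`.
    pub return_type: &'a str,
    // future fields (e.g. input nullability) would slot in here
}

pub fn invoke(call: InvokeArgsSketch<'_>) -> i64 {
    // A UDF that needs the return type reads `call.return_type`;
    // this one only needs the evaluated arguments.
    call.args.iter().sum()
}
```

This mirrors the AccumulatorArgs approach mentioned below for aggregate functions: the framework constructs the struct, so adding a field is a change in one place rather than in every trait implementation.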

@alamb
Contributor

alamb commented Nov 13, 2024

As a side note, all this annoyance ultimately comes from nullability living inside a Field instead of inside a DataType. But this is an Arrow decision we have to live with 🤷 (or do we...?)

Yes I agree.

I'm not quite sure why adding a new prepare function would be less disruptive to the API than passing in the return_type to invoke? Perhaps making it invoke(args: InvokeArgs) would protect against future unforeseen API breaks?

One reason we are likely a bit worried about changing the API again is that we are in the process of changing invoke already:

So churning it again will be annoying downstream. That being said, I think it may be worth it if we follow your suggestion to make it args: InvokeArgs. This helps:

  1. Protect against other API changes
  2. Stay consistent with AccumulatorArgs

@joseph-isaacs
Author

I guess changing the API again soon might reduce the number of people that have to migrate twice?

@alamb
Contributor

alamb commented Nov 13, 2024

I guess changing the API again soon might reduce the number of people that migrate twice?

Unfortunately, since we just released DataFusion 43.0.0, the new API has already shipped (though the old API is marked as deprecated, which I think eases the upgrade pain -- it gives downstream users a release to do the upgrade).
