-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Access children DataType
or return-type in ScalarUDFImpl::invoke
#12819
Comments
DataType
in ScalarUDFImpl::invoke
DataType
or return-type in ScalarUDFImpl::invoke
You may want to implement At some point we could probably rename |
The suggestion for using |
Hey, thanks for your ideas. This mean that the Say for instance the return (before simplification) is This would mean that the simplification is required to run for the UDF to behave correctly, since the non-specialized invoke doesn't know if a specific field is nullable or not and therefore possibly return a RecordBatch with different schema the return type or always return each field as nullable (not ideal). |
This would be the (non-breaking) change I am after |
I just read this again (quickly) I wonder if you could (re) compute the return type from the input types using the existing |
Thanks for your reply. What do you think about the above proposed changed? When, in the creation/execution path, do you mean re-compute the I need this type-info when calling invoke (invoke is given a |
Yes
I think you can get it from: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html#method.data_type |
I think they are unecessary if we can achieve the same effect without changing the API |
I am sorry for my previous response it was not quite what I was asking for. I think your proposed solution is? fn invoke(&self, args: &[ColumnarValue]) -> DFResult<ColumnarValue> {
let arg_types = args.iter().map(|cv| cv.data_type()).collect_vec();
let return_type = self.return_type(&arg_types)?;
...
} I was hoping to use the result of |
Yes that is what I was thinking
Ah, I see -- the nullability isn't available from the DataTypes. 🤔 |
Unfortunately not. Do you have any other ideas? |
🤔 It seems like the more general thing to do would be to change |
|
It would be great to figure out a way to pass this nullability info (or return) the invoke at execution time. Do you think any of the suggestion in the initial post would be a good idea? It seems like the nullability wasn't fully considered for function since the |
I worry about making too many different combinations of invoke / invoke_no_args / invoke_nullable, etc
I agree nullability was not fully considered and having a proper nullability story would be good. I think one thing that happened is that much of DataFusion's code assumes the input can be nullable and then switches to special case no-null code at runtime based on That made ensuring that the schema's nullability aligned wasn't as important and so there are a bunch of places which are fast and loose with it |
Any chance we can make a decision as if this will be supported or not? |
I think there is a compelling story to be made if there is no way to implement a feature with the current API So what I would recommend is
Then we can perhaps have a more specific conversation about the specific APIs and how else you might be able to solve your challenge. |
Thanks! Is there any way you can add an example of the |
Here is a example of the usage #13289, with the API updated to |
Here is the draft PR #13290 |
I don't see yet that it would
What is this analysis missing? |
I think the objection is that, by all understanding, In your proposal, simplification is required for a correct result to be produced. |
That's a good point, @gatesn! What if we had similar (same?) operation, but with different guarantees (ie mandatory step). Could be called "prepare" for example. Would it help? BTW, I would expect |
Yes that makes sense. Perhaps some of the conflict comes from the fact that we're using DataFusion's expression library without the LogicalPlan or any of the relational algebra. It is mostly working OK, we have found a few structs that could do with being made public! Is this use-case intended to be supported by DataFusion? Or should we be wrapping everything up in a LogicalPlan for execution? I mention this because, if we are expected to go via LogicalPlans for execution, then I would think that simplification is far more guaranteed to run than if we construct and evaluate the expressions ourselves. As a side note, all this annoyance ultimately comes from nullability living inside a Field instead of inside a DataType. But this is an Arrow decision we have to live with 🤷 (or do we...?) I'm not quite sure why adding a new |
Yes I agree.
One reason we are likely a bit worried about changing the API again is that we are in the process of changing invoke already: So churning it again will be annoying downstream. That being said, I think it may be worth it if we follow your suggestion to make it
|
I guess changing the API again soon might reduce the number of people that migrate twice? |
Unfortunately, since we just released DataFusion 43.0.0 the new API has been released now (though the old API is marked as deprecated which I think eases the upgrade pain -- it gives them a release to do the upgrade) |
Is your feature request related to a problem or challenge?
I am trying to create a scalar UDF, pack, which operates on struct arrays. It packs many array into a struct array each with a distinct name
pack(("a", arr1), ("b", arr2), ...) -> struct([("a", arr1.data_type), ("b", arr2.data_type), ...])
This has a data type dependent on the input type and nullability. In the method
ScalarUDFImpl::invoke
I want to return an a struct array with each field having the data type and nullability of the input, however the invoke function only gives the data type of the array not the nullability of the record batch or intermediate children expressions.I have returned this type information from
return_type_from_exprs
, I just need to access this in the stateless scalar udf impl.Describe the solution you'd like
I would like add a new
ScalarUDFImpl::invoke_with_data_type
(orinvoke_with_return_type
) method which is given both the evaluated children array (as previously) and also either the previously returned type (fromreturn_type_from_exprs
) or the arguments already passed toreturn_type_from_exprs
which could be re-evaluated by invoke. I am open to either, I guess the former seems more performant.Describe alternatives you've considered
No response
Additional context
I believe this would be a small non-breaking, change, that I am happy to contribute.
Any ideas?
The text was updated successfully, but these errors were encountered: