-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move coalesce to datafusion-functions and remove BuiltInScalarFunction #10098
Conversation
@@ -1276,7 +1260,7 @@ impl Expr { | |||
pub fn short_circuits(&self) -> bool { | |||
match self { | |||
Expr::ScalarFunction(ScalarFunction { func_def, .. }) => { | |||
matches!(func_def, ScalarFunctionDefinition::BuiltIn(fun) if *fun == BuiltinScalarFunction::Coalesce) | |||
matches!(func_def, ScalarFunctionDefinition::UDF(fun) if fun.name().eq("coalesce")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a good way to directly refer to the coalesce function here so went with checking by name. There is another by-name check in /physical-expr/src/scalar_function.rs for make_array so I believe this is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I"m thinking if its possible if user creates udf with coalesce
name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that and my thought was that if that was done it would replace the existing coalesce function and one would think it would (hopefully) have similar functionality. If there is another way to determine it here it should be a reasonable change. Just importing the crate though isn't possible as best as I can tell because that would cause a circular dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the new API is_short_circuits()
(default to false, when need set to true) to ScalarUDF
and ScalarUDFImpl
might be a good way to do this because users may want to define their own short-circuit functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an option but I think if we go that route it should be it's own PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that @haohuaijin 's idea is a good one (and keeping with the spirit of "anything we can do with built in functions can be done with ScalarAPIs"). I filed #10162 to track
input_schema, | ||
&execution_props, | ||
)? | ||
return Err(proto_error(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the only value left for ScalarFunction in the datafusion.proto is 'unknown' this implementation seemed reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could remove the ScalarFunction part of proto too
In a follow on PR we could probably remove
datafusion/datafusion/proto/proto/datafusion.proto
Lines 541 to 686 in 70db5ea
enum ScalarFunction { | |
// 0 was Abs before | |
// The first enum value must be zero for open enums | |
unknown = 0; | |
// 1 was Acos | |
// 2 was Asin | |
// 3 was Atan | |
// 4 was Ascii | |
// 5 was Ceil | |
// 6 was Cos | |
// 7 was Digest | |
// 8 was Exp | |
// 9 was Floor | |
// 10 was Ln | |
// 11 was Log | |
// 12 was Log10 | |
// 13 was Log2 | |
// 14 was Round | |
// 15 was Signum | |
// 16 was Sin | |
// 17 was Sqrt | |
// Tan = 18; | |
// 19 was Trunc | |
// 20 was Array | |
// RegexpMatch = 21; | |
// 22 was BitLength | |
// 23 was Btrim | |
// 24 was CharacterLength | |
// 25 was Chr | |
// 26 was Concat | |
// 27 was ConcatWithSeparator | |
// 28 was DatePart | |
// 29 was DateTrunc | |
// 30 was InitCap | |
// 31 was Left | |
// 32 was Lpad | |
// 33 was Lower | |
// 34 was Ltrim | |
// 35 was MD5 | |
// 36 was NullIf | |
// 37 was OctetLength | |
// 38 was Random | |
// 39 was RegexpReplace | |
// 40 was Repeat | |
// 41 was Replace | |
// 42 was Reverse | |
// 43 was Right | |
// 44 was Rpad | |
// 45 was Rtrim | |
// 46 was SHA224 | |
// 47 was SHA256 | |
// 48 was SHA384 | |
// 49 was SHA512 | |
// 50 was SplitPart | |
// StartsWith = 51; | |
// 52 was Strpos | |
// 53 was Substr | |
// ToHex = 54; | |
// 55 was ToTimestamp | |
// 56 was ToTimestampMillis | |
// 57 was ToTimestampMicros | |
// 58 was ToTimestampSeconds | |
// 59 was Now | |
// 60 was Translate | |
// Trim = 61; | |
// Upper = 62; | |
Coalesce = 63; | |
// 64 was Power | |
// 65 was StructFun | |
// 66 was FromUnixtime | |
// 67 Atan2 | |
// 68 was DateBin | |
// 69 was ArrowTypeof | |
// 70 was CurrentDate | |
// 71 was CurrentTime | |
// 72 was Uuid | |
// 73 was Cbrt | |
// 74 Acosh | |
// 75 was Asinh | |
// 76 was Atanh | |
// 77 was Sinh | |
// 78 was Cosh | |
// Tanh = 79 | |
// 80 was Pi | |
// 81 was Degrees | |
// 82 was Radians | |
// 83 was Factorial | |
// 84 was Lcm | |
// 85 was Gcd | |
// 86 was ArrayAppend | |
// 87 was ArrayConcat | |
// 88 was ArrayDims | |
// 89 was ArrayRepeat | |
// 90 was ArrayLength | |
// 91 was ArrayNdims | |
// 92 was ArrayPosition | |
// 93 was ArrayPositions | |
// 94 was ArrayPrepend | |
// 95 was ArrayRemove | |
// 96 was ArrayReplace | |
// 97 was ArrayToString | |
// 98 was Cardinality | |
// 99 was ArrayElement | |
// 100 was ArraySlice | |
// 103 was Cot | |
// 104 was ArrayHas | |
// 105 was ArrayHasAny | |
// 106 was ArrayHasAll | |
// 107 was ArrayRemoveN | |
// 108 was ArrayReplaceN | |
// 109 was ArrayRemoveAll | |
// 110 was ArrayReplaceAll | |
// 111 was Nanvl | |
// 112 was Flatten | |
// 113 was IsNan | |
// 114 was Iszero | |
// 115 was ArrayEmpty | |
// 116 was ArrayPopBack | |
// 117 was StringToArray | |
// 118 was ToTimestampNanos | |
// 119 was ArrayIntersect | |
// 120 was ArrayUnion | |
// 121 was OverLay | |
// 122 is Range | |
// 123 is ArrayExcept | |
// 124 was ArrayPopFront | |
// 125 was Levenshtein | |
// 126 was SubstrIndex | |
// 127 was FindInSet | |
// 128 was ArraySort | |
// 129 was ArrayDistinct | |
// 130 was ArrayResize | |
// 131 was EndsWith | |
// 132 was InStr | |
// 133 was MakeDate | |
// 134 was ArrayReverse | |
// 135 is RegexpLike | |
// 136 was ToChar | |
// 137 was ToDate | |
// 138 was ToUnixtime | |
} | |
message ScalarFunctionNode { | |
ScalarFunction fun = 1; | |
repeated LogicalExprNode args = 2; | |
} |
And the references to ScalarFunctionNode
|
||
let ctx = SessionContext::new(); | ||
let lit = Expr::Literal(ScalarValue::Utf8(None)); | ||
for builtin_fun in BuiltinScalarFunction::iter() { | ||
for function in string::functions() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using all the string functions here seemed like a reasonable choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I didn't remove ScalarFunction from this file as I think that is a change best left to another PR.
It's the big milestone Thanks, @Omega359 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @Omega359 -- it has been quite a ride but it is amazing to see this project at the completion.
I updated the description of this PR to close a few more tickets as well as filed a few follow on cleanups I think we can do.
I'll plan to merge this PR tomorrow unless anyone else would like time to review
Thanks again to everyone who helped make this happen
/// Resolved to a `BuiltinScalarFunction` | ||
/// There is plan to migrate `BuiltinScalarFunction` to UDF-based implementation (issue#8045) | ||
/// This variant is planned to be removed in long term | ||
BuiltIn(BuiltinScalarFunction), | ||
/// Resolved to a user defined function | ||
UDF(Arc<crate::ScalarUDF>), | ||
/// A scalar function constructed with name. This variant can not be executed directly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, I don't think Name is ever used anymore -- we could eventually simply remove ScalarFunctionDefinition
to avoid another level of wrapping.
However, I think there has been enough API churn for a while. We can maybe clean that up as a follow on PR in a few months or somthing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On argument against keeping it: if it's unused and you keep it around, someone will likely use it again and make the removal even harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in #10175
@@ -1276,7 +1260,7 @@ impl Expr { | |||
pub fn short_circuits(&self) -> bool { | |||
match self { | |||
Expr::ScalarFunction(ScalarFunction { func_def, .. }) => { | |||
matches!(func_def, ScalarFunctionDefinition::BuiltIn(fun) if *fun == BuiltinScalarFunction::Coalesce) | |||
matches!(func_def, ScalarFunctionDefinition::UDF(fun) if fun.name().eq("coalesce")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that @haohuaijin 's idea is a good one (and keeping with the spirit of "anything we can do with built in functions can be done with ScalarAPIs"). I filed #10162 to track
@@ -0,0 +1,141 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally expect to find coalesce in the core
functions module rather than the math
module, but we could do that as a follow on PR (or never).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea why I put this in the math module. 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in #10174
(true, Err(_)) | ||
if self.supports_zero_argument && self.name != "make_array" => | ||
{ | ||
// MakeArray support zero argument but has the different behavior from the array with one null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strange special case (that might be left over). @jayzhan211 or @Weijun-H do you remember if this is still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should even return empty vec instead of null array here. Those logic should be handled in each function case by case to avoid adding specific assumptions to all the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File #10193 to fix this
@@ -412,16 +411,6 @@ impl From<&protobuf::StringifiedPlan> for StringifiedPlan { | |||
} | |||
} | |||
|
|||
impl From<&protobuf::ScalarFunction> for BuiltinScalarFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the correct change -- now all functions are resolved by name.
While this is a breaking API change, most other changes to the protobuf definitions are breaking changes -- see
datafusion/datafusion/proto/src/lib.rs
Lines 37 to 42 in 70db5ea
//! # Version Compatibility | |
//! | |
//! The serialized form are not guaranteed to be compatible across | |
//! DataFusion versions. A plan serialized with one version of DataFusion | |
//! may not be able to deserialized with a different version. | |
//! |
input_schema, | ||
&execution_props, | ||
)? | ||
return Err(proto_error(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could remove the ScalarFunction part of proto too
In a follow on PR we could probably remove
datafusion/datafusion/proto/proto/datafusion.proto
Lines 541 to 686 in 70db5ea
enum ScalarFunction { | |
// 0 was Abs before | |
// The first enum value must be zero for open enums | |
unknown = 0; | |
// 1 was Acos | |
// 2 was Asin | |
// 3 was Atan | |
// 4 was Ascii | |
// 5 was Ceil | |
// 6 was Cos | |
// 7 was Digest | |
// 8 was Exp | |
// 9 was Floor | |
// 10 was Ln | |
// 11 was Log | |
// 12 was Log10 | |
// 13 was Log2 | |
// 14 was Round | |
// 15 was Signum | |
// 16 was Sin | |
// 17 was Sqrt | |
// Tan = 18; | |
// 19 was Trunc | |
// 20 was Array | |
// RegexpMatch = 21; | |
// 22 was BitLength | |
// 23 was Btrim | |
// 24 was CharacterLength | |
// 25 was Chr | |
// 26 was Concat | |
// 27 was ConcatWithSeparator | |
// 28 was DatePart | |
// 29 was DateTrunc | |
// 30 was InitCap | |
// 31 was Left | |
// 32 was Lpad | |
// 33 was Lower | |
// 34 was Ltrim | |
// 35 was MD5 | |
// 36 was NullIf | |
// 37 was OctetLength | |
// 38 was Random | |
// 39 was RegexpReplace | |
// 40 was Repeat | |
// 41 was Replace | |
// 42 was Reverse | |
// 43 was Right | |
// 44 was Rpad | |
// 45 was Rtrim | |
// 46 was SHA224 | |
// 47 was SHA256 | |
// 48 was SHA384 | |
// 49 was SHA512 | |
// 50 was SplitPart | |
// StartsWith = 51; | |
// 52 was Strpos | |
// 53 was Substr | |
// ToHex = 54; | |
// 55 was ToTimestamp | |
// 56 was ToTimestampMillis | |
// 57 was ToTimestampMicros | |
// 58 was ToTimestampSeconds | |
// 59 was Now | |
// 60 was Translate | |
// Trim = 61; | |
// Upper = 62; | |
Coalesce = 63; | |
// 64 was Power | |
// 65 was StructFun | |
// 66 was FromUnixtime | |
// 67 Atan2 | |
// 68 was DateBin | |
// 69 was ArrowTypeof | |
// 70 was CurrentDate | |
// 71 was CurrentTime | |
// 72 was Uuid | |
// 73 was Cbrt | |
// 74 Acosh | |
// 75 was Asinh | |
// 76 was Atanh | |
// 77 was Sinh | |
// 78 was Cosh | |
// Tanh = 79 | |
// 80 was Pi | |
// 81 was Degrees | |
// 82 was Radians | |
// 83 was Factorial | |
// 84 was Lcm | |
// 85 was Gcd | |
// 86 was ArrayAppend | |
// 87 was ArrayConcat | |
// 88 was ArrayDims | |
// 89 was ArrayRepeat | |
// 90 was ArrayLength | |
// 91 was ArrayNdims | |
// 92 was ArrayPosition | |
// 93 was ArrayPositions | |
// 94 was ArrayPrepend | |
// 95 was ArrayRemove | |
// 96 was ArrayReplace | |
// 97 was ArrayToString | |
// 98 was Cardinality | |
// 99 was ArrayElement | |
// 100 was ArraySlice | |
// 103 was Cot | |
// 104 was ArrayHas | |
// 105 was ArrayHasAny | |
// 106 was ArrayHasAll | |
// 107 was ArrayRemoveN | |
// 108 was ArrayReplaceN | |
// 109 was ArrayRemoveAll | |
// 110 was ArrayReplaceAll | |
// 111 was Nanvl | |
// 112 was Flatten | |
// 113 was IsNan | |
// 114 was Iszero | |
// 115 was ArrayEmpty | |
// 116 was ArrayPopBack | |
// 117 was StringToArray | |
// 118 was ToTimestampNanos | |
// 119 was ArrayIntersect | |
// 120 was ArrayUnion | |
// 121 was OverLay | |
// 122 is Range | |
// 123 is ArrayExcept | |
// 124 was ArrayPopFront | |
// 125 was Levenshtein | |
// 126 was SubstrIndex | |
// 127 was FindInSet | |
// 128 was ArraySort | |
// 129 was ArrayDistinct | |
// 130 was ArrayResize | |
// 131 was EndsWith | |
// 132 was InStr | |
// 133 was MakeDate | |
// 134 was ArrayReverse | |
// 135 is RegexpLike | |
// 136 was ToChar | |
// 137 was ToDate | |
// 138 was ToUnixtime | |
} | |
message ScalarFunctionNode { | |
ScalarFunction fun = 1; | |
repeated LogicalExprNode args = 2; | |
} |
And the references to ScalarFunctionNode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks again @Omega359 and everyone else involved in making this happen (and welcome back @liukun4515, it is great to see you in the repo again!) |
Upstream is continuing it's migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372
These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098
* chore: upgrade datafusion Deps Ref #690 * update concat and concat_ws to use datafusion_functions Moved in apache/datafusion#10089 * feat: upgrade functions.rs Upstream is continuing it's migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372 * fix ScalarUDF import * feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown Deprecated function removed in apache/datafusion#9923 * use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option` * remove ScalarFunction wrappers These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098 * update dataframe `test_describe` `null_count` was fixed upstream. Ref apache/datafusion#10260 * remove PyDFField and related methods DFField was removed upstream. Ref: apache/datafusion#9595 * bump `datafusion-python` package version to 38.0.0 * re-implement `PyExpr::column_name` The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595
Which issue does this PR close?
Closes #10090
Closes #9285
Closes #8045
Rationale for this change
Complete builtin scalar function migration
What changes are included in this PR?
Source, tests.
Are these changes tested?
Yes, all tests and benchmarks pass. I would like reviewers to please go over this in detail as a lot was changed and removed in this PR. I want to especially note the changes in the proto module for which I am uncertain as to whether backwards compatibility is a requirement (and if so, it definitely will break that).
Are there any user-facing changes?
BuiltinScalarFunction is no more. I'm pretty sure that will be a breaking change for many users that extend DF.