-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create physical scalar expression in functions
modules from string (name)
#9892
Comments
I think you can do this via one of these functions (I can't link to rustdocs as 37 hasn't yet been released and the functions are made via macro): |
Hmm, I think you mean that fn string_to_scalar_udf(udf_name: &str) -> Arc<ScalarUDF> {
match udf_name {
"ascii" -> datafusion_functions::string::ascii(),
...
}
} It is doable thought it means we need to maintain a long list of matches, but I'm also wondering if there is a better built-in solution in DataFusion. |
The names of the functions would be in the list of udf's available in the sessionState
|
Oh, Although it is still not creating the As I described in the description, for queries which don't use most of these scalar udfs, it looks unnecessary to create and hold these I think a better approach should be to create |
Ah, I misread the description, apologies. It's an interesting idea. I don't see an easy solution right now especially since the udf's are held in a singleton once called (see https://github.com/apache/arrow-datafusion/blob/d8d521ac8b90002fa0ba1f91456051a9775ae193/datafusion/functions/src/macros.rs#L66) so any memory savings from not using the udf would evaporate after the first use. I think it may be possible with a change to the make_udf_function to remove the singleton code however I'm not sure of the wisdom of that. |
I think adding a function that did this matching to datafusion seems like a good idea to me. You could probably write a test to ensure it was kept in sync with the list of functions pretty easily.
What is the concern about creating a bunch of |
Yea, I suppose that these |
BTW while updating influxdb we ended up with a bunch of the following in our code. I think the API @viirya is describing on this ticket would help us too let date_bin = datafusion::functions::datetime::functions()
.iter()
.find(|fun| fun.name().eq("date_bin"))
.expect("should have date_bin UDF")
.to_owned(); |
functions
modules from string functions
modules from string (name)
How about we introduce If I understand correctly, only at the point we call the closure, we will run this function. pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
[< STATIC_ $UDF >]
.get_or_init(|| {
std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
<$UDF>::new(),
))
})
.clone()
} https://users.rust-lang.org/t/cost-of-creating-closure-vs-cost-of-creating-struct/33101/5 |
Note that the use of name -> function is used internally within scalar_function.rs: Having a good name -> fn lookup mechanism would help here I think. |
I was thinking maybe we can even avoid new types / macros with a function like this: fn get_udf(name: &str) -> Option<Arc<ScalarUDF>> {
// build hash table of deferred functions that create functions on demand
// (note this hash table would probably be built in a static OnceLock or something)
let functions: HashMap<_, _> = [
("abs", Box::new(|| crate::math::abs()),
("sin", Box::new(|| crate::math::sin()),
("date_bin", Box::new(|| crate::datetime::date_bin()),
...
].into_iter().collect();
functions.get(name).map(|factory| (factor)())
} Then we could then ensure this was in sync with a test something like for expected_function in datafusion_functions::all_functions() {
assert!(get_udf(expected_function.name()).is_some())
} |
I can work on this. |
Thanks @jayzhan211 I think both patterns are needed -- one that gets all UDFs (what register_all currently does) as well as a "get me just a single udf" |
fn get_function_meta(&self, name: &str) -> Option<Arc<ScalarUDF>> {
self.state.scalar_functions().get(name).cloned()
} I was considering replacing the |
I draft the idea here, I think the old function can be deprecated, but it will be an API change design, should we deprecate them? use std::sync::OnceLock;
pub type ScalarFactory = Box<dyn Fn() -> Arc<ScalarUDF> + Send + Sync>;
/// HashMap Singleton for UDFs
///
/// Replace register_all with our built-in functions
/// Replace scalar_functions: HashMap<String, Arc<ScalarUDF>> in SessionState
pub fn scalar_functions() -> &'static Mutex<HashMap<String, ScalarFactory>> {
static FUNCTIONS: OnceLock<Mutex<HashMap<String, ScalarFactory>>> = OnceLock::new();
FUNCTIONS.get_or_init(|| {
let mut functions = HashMap::new();
functions.insert(
String::from("array_to_string"),
Box::new(string::array_to_string_udf) as _,
);
// TODO: Add more builtin functions here
Mutex::new(functions)
})
}
// Get an UDF by name
//
// Replace with `get_udf`
// fn get_function_meta(&self, name: &str) -> Option<Arc<ScalarUDF>> {
// self.state.scalar_functions().get(name).cloned()
// }
pub fn get_udf(name: &str) -> Option<Arc<ScalarUDF>> {
scalar_functions().lock().unwrap().get(name).map(|f| f())
}
/// Register a single new UDF, so the user can register their own functions
///
/// Repalce old regsiter_udf
pub fn register_udf(name: &str, udf: ScalarFactory) -> Option<ScalarFactory> {
scalar_functions().lock().unwrap().insert(name.to_string(), udf)
} |
I am not sure about this -- I think it would be better if the data in the module was static / not mutable at runtime. If users want to register their own functions, they can do so in a
How about we try what having both APIs would look like? Maybe it is too much duplication, but I bet it would be pretty minimal |
Is your feature request related to a problem or challenge?
Create physical scalar expression in functions modules from string
Currently more and more built-in scalar functions are moved to
functions
modules, e.g. #9435. It avoids using a long enum of all built-in scalar functions which is hard to maintain. But for Comet, we rely the ability to create a physical scalar expression from string (e.g.,datepart
).Previously it is easy and just calls
BuiltinScalarFunction::from_str
to getBuiltinScalarFunction
. But now I don't see such convenient function to do that.FunctionRegistry
providesudf
which can return a reference toScalarUDF
. But it requires these UDFs must be registered. As we don't know what UDFs will be used, we need to register all built-in UDFs in the registry. The flaw is, it will createScalarUDF
s for all built-in UDFs even they are not actually used in the queries.I think we still need an approach that can simply create a physical scalar expression in
functions
modules from string. So we can create correspondingScalarUDF
on demand.Another approach might be to avoid creating
ScalarUDF
s when registering built-in scalar functions.Avoid creating
ScalarUDF
s before they are actually used when registering inFunctionRegistry
Actually, I am also wondering if it is necessary to create and register all these
ScalaUDF
s in DataFusion'sFunctionRegistry
before these scalar UDFs are actually used.For example, Spark's
FunctionRegistry
registers expression builders instead of creating actual expressions when registering built-in expressions. A built-in expression is created only if it is actually used by a query.Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: