-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
During work on adding a 'safe' mode to to_timestamp and to_date UDF functions I've come across an issue that I would like feedback on before proceeding.
The feature
Currently for timestamp and date parsing if a source string cannot be parsed using any of the provided chrono formats datafusion will return an error. This is normal for a database-type solution however it is not ideal for a system that is parsing billions of dates in batches - some of which are human entered. Systems such as Spark default to a null value for anything that cannot be parsed and this feature enables a mode ('safe' to mirror the name and same behaviour as CastOptions) that allows the to_timestamp* and to_date UDFs to have the same behaviour (return null on error).
The problem
Since UDF functions have no context provided and there isn't a way I know of to statically get access to config to add the above mentioned functionality I resorted to using a new constructor function to allow the UDF to switch behaviour:
#[derive(Debug)]
pub struct ToTimestampFunc {
signature: Signature,
/// how to handle cast or parsing failures, either return NULL (safe=true) or return ERR (safe=false)
pub safe: bool,
}
impl ToTimestampFunc {
pub fn new() -> Self {
Self {
signature: Signature::variadic_any(Volatility::Immutable),
safe: false,
}
}
pub fn new_with_safe(safe: bool) -> Self {
Self {
signature: Signature::variadic_any(Volatility::Immutable),
safe,
}
}
}
To use the alternative 'safe' mode for these functions is as simple as
let to_timestamp = ToTimestampFunc::new_with_safe(true);
session_context.register_udf(ScalarUDF::new_from_impl(to_timestamp));
Unfortunately this only affect sql queries - any calls to the to_timestamp(args: Vec) function will not use the new definition as registered in the function registry. This is because that function and every other function like it use a static singleton instance that only uses a ::new() call to initial it and there is no way that I can see to replace that instance.
Describe the solution you'd like
I see a few possible solutions to this:
- Acknowledge the difference in behaviour in documentation and recommend using
ctx.udf("to_timestamp").unwrap().call(args)instead of the to_timestamp() function anytime 'safe' mode is required. This is less than ideal imho as it can lead to confusion and unintuitive behavior. - Provide a mechanism (function argument to invoke perhaps) to provide a SessionContext to all UDF's so that they can adjust behaviour based on configuration at call time.
- Update the to_timestamp() function and all similar functions to have alternatives that do not use the statically defined version of the UDF but rather create it on-demand or use the definition saved in the session state. As examples:
// the existing function
pub fn to_timestamp(args: Vec<Expr>) -> Expr {
super::to_timestamp().call(args)
}
pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
ScalarUDF::new_from_impl(ToTimestampFunc::new_with_safe(true)).call(args)
}
// or
pub fn to_timestamp_from_context(ctx &Arc<SessionContext>, args: Vec<Expr>) -> Expr {
ctx.udf("to_timestamp").unwrap().call(args)
}
// varargs would have been a really nice solution as well but 🤷
- something else I haven't thought of.
Any opinions, suggestions and critiques would be welcome.
Describe alternatives you've considered
No response
Additional context
No response