Skip to content

Feedback request for providing configurable UDF functions #10744

@Omega359

Description

@Omega359

Is your feature request related to a problem or challenge?

During work on adding a 'safe' mode to to_timestamp and to_date UDF functions I've come across an issue that I would like feedback on before proceeding.

The feature

Currently for timestamp and date parsing if a source string cannot be parsed using any of the provided chrono formats datafusion will return an error. This is normal for a database-type solution however it is not ideal for a system that is parsing billions of dates in batches - some of which are human entered. Systems such as Spark default to a null value for anything that cannot be parsed and this feature enables a mode ('safe' to mirror the name and same behaviour as CastOptions) that allows the to_timestamp* and to_date UDFs to have the same behaviour (return null on error).

The problem

Since UDF functions have no context provided and there isn't a way I know of to statically get access to config to add the above mentioned functionality I resorted to using a new constructor function to allow the UDF to switch behaviour:

#[derive(Debug)]
pub struct ToTimestampFunc {
    signature: Signature,
    /// how to handle cast or parsing failures, either return NULL (safe=true) or return ERR (safe=false)
    pub safe: bool,
}

impl ToTimestampFunc {
    pub fn new() -> Self {
        Self {
            signature: Signature::variadic_any(Volatility::Immutable),
            safe: false,
        }
    }

    pub fn new_with_safe(safe: bool) -> Self {
        Self {
            signature: Signature::variadic_any(Volatility::Immutable),
            safe,
        }
    }
}

To use the alternative 'safe' mode for these functions is as simple as

let to_timestamp = ToTimestampFunc::new_with_safe(true);
session_context.register_udf(ScalarUDF::new_from_impl(to_timestamp));

Unfortunately this only affect sql queries - any calls to the to_timestamp(args: Vec) function will not use the new definition as registered in the function registry. This is because that function and every other function like it use a static singleton instance that only uses a ::new() call to initial it and there is no way that I can see to replace that instance.

Describe the solution you'd like

I see a few possible solutions to this:

  • Acknowledge the difference in behaviour in documentation and recommend using ctx.udf("to_timestamp").unwrap().call(args) instead of the to_timestamp() function anytime 'safe' mode is required. This is less than ideal imho as it can lead to confusion and unintuitive behavior.
  • Provide a mechanism (function argument to invoke perhaps) to provide a SessionContext to all UDF's so that they can adjust behaviour based on configuration at call time.
  • Update the to_timestamp() function and all similar functions to have alternatives that do not use the statically defined version of the UDF but rather create it on-demand or use the definition saved in the session state. As examples:
    // the existing function    
    pub fn to_timestamp(args: Vec<Expr>) -> Expr {
        super::to_timestamp().call(args)
    }

    pub fn to_timestamp_safe(args: Vec<Expr>) -> Expr {
        ScalarUDF::new_from_impl(ToTimestampFunc::new_with_safe(true)).call(args)
    }

    // or
    pub fn to_timestamp_from_context(ctx &Arc<SessionContext>, args: Vec<Expr>) -> Expr {
        ctx.udf("to_timestamp").unwrap().call(args)
    }

   // varargs would have been a really nice solution as well but 🤷 
  • something else I haven't thought of.

Any opinions, suggestions and critiques would be welcome.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions