Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Resolve function calls by name during planning #8447

Closed
wants to merge 11 commits into from

Conversation

edmondop
Copy link
Contributor

@edmondop edmondop commented Dec 7, 2023

Addresses #8157

Implement an AnalyzerRules that rewrites Expr::ScalarFunction replacing ScalarFunctionDefinition::Name into ScalarFunctionDefinition::BuiltIn or ScalarFunctionDefinition:ScalarUDF at plan time.

Introduces a new AnalyzerConfig to wrap a FunctionRegistry and ConfigOptions

Ignore from here, left for reference

Unclear how to pass the FunctionRegistry down to the analyzer rule.

Option 1: When creating the analyzer

Using this approach means giving up the Default implementation, as well as doing some refactoring in https://github.com/apache/arrow-datafusion/blob/c8e1c84e6b4f1292afa6f5517bc6978b55758723/datafusion/core/src/execution/context/mod.rs#L1348, since the analyzer needs a FunctionRegistry and the SessionState implements that trait

Option 2: When invoking the analyzer

Using this approach means changing the pub trait Analyzer adding a fourth parameter.

pub trait AnalyzerRule {
    /// Rewrite `plan`
    fn analyze(&self, plan: LogicalPlan, config: &ConfigOptions, registry: impl FunctionRegistry + Send + Sync) -> Result<LogicalPlan>;

    /// A human readable name for this analyzer rule
    fn name(&self) -> &str;
}

However this creates on its own a problem

the trait `AnalyzerRule` cannot be made into an object
consider moving `analyze` to another traitrustc[Click for full compiler diagnostic](rust-analyzer-diagnostics-view:/diagnostic%20message%20%5B2%5D?2#file%3A%2F%2F%2Fworkspace%2Farrow-datafusion%2Fdatafusion%2Foptimizer%2Fsrc%2Fanalyzer%2Fmod.rs)
mod.rs(55, 8): for a trait to be "object safe" it needs to allow building a vtable to allow the call to be resolvable dynamically; for more information visit <https://doc.rust-lang.org/reference/items/traits.html#object-safety>
the trait `analyzer::AnalyzerRule` cannot be made into an object
consider moving `analyze` to another traitrustc[Click for full compiler diagnostic](rust-analyzer-diagnostics-view:/diagnostic%20message%20%5B5%5D?5#file%3A%2F%2F%2Fworkspace%2Farrow-datafusion%2Fdatafusion%2Foptimizer%2Fsrc%2Fanalyzer%2Fmod.rs)
mod.rs(55, 8): for a trait to be "object safe" it needs to allow building a vtable to allow the call to be resolvable dynamically; for more information visit <https://doc.rust-lang.org/reference/items/traits.html#object-safety>
No quick fixes available

Maybe the best solution is to create an intermediate struct that contains the various HashMaps of SessionState, an pass that struct to the Analyzer, as well as the SessionState constructor? Suggestions are welcome :)

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate labels Dec 7, 2023
@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

Thanks @edmondop -- I hope to check this out in more detail tomorrow

@alamb
Copy link
Contributor

alamb commented Dec 8, 2023

Maybe the best solution is to create an intermediate struct that contains the various HashMaps of SessionState, an pass that struct to the Analyzer, as well as the SessionState constructor? Suggestions are welcome :)

Do you mean follow the model of try_optimize

/// `OptimizerRule` transforms one [`LogicalPlan`] into another which
/// computes the same results, but in a potentially more efficient
/// way. If there are no suitable transformations for the input plan,
/// the optimizer can simply return it as is.
pub trait OptimizerRule {
    /// Try and rewrite `plan` to an optimized form, returning None if the plan cannot be
    /// optimized by this rule.
    fn try_optimize(
        &self,
        plan: &LogicalPlan,
        config: &dyn OptimizerConfig,
    ) -> Result<Option<LogicalPlan>>;
...

Maybe we could do something similar for AnalayzerRules like

pub trait AnalyzerRule {
    /// Rewrite `plan`
    fn analyze(&self, plan: LogicalPlan, config: &AnalyzerConfig) -> Result<LogicalPlan>;
...
}


/// Options to control  DataFusion Analyzer Passes.
pub trait AnalyzerConfig {
    /// Return a function registry for resolving names
    fn function_registry(&self) -> &dyn FunctionRegistry;
    /// return datafusion configuration options
    fn options(&self) -> &ConfigOptions;
}

I am not sure if the compiler will be happy with that construction or not so we may have to play around with it

Let me know if you want me to give it a try

Thanks for pushing this along

cc @2010YOUY01

@edmondop edmondop changed the title DRAFT: Resolve function calls by name during planning Resolve function calls by name during planning Dec 9, 2023
@edmondop edmondop marked this pull request as ready for review December 9, 2023 19:08
@edmondop edmondop marked this pull request as draft December 9, 2023 19:36
@edmondop edmondop changed the title Resolve function calls by name during planning DRAFT: Resolve function calls by name during planning Dec 9, 2023
@edmondop
Copy link
Contributor Author

@alamb apologies, I would need some guidance also on what to do on some unit tests.

For example, this test https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1533-L1549
will never pass now, because it uses the now function, that uses call_fn which doesn't get resolved if they analyzer is not applied

How do you think we should proceed?

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

For example, this test https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1533-L1549
will never pass now, because it uses the now function, that uses call_fn which doesn't get resolved if they analyzer is not applied

I think these tests will need to be updated to resolve the expressions "now" to the actual now() function

However, I think this brings up a good point about making sure the expr_fns can be easily used with the lower level expr api (like simplification).

For any API on the SessionContext or DataFrame, expr_fn::now() and related functions will be run through the analyzer.

However, if we create the exprs directly like in expr_simplifer (and there are some examples) we have to arrange for the expressions to be resolved.

Maybe we could add a function like Expr::resolve to do this more easily?

So it could be used like

let expr = now()
  .cast(DataType::Integer) // cast now to integer
  .resolve(&registry) // recursively resolve all functions calls, like `now()` into actual function referenes

🤔

I can try and sketch up something like this later today if you want?

@edmondop
Copy link
Contributor Author

edmondop commented Dec 11, 2023 via email

@alamb
Copy link
Contributor

alamb commented Dec 12, 2023

Helping with this PR is on my list, but I am behind on other reviews and bug fixes. Sorry for the delay

@2010YOUY01
Copy link
Contributor

Thank you for taking this issue! @edmondop

Looks like Expr API is exposed in multiple places

  1. DataFrame API
  2. Expr which can be directly evaluated
  3. Constructing LogicalPlan with Expr

I found there are several issues for this rewrite to resolve approach:

  1. This can only make call_fn() backward compatible for DataFrame API, because only DataFrame API is created using context/SessionState, other APIs are created statelessly and can't access context(FunctionRegistry). The inconsistency might be confusing.
  2. Even for DataFrame API, rewrite approach seem also not easy to do: inside DataFrame it will construct LogicalPlans before analyzing, which will validate the schema by checking the function return type. We might have to resolve function names before analyzing. https://github.com/apache/arrow-datafusion/blob/7f312c8a1d82b19f03c640e4a5d7d6437b2ee835/datafusion/expr/src/logical_plan/plan.rs#L1783

Is it possible to just do ctx.call_fn() instead? I think this API change can make function Expr creation consistent, also easier to implement. @alamb

Reference: current Expr for UDFs are created like, ctx.call_fn() is a bit simpler and also can support both builtins and UDF during transition period

    let pow = df.registry().udf("pow")?;
    let expr1 = pow.call(vec![col("a"), col("b")]);

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

🤔 I played around with this PR some this afternoon.

I came to the same conclusion that we would have to make non trivial API changes to the expr_fn() APIs if we go the "must resolve string names to functions" as part of an analysis pass.

The only other thing I can think of is to change the dependency structure

Right now, the expr_fns are in the datafusion_expr crate

We could potentially move them from datafusion_expr to a new crate (perhaps datafusion_expr_fn) that could rely directly on the built in function definitions 🤔

Let me give that a whirl to see how it would look

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 17, 2024
@github-actions github-actions bot closed this Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants