Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[substrait] improve handling for user-defined functionality #13318

Open
vbarua opened this issue Nov 8, 2024 · 4 comments
Open

[substrait] improve handling for user-defined functionality #13318

vbarua opened this issue Nov 8, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@vbarua
Copy link
Contributor

vbarua commented Nov 8, 2024

Is your feature request related to a problem or challenge?

Substrait has a number of mechanisms that allow users to define custom functionality. These include:

The current Substrait consumer has limited support for handling this custom functionality.

Describe the solution you'd like

As a user I would like to be able to re-use most of the DataFusion Substrait consumer code, but also be able to configure and/or provide handlers for our custom functionality.

Describe alternatives you've considered

No response

Additional context

No response

@vbarua vbarua added the enhancement New feature or request label Nov 8, 2024
@vbarua
Copy link
Contributor Author

vbarua commented Nov 8, 2024

take

@vbarua
Copy link
Contributor Author

vbarua commented Nov 8, 2024

What I'm hoping to drive with the following wall of text is:

  1. Agreement that this is worth improving.
  2. Feedback and suggestions for possible approaches.

Current Handling State

The following documents the current state of handling for the various extension points and discusses some of the issues with them.

Extension Relations

Substrait has 3 relation types:

which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf Any detail message.

The current API to decode these extensions re-uses the following method from the SerializerRegistry trait:

fn deserialize_logical_plan(
    &self,
    name: &str,
    bytes: &[u8],
) -> Result<Arc<dyn UserDefinedLogicalNode>>;

The use of this method assumes that the detail message maps directly to a UserDefineLogicalNode, which isn't always the case. Consider the following message intended for use with ExtensionSingleRel

message UnnestDetail {
  repeated uint32 col_refs = 1;
}

there isn't enough information in the message alone to produce a valid UserDefinedLogicalNode, however if we had access to the input node it becomes relatively straightforward.

A simpler API for this would be a hook like:

fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan>

to allow users to fully own the deserialization logic.

Extension Table

The Substrait consumer does not provide any hooks to handle extension tables. An API for this could be something like:

fn handle_extension_table(e: &ExtensionTable) -> Result<LogicalPlan>

This may be more flexible that is actually needed.

User-Defined Types

The Substrait consumer does not provide any hooks to handle user-defined types. An API for this could be something like.

fn handle_user_defined_type(ud: &UserDefined) -> Result<DataType>

User-Defined Functions

The Substrait consumer decodes Substrait functions based on name of the function in Substrait. The following is part of the handling code:

    if let Some(func) = ctx.state().scalar_functions().get(fn_name) {
        Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf(
            func.to_owned(),
            args,
        )))

To map a Substrait function to a DataFusion function, users can bind the DataFusion implementation to the same name in the session context. This works reasonable well, but the fact the function resolution ignores the extension file can cause issues.

For example, it's perfectly valid to have 2 function with the same name exist in different extensions. The spec defines an integer division function:

# functions_arithmetic.yaml
%YAML 1.2
---
scalar_functions:
  -
    name: "divide"
    impls:
      - args:
          - name: x
            value: i64
          - name: y
            value: i64
        return: i64

but we may wish to provide a variant that returns a float as in MySQL:

# functions_mysql.yaml
%YAML 1.2
---
scalar_functions:
  -
    name: "divide"
      - args:
          - name: x
            value: i64
          - name: y
            value: i64
        return: fp64

However, these cannot be distinguished by just the name.

Improving the API for function resolution should probably be its own issue as it entails a fair bit of work.

Advanced Extensions

DataFusion currently ignores Advanced Extensions entirely. Advanced Extensions come in two forms:

  • optimizations: additional metadata which does not influence semantics and are ignorable.
  • enhancements: additional metadata which does influence semantics and cannot be ignored.

In my opinion, these are the hardest types of extensions to build around because they contain arbitrary metadata associated with existing relations. This may be applied/needed before the conversion, after the conversion or during the conversion. Trying to account for all possible uses of this metadata is quite tricky.

In practice, what has worked well for the Java library is to make the code for converting the various components of a relation reusable so that users can fully customize the conversion of the relation as they need.

Consumer API Design

The current consumer API starts by invoking the

pub async fn from_substrait_plan(
    ctx: &SessionContext,
    plan: &Plan,
) -> Result<LogicalPlan> ...

function and then recursively invokes other utility functions like

pub async fn from_substrait_rel(
    ctx: &SessionContext,
    rel: &Rel,
    extensions: &Extensions,
) -> Result<LogicalPlan> ...

pub async fn from_substrait_rex(
    ctx: &SessionContext,
    e: &Expression,
    input_schema: &DFSchema,
    extensions: &Extensions,
) -> Result<Expr>

pub async fn from_substrait_sorts(
    ctx: &SessionContext,
    substrait_sorts: &Vec<SortField>,
    input_schema: &DFSchema,
    extensions: &Extensions,
) -> Result<Vec<Sort>> ...

This API threads a SessionContext and Extension context through all the calls, even though they are only needed in a handful of functions.

We could potentially add more parameters for the handlers, or even a struct of handlers to pass through, but I think it would be better to refactor the consumer code to be associated with a struct and associate the handlers with that instead.

A design I have though about and experimented with is something like:

trait SubstraitConsumer {
    fn get_context() -> SessionContext;
    fn get_extensions() -> Extensions;
    
    async fn from_substrait_rel(rel: &Rel) -> Result<LogicalPlan> {
        <default impl>
    }

    async fn from_substrait_rex(
        e: &Expression,
        input_schema: &DFSchema
    ) -> Result<LogicalPlan> {
        <default impl>
    }

    async fn from_substrait_sorts(
        ctx: &SessionContext,
        substrait_sorts: &Vec<SortField>,
        input_schema: &DFSchema,
        extensions: &Extensions,
    ) -> Result<Vec<Sort>> {
        <default impl>
    }
    ...

    // user extension
    async fn from_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan> {
        <default impl which produces an error>
    }
    async fn from_user_defined_type(ud: &UserDefined) -> Result<DataType> {
        <default impl which produces an error>
    }
    ...
}

The default implementations would be sufficient to decode plans with no user-defined extensions, but anyone using custom extensions, types, etc would need to provide their handlings.

I'm still exploring other alternatives, but thought it would be worthwhile to start a discussion around this and gather peoples thoughts and feedback.

API Breakage Concerns

As part of this work, I'm expecting a certain amount of API breakage as I don't expect to be able we'll be able to nail the design of this in one go.

What we can do though is limit the amount of API churn for folks who only care about standard Substrait. For example

pub async fn from_substrait_plan(
    ctx: &SessionContext,
    plan: &Plan,
) -> Result<LogicalPlan> ...

can still be used as the entry point for most users.

We can minimize breakage for these users while we iterate and improve the consumer. Folks who want to use their own extensions will likely experience more breakage as we iterate, but at the moment they can't use them at all.

@vbarua
Copy link
Contributor Author

vbarua commented Nov 8, 2024

cc: @westonpace @tokoko @Blizzara

As users of the Substrait consumer I'm interested to hear if any of you have thoughts, concerns or opinions about this before I get too far along with anything.

@Blizzara
Copy link
Contributor

Blizzara commented Nov 9, 2024

I'm broadly very much in agreement and excited about this! It's been on the back of my mind that we should provide more extendability, but haven't made any concrete progress towards it myself. I'll review the idea in detail next week!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants