-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[substrait] improve handling for user-defined functionality #13318
Comments
take |
What I'm hoping to drive with the following wall of text is:
Current Handling StateThe following documents the current state of handling for the various extension points and discusses some of the issues with them. Extension RelationsSubstrait has 3 relation types: which consume 0, 1 or more than 1 inputs respectively alongside a Protobuf Any detail message. The current API to decode these extensions re-uses the following method from the SerializerRegistry trait: fn deserialize_logical_plan(
&self,
name: &str,
bytes: &[u8],
) -> Result<Arc<dyn UserDefinedLogicalNode>>; The use of this method assumes that the message UnnestDetail {
repeated uint32 col_refs = 1;
} there isn't enough information in the message alone to produce a valid A simpler API for this would be a hook like: fn handle_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan> to allow users to fully own the deserialization logic. Extension TableThe Substrait consumer does not provide any hooks to handle extension tables. An API for this could be something like: fn handle_extension_table(e: &ExtensionTable) -> Result<LogicalPlan> This may be more flexible that is actually needed. User-Defined TypesThe Substrait consumer does not provide any hooks to handle user-defined types. An API for this could be something like. fn handle_user_defined_type(ud: &UserDefined) -> Result<DataType> User-Defined FunctionsThe Substrait consumer decodes Substrait functions based on name of the function in Substrait. The following is part of the handling code: if let Some(func) = ctx.state().scalar_functions().get(fn_name) {
Ok(Expr::ScalarFunction(expr::ScalarFunction::new_udf(
func.to_owned(),
args,
))) To map a Substrait function to a DataFusion function, users can bind the DataFusion implementation to the same name in the session context. This works reasonable well, but the fact the function resolution ignores the extension file can cause issues. For example, it's perfectly valid to have 2 function with the same name exist in different extensions. The spec defines an integer division function: # functions_arithmetic.yaml
%YAML 1.2
---
scalar_functions:
-
name: "divide"
impls:
- args:
- name: x
value: i64
- name: y
value: i64
return: i64 but we may wish to provide a variant that returns a float as in MySQL: # functions_mysql.yaml
%YAML 1.2
---
scalar_functions:
-
name: "divide"
- args:
- name: x
value: i64
- name: y
value: i64
return: fp64 However, these cannot be distinguished by just the name. Improving the API for function resolution should probably be its own issue as it entails a fair bit of work. Advanced ExtensionsDataFusion currently ignores Advanced Extensions entirely. Advanced Extensions come in two forms:
In my opinion, these are the hardest types of extensions to build around because they contain arbitrary metadata associated with existing relations. This may be applied/needed before the conversion, after the conversion or during the conversion. Trying to account for all possible uses of this metadata is quite tricky. In practice, what has worked well for the Java library is to make the code for converting the various components of a relation reusable so that users can fully customize the conversion of the relation as they need. Consumer API DesignThe current consumer API starts by invoking the pub async fn from_substrait_plan(
ctx: &SessionContext,
plan: &Plan,
) -> Result<LogicalPlan> ... function and then recursively invokes other utility functions like pub async fn from_substrait_rel(
ctx: &SessionContext,
rel: &Rel,
extensions: &Extensions,
) -> Result<LogicalPlan> ...
pub async fn from_substrait_rex(
ctx: &SessionContext,
e: &Expression,
input_schema: &DFSchema,
extensions: &Extensions,
) -> Result<Expr>
pub async fn from_substrait_sorts(
ctx: &SessionContext,
substrait_sorts: &Vec<SortField>,
input_schema: &DFSchema,
extensions: &Extensions,
) -> Result<Vec<Sort>> ... This API threads a SessionContext and Extension context through all the calls, even though they are only needed in a handful of functions. We could potentially add more parameters for the handlers, or even a struct of handlers to pass through, but I think it would be better to refactor the consumer code to be associated with a struct and associate the handlers with that instead. A design I have though about and experimented with is something like: trait SubstraitConsumer {
fn get_context() -> SessionContext;
fn get_extensions() -> Extensions;
async fn from_substrait_rel(rel: &Rel) -> Result<LogicalPlan> {
<default impl>
}
async fn from_substrait_rex(
e: &Expression,
input_schema: &DFSchema
) -> Result<LogicalPlan> {
<default impl>
}
async fn from_substrait_sorts(
ctx: &SessionContext,
substrait_sorts: &Vec<SortField>,
input_schema: &DFSchema,
extensions: &Extensions,
) -> Result<Vec<Sort>> {
<default impl>
}
...
// user extension
async fn from_extension_single_rel(e: &ExtensionSingleRel) -> Result<LogicalPlan> {
<default impl which produces an error>
}
async fn from_user_defined_type(ud: &UserDefined) -> Result<DataType> {
<default impl which produces an error>
}
...
} The default implementations would be sufficient to decode plans with no user-defined extensions, but anyone using custom extensions, types, etc would need to provide their handlings. I'm still exploring other alternatives, but thought it would be worthwhile to start a discussion around this and gather peoples thoughts and feedback. API Breakage ConcernsAs part of this work, I'm expecting a certain amount of API breakage as I don't expect to be able we'll be able to nail the design of this in one go. What we can do though is limit the amount of API churn for folks who only care about standard Substrait. For example pub async fn from_substrait_plan(
ctx: &SessionContext,
plan: &Plan,
) -> Result<LogicalPlan> ... can still be used as the entry point for most users. We can minimize breakage for these users while we iterate and improve the consumer. Folks who want to use their own extensions will likely experience more breakage as we iterate, but at the moment they can't use them at all. |
cc: @westonpace @tokoko @Blizzara As users of the Substrait consumer I'm interested to hear if any of you have thoughts, concerns or opinions about this before I get too far along with anything. |
I'm broadly very much in agreement and excited about this! It's been on the back of my mind that we should provide more extendability, but haven't made any concrete progress towards it myself. I'll review the idea in detail next week! |
Is your feature request related to a problem or challenge?
Substrait has a number of mechanisms that allow users to define custom functionality. These include:
The current Substrait consumer has limited support for handling this custom functionality.
Describe the solution you'd like
As a user I would like to be able to re-use most of the DataFusion Substrait consumer code, but also be able to configure and/or provide handlers for our custom functionality.
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: