Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DfSchema wrap SchemaRef #4680

Closed
tustvold opened this issue Dec 20, 2022 · 7 comments · Fixed by #9595
Closed

Make DfSchema wrap SchemaRef #4680

tustvold opened this issue Dec 20, 2022 · 7 comments · Fixed by #9595
Labels
enhancement New feature or request

Comments

@tustvold
Copy link
Contributor

tustvold commented Dec 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently DataFusion has two ways to represent a schema, Schema and DFSchema. The former is the representation used by arrow, and in most components. DFSchema appears to "enhance" an arrow schema with the notion of a qualifier.

I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations

Describe the solution you'd like

I would like to change DfSchema to be

struct DfSchema {
    schema: SchemaRef,
    fields: Vec<DfFieldMetadata>
}

struct DfFieldMetadata {
    qualifier: Option<String>,
}

We could then make DfSchema automatically deref to SchemaRef, or at the very least implement AsRef<SchemaRef>, avoiding a lot of code that ends up looking like

let schema: Schema = self.plan.schema().as_ref().into();
Arc::new(schema)

Components wishing to combine the information can easily zip the two together, we could even assist this by adding

struct DfField<'a> {
    field: &'a Field,
    metadata: &'a DfFieldMetadata
}

impl DfSchema {

    fn df_fields() -> impl Iterator<Item=DfField<'_>> + '_ {
        self.arrow_schema.fields().iter().zip(&self.fields).map(|(field, metadata)| DfField { field, metadata })
    }
}

Describe alternatives you've considered

We could not do this

Additional context
Add any other context or screenshots about the feature request here.

@tustvold tustvold added the enhancement New feature or request label Dec 20, 2022
@alamb
Copy link
Contributor

alamb commented Dec 20, 2022

I'm not entirely sure of the history of this split, but to the uninitiated the split is confusing and frustrating. It also results in a non-trivial amount of schema munging logic to convert to/from the relevant representations

I wrote the original DFSchema wrapping logic back in the day. If it can be improved that would be great.

The key design goal is that DataFusion should always use a DFSchema (aka a qualified name) so there should not be an implicit conversion from SchemaRef --> (implicitly unqualified) DFSchema to avoid subtle bugs where the qualifiers are stripped off.

Going the other way, DFSchema --> Schema is 👍

cc @Dandandan @jackwener @mingmwang @andygrove in case you have additional thoughts

@tustvold
Copy link
Contributor Author

This mostly works but runs into issues with the handling of schemas in datafusion-expr. In particular they have a fairly hard-coded assumption that they can pass around DFField, concatenate them and munge them back into a DFSchema. It would be possible to workaround this, but it would be a fairly non-trivial amount of code-churn.

@comphead
Copy link
Contributor

Attaching link to DFSchema struct for convenience https://github.com/apache/arrow-datafusion/blob/main/datafusion/common/src/dfschema.rs#L108

@matthewmturner
Copy link
Contributor

Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.

CC @alamb @tustvold @comphead @avantgardnerio

@alamb
Copy link
Contributor

alamb commented Jan 12, 2024

Ive been doing some work on planning performance and in order to lay a good foundation for the next steps I think this is necessary. So my plan is to start working on this shortly.

CC @alamb @tustvold @comphead @avantgardnerio

Thank you @matthewmturner 🙏

BTW I started hacking on some version of what this might look like in #7944 but I haven't had a chance to finish it

@matthewmturner
Copy link
Contributor

@alamb yes i saw, thanks. havent decided yet whether i will pick up on that PR or create a new one.

@alamb
Copy link
Contributor

alamb commented Jan 12, 2024

Feel free to do anyting you want with the code / PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants