Skip to content

add UnifySchemas() #335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

loicalleyne
Copy link
Contributor

@loicalleyne loicalleyne commented Mar 25, 2025

Rationale for this change

Implement schema unification function similar to pyarrow.unify_schemas

What changes are included in this PR?

func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error)

Are these changes tested?

yes

Are there any user-facing changes?

arrow/util/schemas.UnifySchemas()

@loicalleyne loicalleyne requested a review from zeroshade as a code owner March 25, 2025 22:32
Copy link
Contributor

@singh1203 singh1203 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m diving deep to really grasp how this implementation works!

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think converting the entire schema to a tree of nodes might actually be overkill and unnecessarily complex.

It might make more sense to do the following:

  1. Create a SchemaBuilder object which contains a []arrow.Field and a map of name -> list of indexes (we can error on duplicate field names)
  2. Add a MergeWith method to arrow.Field which contains the logic for how to merge one field with another
  3. Finally, create a mergeTypes function that would be called from the MergeWith method of arrow.Field, which can then handle just how do you merge the current type with another (possibly promoting, etc.). This is where the logic for the individual types would be handled.

This way you don't need to create an entire tree from each schema. You only need to create a SchemaBuilder, then loop over the schemas calling AddField for each top level field of the schema. AddField would simply lookup whether or not the field exists, calling Field.MergeWith if it does or appending if it doesn't.

The implementation of merging a struct type would be to just create a schema builder and do the same thing between the struct types (since a struct type is essentially the same as a schema in a lot of ways).

Full disclosure: I got this idea by looking at the current implementation of UnifySchemas in arrow C++ (https://github.com/apache/arrow/blob/7c18001f0d7bd97471237719702c33165858bba7/cpp/src/arrow/type.cc#L426). But I do believe the result would be cleaner, simpler, and more efficient (mostly because we wouldn't need to create an entire tree of nodes for each schema).

Thoughts?

Comment on lines +22 to +24
"slices"

"maps"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the extra blank line here?

err error
}

// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiple lines please instead of one single line

}

// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found.
func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can avoid the len check and enforce things by changing the signature to:

func UnifySchemas(promotePermissive bool, first *arrow.Schema, schemas ...*arrow.Schema) (*arrow.Schema, error)

Then you just need to verify that none of the schemas are nil, and if len(schemas) < 1 you can just return first.

Comment on lines +61 to +63
for _, s := range schemas[1:] {
u.new = newTreeFromSchema(s)
u.unify()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sadly we probably need to have a nil check here and then error out if nil

Comment on lines +488 to +490
func mergeBool(a, b bool) bool {
return a || b
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this to be a function? It feels like it would be fewer keystrokes and shorter to just have the || explicit wherever we are calling this.

Comment on lines +167 to +170
var err error
if f.err != nil {
err = errors.Join(err, f.err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just be err = f.err?

Comment on lines +81 to +84
func (f *treeNode) assignChild(child *treeNode) {
f.children = append(f.children, child)
f.childmap[child.name] = child
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow allows multiple fields in a schema to have the same name, so a map of string isn't sufficient unless you explicitly error to prevent overwriting / losing a child if there are multiple fields with the same name.

Comment on lines +252 to +257
defer func(s *arrow.Schema) (*arrow.Schema, error) {
if pErr := recover(); pErr != nil {
return nil, fmt.Errorf("schema problem: %v", pErr)
}
return s, nil
}(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a doubt: Since s is passed as a parameter to the defer function, isn't it captured as nil at that moment? Also, is this the correct way to pass s to defer, or should we rely on closure instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants