add UnifySchemas() #335

loicalleyne · 2025-03-25T22:32:07Z

Rationale for this change

Implement a schema unification function similar to pyarrow.unify_schemas.

What changes are included in this PR?

func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error)

Are these changes tested?

yes

Are there any user-facing changes?

arrow/util/schemas.UnifySchemas()

singh1203

I’m diving deep to really grasp how this implementation works!

arrow/util/schemas/unify.go

zeroshade

I think converting the entire schema to a tree of nodes might actually be overkill and unnecessarily complex.

It might make more sense to do the following:

1. Create a SchemaBuilder object which contains a []arrow.Field and a map of name -> list of indexes (we can error on duplicate field names).
2. Add a MergeWith method to arrow.Field which contains the logic for how to merge one field with another.
3. Finally, create a mergeTypes function, called from the MergeWith method of arrow.Field, which decides how to merge the current type with another (possibly promoting, etc.). This is where the logic for the individual types would be handled.

This way you don't need to create an entire tree from each schema. You only need to create a SchemaBuilder, then loop over the schemas calling AddField for each top-level field of the schema. AddField would simply look up whether the field already exists, calling Field.MergeWith if it does and appending it if it doesn't.

The implementation of merging a struct type would be to just create a schema builder and do the same thing between the struct types (since a struct type is essentially the same as a schema in a lot of ways).
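A minimal sketch of the design described above, not the PR's code. It uses simplified stand-in types instead of the real arrow package: `Field` with a string `Type`, the `NewSchemaBuilder` constructor, and the single int32 -> int64 promotion rule are all assumptions made for illustration, and the duplicate-name check within one schema is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Field is a hypothetical stand-in for arrow.Field.
type Field struct {
	Name     string
	Type     string // e.g. "int32", "int64", "struct"
	Nullable bool
	Children []Field // only used when Type == "struct"
}

// SchemaBuilder accumulates fields and remembers where each name lives.
type SchemaBuilder struct {
	fields     []Field
	nameToIdx  map[string]int
	permissive bool
}

func NewSchemaBuilder(permissive bool) *SchemaBuilder {
	return &SchemaBuilder{nameToIdx: map[string]int{}, permissive: permissive}
}

// AddField appends an unseen field or merges it into the existing one.
func (b *SchemaBuilder) AddField(f Field) error {
	i, ok := b.nameToIdx[f.Name]
	if !ok {
		b.nameToIdx[f.Name] = len(b.fields)
		b.fields = append(b.fields, f)
		return nil
	}
	merged, err := b.fields[i].MergeWith(f, b.permissive)
	if err != nil {
		return fmt.Errorf("field %q: %w", f.Name, err)
	}
	b.fields[i] = merged
	return nil
}

// MergeWith merges two fields that share a name.
func (f Field) MergeWith(other Field, permissive bool) (Field, error) {
	out := Field{Name: f.Name, Nullable: f.Nullable || other.Nullable}
	if f.Type == "struct" && other.Type == "struct" {
		// A struct is merged like a schema: reuse the builder recursively.
		sb := NewSchemaBuilder(permissive)
		all := append(append([]Field{}, f.Children...), other.Children...)
		for _, c := range all {
			if err := sb.AddField(c); err != nil {
				return Field{}, err
			}
		}
		out.Type, out.Children = "struct", sb.fields
		return out, nil
	}
	t, err := mergeTypes(f.Type, other.Type, permissive)
	if err != nil {
		return Field{}, err
	}
	out.Type = t
	return out, nil
}

// mergeTypes holds the per-type rules; only int32 -> int64 is modelled here.
func mergeTypes(a, b string, permissive bool) (string, error) {
	if a == b {
		return a, nil
	}
	if permissive && ((a == "int32" && b == "int64") || (a == "int64" && b == "int32")) {
		return "int64", nil
	}
	return "", errors.New("cannot merge " + a + " with " + b)
}

func main() {
	b := NewSchemaBuilder(true)
	fields := []Field{
		{Name: "id", Type: "int32"},
		{Name: "id", Type: "int64", Nullable: true},
		{Name: "name", Type: "string"},
	}
	for _, f := range fields {
		if err := b.AddField(f); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Printf("%+v\n", b.fields)
}
```

Because struct children go through the same `AddField` path, nested structs need no separate tree representation.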

Full disclosure: I got this idea by looking at the current implementation of UnifySchemas in arrow C++ (https://github.com/apache/arrow/blob/7c18001f0d7bd97471237719702c33165858bba7/cpp/src/arrow/type.cc#L426). But I do believe the result would be cleaner, simpler, and more efficient (mostly because we wouldn't need to create an entire tree of nodes for each schema).

Thoughts?

zeroshade · 2025-03-26T15:32:22Z

arrow/util/schemas/unify.go

+	"slices"
+
+	"maps"


can we remove the extra blank line here?

zeroshade · 2025-03-26T15:33:20Z

arrow/util/schemas/unify.go

+	err          error
+}
+
+// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found.


multiple lines please instead of one single line

zeroshade · 2025-03-26T15:35:38Z

arrow/util/schemas/unify.go

+}
+
+// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found.
+func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error) {


we can avoid the len check and enforce things by changing the signature to:

func UnifySchemas(promotePermissive bool, first *arrow.Schema, schemas ...*arrow.Schema) (*arrow.Schema, error)

Then you just need to verify that none of the schemas are nil, and if len(schemas) < 1 you can just return first.
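For illustration, the suggested signature could look roughly like this. `Schema` is a hypothetical stand-in for arrow.Schema so the sketch compiles on its own, and the merge step is a placeholder (a union of field names), not the PR's unification logic.

```go
package main

import (
	"errors"
	"fmt"
)

// Schema is a hypothetical stand-in for arrow.Schema.
type Schema struct{ Fields []string }

var errNilSchema = errors.New("nil schema")

// UnifySchemas requires at least one schema at compile time via `first`,
// so no length check on the variadic slice is needed to reject empty input.
func UnifySchemas(promotePermissive bool, first *Schema, schemas ...*Schema) (*Schema, error) {
	if first == nil {
		return nil, fmt.Errorf("schema 0: %w", errNilSchema)
	}
	for i, s := range schemas {
		if s == nil {
			return nil, fmt.Errorf("schema %d: %w", i+1, errNilSchema)
		}
	}
	if len(schemas) == 0 {
		return first, nil // nothing to unify
	}

	// Placeholder merge: union of field names in first-seen order.
	seen := map[string]bool{}
	out := &Schema{}
	for _, s := range append([]*Schema{first}, schemas...) {
		for _, name := range s.Fields {
			if !seen[name] {
				seen[name] = true
				out.Fields = append(out.Fields, name)
			}
		}
	}
	return out, nil
}

func main() {
	a := &Schema{Fields: []string{"id"}}
	b := &Schema{Fields: []string{"id", "name"}}
	u, err := UnifySchemas(false, a, b)
	fmt.Println(u.Fields, err)
	_, err = UnifySchemas(false, a, nil)
	fmt.Println(err)
}
```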

zeroshade · 2025-03-26T15:37:46Z

arrow/util/schemas/unify.go

+	for _, s := range schemas[1:] {
+		u.new = newTreeFromSchema(s)
+		u.unify()


sadly we probably need to have a nil check here and then error out if nil

zeroshade · 2025-03-26T15:39:27Z

arrow/util/schemas/unify.go

+func mergeBool(a, b bool) bool {
+	return a || b
+}


do we really need this to be a function? It feels like it would be fewer keystrokes and shorter to just have the || explicit wherever we are calling this.

zeroshade · 2025-03-26T15:45:13Z

arrow/util/schemas/unify.go

+	var err error
+	if f.err != nil {
+		err = errors.Join(err, f.err)
+	}


should this just be err = f.err?
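A small standalone check of why the two forms differ slightly: errors.Join drops nil operands, so joining a nil err with f.err yields a new wrapper around f.err rather than f.err itself. The helper name `joinedVsDirect` and the `errConflict` value are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("type conflict")

// joinedVsDirect reports whether errors.Join(nil, e) is the same value as e,
// and whether errors.Is still matches e through the wrapper.
func joinedVsDirect(e error) (same bool, matches bool) {
	var err error
	err = errors.Join(err, e) // the nil operand is discarded
	return err == e, errors.Is(err, e)
}

func main() {
	same, matches := joinedVsDirect(errConflict)
	fmt.Println(same, matches) // false true
}
```

Plain assignment avoids the extra wrapper; errors.Join only pays off once err may already hold an earlier error.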

zeroshade · 2025-03-26T15:50:56Z

arrow/util/schemas/unify.go

+func (f *treeNode) assignChild(child *treeNode) {
+	f.children = append(f.children, child)
+	f.childmap[child.name] = child
+}


Arrow allows multiple fields in a schema to have the same name, so a map of string isn't sufficient unless you explicitly error to prevent overwriting / losing a child if there are multiple fields with the same name.
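One way to avoid silently losing a child is to index names to every position where they occur and reject duplicates explicitly. The sketch below works on plain field names; `indexFields` and `checkUnique` are hypothetical helpers, not part of the PR or of arrow.

```go
package main

import "fmt"

// indexFields maps each field name to every position where it occurs.
func indexFields(names []string) map[string][]int {
	idx := make(map[string][]int, len(names))
	for i, n := range names {
		idx[n] = append(idx[n], i)
	}
	return idx
}

// checkUnique returns an error describing the first duplicated field name.
func checkUnique(names []string) error {
	idx := indexFields(names)
	for _, n := range names {
		if len(idx[n]) > 1 {
			return fmt.Errorf("duplicate field name %q at indexes %v", n, idx[n])
		}
	}
	return nil
}

func main() {
	fmt.Println(checkUnique([]string{"id", "name"}))
	fmt.Println(checkUnique([]string{"id", "name", "id"}))
}
```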

singh1203 · 2025-03-28T07:32:45Z

arrow/util/schemas/unify.go

+	defer func(s *arrow.Schema) (*arrow.Schema, error) {
+		if pErr := recover(); pErr != nil {
+			return nil, fmt.Errorf("schema problem: %v", pErr)
+		}
+		return s, nil
+	}(s)


I have a question: since s is passed as an argument to the deferred function, isn't it evaluated (and captured as nil) at the moment the defer statement runs? Also, is this the correct way to pass s to defer, or should we rely on a closure instead?
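The behaviour in question can be reproduced in isolation. In the sketch below a string stands in for *arrow.Schema, and the function names `discarded` and `named` are made up. Two Go rules apply: arguments to a deferred call are evaluated when the defer statement executes, and any values the deferred function returns are discarded.

```go
package main

import "fmt"

// discarded mirrors the shape of the snippet: the deferred function's return
// values are thrown away, so the recovered panic never reaches the caller.
func discarded() (string, error) {
	var s string
	defer func(s string) (string, error) {
		if p := recover(); p != nil {
			return "", fmt.Errorf("schema problem: %v", p)
		}
		return s, nil
	}(s) // s is evaluated here, while it is still the zero value
	panic("boom")
}

// named uses a closure over named results, so recover can set the error.
func named() (s string, err error) {
	defer func() {
		if p := recover(); p != nil {
			s, err = "", fmt.Errorf("schema problem: %v", p)
		}
	}()
	s = "partial"
	panic("boom")
}

func main() {
	s, err := discarded()
	fmt.Printf("%q %v\n", s, err) // "" <nil>
	s, err = named()
	fmt.Printf("%q %v\n", s, err) // "" schema problem: boom
}
```

The first version swallows the panic and returns zero values with a nil error, which is why named results with a closure are the usual pattern.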

add UnifySchemas()

eafc2d9

loicalleyne requested a review from zeroshade as a code owner March 25, 2025 22:32

singh1203 reviewed Mar 26, 2025

fix typo in comment

ed2e9ed

zeroshade requested changes Mar 27, 2025

singh1203 reviewed Mar 28, 2025
