-
Notifications
You must be signed in to change notification settings - Fork 34
add UnifySchemas() #335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add UnifySchemas() #335
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m diving deep to really grasp how this implementation works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think converting the entire schema to a tree of nodes might actually be overkill and unnecessarily complex.
It might make more sense to do the following:
- Create a SchemaBuilder object which contains a
[]arrow.Field
and a map of name -> list of indexes (we can error on duplicate field names) - Add a
MergeWith
method toarrow.Field
which contains the logic for how to merge one field with another - Finally, create a
mergeTypes
function that would be called from theMergeWith
method ofarrow.Field
, which can then handle just how do you merge the current type with another (possibly promoting, etc.). This is where the logic for the individual types would be handled.
This way you don't need to create an entire tree from each schema. You only need to create a SchemaBuilder, then loop over the schemas calling AddField
for each top level field of the schema. AddField would simply lookup whether or not the field exists, calling Field.MergeWith
if it does or appending if it doesn't.
The implementation of merging a struct type would be to just create a schema builder and do the same thing between the struct types (since a struct type is essentially the same as a schema in a lot of ways).
Full disclosure: I got this idea by looking at the current implementation of UnifySchemas in arrow C++ (https://github.com/apache/arrow/blob/7c18001f0d7bd97471237719702c33165858bba7/cpp/src/arrow/type.cc#L426). But I do believe the result would be cleaner, simpler, and more efficient (mostly because we wouldn't need to create an entire tree of nodes for each schema).
Thoughts?
"slices" | ||
|
||
"maps" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove the extra blank line here?
err error | ||
} | ||
|
||
// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiple lines please instead of one single line
} | ||
|
||
// UnifySchemas unifies multiple schemas into a single schema. If promotePermissive is true, the unification process will promote integer types to larger integer types, integer types to floating-point types, STRING to LARGE_STRING, LIST to LARGE_LIST and LIST_VIEW to LARGE_LIST_VIEW. If promotePermissive is false, the unification process will not allow type conversion and will return an error if a type conflict is found. | ||
func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can avoid the len
check and enforce things by changing the signature to:
func UnifySchemas(promotePermissive bool, first *arrow.Schema, schemas ...*arrow.Schema) (*arrow.Schema, error)
Then you just need to verify that none of the schemas are nil
, and if len(schemas)
< 1 you can just return first
.
for _, s := range schemas[1:] { | ||
u.new = newTreeFromSchema(s) | ||
u.unify() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sadly we probably need to have a nil
check here and then error out if nil
func mergeBool(a, b bool) bool { | ||
return a || b | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need this to be a function? It feels like it would be fewer keystrokes and shorter to just have the ||
explicit wherever we are calling this.
var err error | ||
if f.err != nil { | ||
err = errors.Join(err, f.err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this just be err = f.err
?
func (f *treeNode) assignChild(child *treeNode) { | ||
f.children = append(f.children, child) | ||
f.childmap[child.name] = child | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow allows multiple fields in a schema to have the same name, so a map of string isn't sufficient unless you explicitly error to prevent overwriting / losing a child if there are multiple fields with the same name.
defer func(s *arrow.Schema) (*arrow.Schema, error) { | ||
if pErr := recover(); pErr != nil { | ||
return nil, fmt.Errorf("schema problem: %v", pErr) | ||
} | ||
return s, nil | ||
}(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a doubt: Since s
is passed as a parameter to the defer
function, isn't it captured as nil
at that moment? Also, is this the correct way to pass s
to defer
, or should we rely on closure instead?
Rationale for this change
Implement schema unification function similar to pyarrow.unify_schemas
What changes are included in this PR?
func UnifySchemas(promotePermissive bool, schemas ...*arrow.Schema) (*arrow.Schema, error)
Are these changes tested?
yes
Are there any user-facing changes?
arrow/util/schemas.UnifySchemas()