-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Struct casting field order #8871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
959a210 to
808c285
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This looks good to me. Nice feature improvement.
arrow-cast/src/cast/mod.rs
Outdated
| @@ -221,12 +221,34 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool { | |||
| Decimal32(_, _) | Decimal64(_, _) | Decimal128(_, _) | Decimal256(_, _), | |||
| ) => true, | |||
| (Struct(from_fields), Struct(to_fields)) => { | |||
| from_fields.len() == to_fields.len() | |||
| && from_fields.iter().zip(to_fields.iter()).all(|(f1, f2)| { | |||
| // fast path, all field names are in the same order and same number of fields | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be on line 228, I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, good catch the order was a bit different at the start when I wrote that comment haha
808c285 to
001f740
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @brancz -- my only concern are the changes to the existing tests. Otherwise it looks good to me
| }); | ||
| } | ||
|
|
||
| // slow path, we match the fields by name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one idea that has come up in the past is to do this mapping calculation once and then use it for both can_cast_types and cast
However, this seems to be strictly better than current main (doesn't slow down existing code and allows more uses, so 👍 to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also going back and forth, but I decided that an additional allocation in the average case would be far worse in perf cost than comparisons. We're heavily profiling all these code paths, if it's significant, we will come back and improve the perf! 😄
| let struct_array = StructArray::from(vec![ | ||
| ( | ||
| Arc::new(Field::new("b", DataType::Boolean, false)), | ||
| Arc::new(Field::new("a", DataType::Boolean, false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out these tests were actually wrong to begin with, have a look at the names of the columns, how can a/b be cast to b/c? They only ever worked by accident, and now that we test whether they match, they needed to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would other people find this a regression (aka that they expect struct fields to be treated in order, rather than by name) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s possible someone had it incorrect with it accidentally working but I don’t think that should stop fixing what’s clearly a bug. The previous behavior can actually cause very hidden unexpected behavior when it does accidentally work but field names mismatch or have a different order. I can’t see a valid use case for the incorrect previous behavior and all valid behaviors can be represented still and on total mismatches a user will even get a proper error now instead of potentially silently continuing.
| let struct_array = StructArray::from(vec![ | ||
| ( | ||
| Arc::new(Field::new("b", DataType::Boolean, false)), | ||
| Arc::new(Field::new("a", DataType::Boolean, false)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise why was this test changed?
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @brancz
|
Thanks again @brancz |
Which issue does this PR close?
Closes #8870.
What changes are included in this PR?
Check if field order in from/to casting matches, and if not, attempt to find the fields by name.
Are these changes tested?
Added unit tests (that previously failed, so I separated them in a commit).
Are there any user-facing changes?
No, it's strictly additive functionality.
@alamb @vegarsti