feat: cast (Large)List to FixedSizeList #5081
Perhaps we could provide these on the arrays themselves and then use this within the cast kernel, much like we do for the reverse transformation - https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#impl-From%3CFixedSizeListArray%3E-for-GenericListArray%3COffsetSize%3E
@tustvold I'm actually not sure how to satisfy the One option is to change the implementation so that if Are there other cases of types where
I can look at this. The reason I didn't put it there was I didn't think I could bring the cast kernel into that file, since we want to cast the child values. But maybe that's okay? Or I can refactor somehow to do the child cast in
For the string parsing ones, I would expect that when safe is true the FixedSizeListArray is filled with a null value at the errored index. This is consistent with how it behaves for other casts. The nullability concern is valid, and is why DataFusion has both CastExpr and TryCastExpr. Tbh I'm not a fan of the safe option, but it is important for Spark compatibility.
Aah forgot about that, perhaps not worth it then
Ha I always get confused as to whether "safe" means "I will silently create nulls" or "I will raise an error if the cast cannot be done". PyArrow has a somewhat opposite definition of safe.
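To make the two meanings of "safe" in this thread concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs API: the function `cast_strings_to_i32` and its `safe` parameter are hypothetical stand-ins that only mirror the semantics described above, where safe = true turns a failed slot into a null and safe = false fails the whole cast.

```rust
/// Hypothetical helper modelling the two behaviours of a `safe` flag.
/// This is NOT the arrow-rs API; it only mirrors the semantics discussed above.
fn cast_strings_to_i32(values: &[&str], safe: bool) -> Result<Vec<Option<i32>>, String> {
    let mut out = Vec::with_capacity(values.len());
    for (idx, s) in values.iter().enumerate() {
        match s.parse::<i32>() {
            Ok(v) => out.push(Some(v)),
            // safe = true: the slot that failed to parse silently becomes null.
            Err(_) if safe => out.push(None),
            // safe = false: the whole cast is rejected with an error.
            Err(e) => return Err(format!("cannot cast {s:?} at index {idx}: {e}")),
        }
    }
    Ok(out)
}
```

With the input `["1", "x", "3"]`, the safe variant yields a null in the middle slot, while the non-safe variant returns an error.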
arrow-cast/src/cast.rs
Outdated
/// * List to FixedSizeList: the underlying data type is cast. If the list size is different
/// than the FixedSizeList size and safe casting is requested, then lists are
/// truncated or filled with some value to achieve the requested size. If the output
/// field is nullable, the fill value is null, otherwise it is the first value.
I don't love this behavior, but I'm not sure of a better way in the current framework. Personally, I don't think I ever care to use safe=True here. But I'd also like it to not be too shocking to those that do use it.
We allow nulls in non-nullable children provided they are masked by the parent's null mask.
I'm not sure about truncating, I'll think on it.
Ah, I hadn't thought about having the list itself just make anything improperly sized null. That seems like it would be reasonable behavior. We can ignore truncation, I don't think we want that.
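The behavior settled on here (an improperly sized list becomes a null list when safe, or an error otherwise, with no truncation) can be sketched at the logical level as follows. This is a simplified model in plain Rust, not the arrow-rs kernel: `cast_lists_to_fixed` is a hypothetical name, and `Option<Vec<i32>>` stands in for a nullable list slot.

```rust
/// Hypothetical logical model: `None` is a null list, `Some(items)` a valid one.
/// Not the arrow-rs implementation; it only illustrates the agreed semantics.
fn cast_lists_to_fixed(
    lists: &[Option<Vec<i32>>],
    size: usize,
    safe: bool,
) -> Result<Vec<Option<Vec<i32>>>, String> {
    lists
        .iter()
        .enumerate()
        .map(|(idx, list)| match list {
            // Correctly sized lists pass through unchanged.
            Some(items) if items.len() == size => Ok(Some(items.clone())),
            // Null lists stay null regardless of `safe`.
            None => Ok(None),
            // Wrong size with safe = true: the whole list becomes null (no truncation).
            Some(_) if safe => Ok(None),
            // Wrong size with safe = false: the cast fails.
            Some(items) => Err(format!(
                "list at index {idx} has {} elements, expected {size}",
                items.len()
            )),
        })
        .collect()
}
```

Because the nullness lives on the list slot itself, the child field does not need to be nullable for this to work, which matches the point about child nulls being masked by the parent's null mask.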
I plan to have a play with this today to see if I can't come up with something a little simpler
I've pushed a PR that uses MutableArrayData to avoid doing multiple passes of the array, I think it should be faster and I find it a bit easier to follow. PTAL and let me know what you think
}
mutable.extend_nulls(size as _);
nulls.as_mut().unwrap().set_bit(idx, false);
last_pos = end_pos
I think this works, but I found the logic of last_pos confusing. I'd like to add a comment about how it pads:
When we detect a list doesn't start at the correct point, we use mutable.extend() to pad the previous list. last_pos tracks the ending position of the last incorrectly sized list.
However, I'm still confused on where the truncation of too long values happens. Where is that?
It doesn't; it replaces it with null or errors. Why do you want it to truncate?
I don't mean truncate logically. I mean that if a section of values has list_size + 1 elements, won't that shift all the subsequent values to the right by 1? I would think that to solve this we need to make sure we truncate / shift values over to the left as needed.
If a section of values has list_size + 1 elements, we would add nulls, and then set last_pos to the end of that slice. I will add a commit explicitly testing this and some more comments.
Sorry, I see how this works now. I misunderstood how MutableArrayData works, but re-reading the docs this makes sense now.
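For readers following the last_pos discussion, here is a rough sketch of the single-pass idea in plain Rust. It is a simplified stand-in, not the code in this PR: `cast_offsets_to_fixed` is a hypothetical name, plain `Vec`s replace MutableArrayData and the null buffer, and the sketch always behaves as if safe = true. Copying of correctly sized lists is deferred; when a wrongly sized list is found, the pending run of values is copied, `size` nulls are appended, and last_pos jumps to the end of the bad slice so its extra (or missing) elements never shift later lists.

```rust
/// Simplified model of the single-pass approach. `values` and `offsets` stand in
/// for a list array's child values and offsets; all names here are hypothetical.
/// Returns the output child values and one validity flag per list.
fn cast_offsets_to_fixed(
    values: &[i32],
    offsets: &[usize],
    size: usize,
) -> (Vec<Option<i32>>, Vec<bool>) {
    let mut out: Vec<Option<i32>> =
        Vec::with_capacity(offsets.len().saturating_sub(1) * size);
    let mut valid = Vec::new();
    // Start of the run of values that has not been copied to `out` yet.
    let mut last_pos = offsets.first().copied().unwrap_or(0);

    for w in offsets.windows(2) {
        let (start, end) = (w[0], w[1]);
        if end - start == size {
            // Correct size: defer the copy, it is part of the pending run.
            valid.push(true);
            continue;
        }
        // Copy the pending run of correctly sized lists in one go.
        out.extend(values[last_pos..start].iter().copied().map(Some));
        // Replace the wrongly sized list with `size` nulls and mark it null.
        out.extend(std::iter::repeat(None).take(size));
        valid.push(false);
        // Skip past every value of the bad slice, however long it was.
        last_pos = end;
    }
    // Copy whatever run remains after the last wrongly sized list.
    let end = offsets.last().copied().unwrap_or(0);
    out.extend(values[last_pos..end].iter().copied().map(Some));
    (out, valid)
}
```

In the case raised above, a middle list with list_size + 1 elements contributes exactly `size` null slots to the output, and the following list still starts at the right position.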
I can't approve my own PR, but this LGTM. Thanks for your help @tustvold.
Until you look under the hood at least 😅, it is probably not how I would build such a construct today, but it is useful for sure.
Which issue does this PR close?
Closes #4728
Rationale for this change
Adds cast from list to FixedSizeList. I intend to use this in DataFusion, where I'd like to allow users to write array literals in SQL (which default to List) and have the planner cast them to FixedSizeList when appropriate.
What changes are included in this PR?
cast_with_options() now supports List to FixedSizeList.
Are there any user-facing changes?
This adds a new feature. A suite of tests is added. Let me know if you think there are important cases missing.