GH-20512: [Python] Numpy conversion doesn't account for ListArray offset #15210
Merged: assignUser merged 7 commits into apache:master from wjones127:ARROW-18400-nested-quadratic on Jan 17, 2023
Commits
449af74 test: create tests reproducing underlying issue (wjones127)
7844fa9 fix: fix numpy list conversion for slices of arrays (wjones127)
27b955f test: remove unnecessary test (wjones127)
970389f doc: add clearer description of BaseListArray.values() (wjones127)
e632099 test: fix numpy nested array creation (wjones127)
af022e1 test: make sure to test null values (wjones127)
7c201d7 add extra test (jorisvandenbossche)
Should we make this expected behavior or not?
I just commented on the JIRA as well, but in the current design we shouldn't change that. It is `Flatten()` that already implements this, and the `values` being unsliced ensures that they still match the `offsets` of a sliced ListArray.
It is certainly confusing, though. I wonder if we should give it a scarier name like "raw_values".
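To make the point about unsliced values concrete, here is a minimal sketch in plain Python. It is a toy model of the variable-size list layout, not pyarrow or the Arrow C++ code; the helper names (`slice_offsets`, `to_lists`, `flatten`) are hypothetical and only mirror the roles of `offsets`, `values` and `Flatten()` discussed above.

```python
# Toy model of a variable-size list array (illustrative only, not pyarrow).
values = [1, 2, 3, 4, 5, 6]   # child values, shared unchanged by every slice
offsets = [0, 2, 5, 6]        # list i spans values[offsets[i]:offsets[i + 1]]


def slice_offsets(offsets, start, length):
    """Offsets seen by a slice covering lists [start, start + length)."""
    return offsets[start:start + length + 1]


def to_lists(values, offsets):
    """Rebuild the lists by indexing `values` with the given offsets."""
    return [values[a:b] for a, b in zip(offsets, offsets[1:])]


def flatten(values, offsets):
    """Keep only the values that the visible lists actually reference."""
    return values[offsets[0]:offsets[-1]]


sliced = slice_offsets(offsets, start=1, length=2)

# The slice's offsets stay absolute, so they still index the unsliced values.
lists_from_raw = to_lists(values, sliced)

# Flattened values start at the slice, so the absolute offsets no longer fit...
flat = flatten(values, sliced)
mismatched = to_lists(flat, sliced)

# ...unless the offsets are rebased to start at zero first.
rebased = [o - sliced[0] for o in sliced]
lists_from_flat = to_lists(flat, rebased)

print(lists_from_raw, mismatched, lists_from_flat)
```

In this model, either pairing works (raw values with absolute offsets, or flattened values with rebased offsets), but mixing the two silently produces wrong lists.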
+1 on renaming to `raw_values`. Or at the very least, we should modify the doc comment. Right now it's not clear that it doesn't account for offset and length.
The problem with Flatten, though, is that it removes the null values from the values array, but in this case we want them.
Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?
This corner case is covered by existing tests?
I should add a test.
And you might be right about flatten: it does include the nulls in the values (not the list). Earlier, while debugging, I had somehow gotten the impression that it doesn't.
Yes, but your example uses an actual null inside a list, not a "null list". For example, here the first null is not in the flattened output:
And I suppose behind that null can be any data (the default when constructed above is that there is no data behind, so the offset doesn't increment for that list element):
It's a bit tricky to construct manually, but something like:
So I am not sure whether, for converting to numpy, you actually need the flattened values (with the values behind nulls removed) or not.
The offsets still assume those unused values are present, so it was maybe a good call after all to think that the "flattened" values (with values behind nulls removed) are not the right thing to use here.
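The "null list with data behind it" case can be sketched in plain Python as follows. This is not the snippet from the original comment; it is a hypothetical toy model in which a validity list marks the second list as null while its offsets still span two unused values.

```python
# Toy model: a null list entry that still has values behind it
# (illustrative only, not pyarrow).
values = [1, 2, 98, 99, 3]     # 98 and 99 sit behind the null list
offsets = [0, 2, 4, 5]         # the null entry still advances the offset
valid = [True, False, True]    # the second list is null


def to_lists(values, offsets, valid):
    """Rebuild lists from offsets, emitting None for null entries."""
    return [
        values[a:b] if ok else None
        for a, b, ok in zip(offsets, offsets[1:], valid)
    ]


def flatten_skipping_nulls(values, offsets, valid):
    """Concatenate only the values that belong to non-null lists."""
    out = []
    for a, b, ok in zip(offsets, offsets[1:], valid):
        if ok:
            out.extend(values[a:b])
    return out


# Raw values and offsets agree with each other.
from_raw = to_lists(values, offsets, valid)

# Dropping the hidden values shifts everything after the null entry,
# so the original offsets no longer point at the right elements.
flat = flatten_skipping_nulls(values, offsets, valid)
from_flat = to_lists(flat, offsets, valid)

print(from_raw, flat, from_flat)
```

Under these assumptions, the last list comes back empty when the flattened values are indexed with the original offsets, which is the mismatch the comment describes.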
Oh, now I remember, and I think I understand better what you are saying here. For the fixed size lists, the values behind a null entry in the list are removed when we call `Flatten()`. When we go back to try to reconstruct the lists based on offsets, the offsets produced by `value_offset` are all invalid since they don't account for the values we dropped in `Flatten()`.
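A rough illustration of the fixed-size case, again as a plain-Python toy model and not the Arrow implementation: `value_offset` here is a hypothetical stand-in that computes `i * list_size`, and `flatten_skipping_nulls` imitates a flatten that drops the slots behind null entries.

```python
# Toy model of a fixed-size list array with list_size == 2
# (illustrative only, not pyarrow).
list_size = 2
values = [1, 2, 0, 0, 3, 4]    # the slots behind the null entry are still stored
valid = [True, False, True]    # the second list is null


def value_offset(i, list_size):
    """Start of list i, computed purely from its position."""
    return i * list_size


def to_lists(values, valid, list_size):
    """Rebuild lists using position-based offsets; None for null entries."""
    result = []
    for i, ok in enumerate(valid):
        start = value_offset(i, list_size)
        result.append(values[start:start + list_size] if ok else None)
    return result


def flatten_skipping_nulls(values, valid, list_size):
    """Concatenate only the slots that belong to non-null lists."""
    out = []
    for i, ok in enumerate(valid):
        if ok:
            start = value_offset(i, list_size)
            out.extend(values[start:start + list_size])
    return out


from_raw = to_lists(values, valid, list_size)

# After dropping the null entry's slots, position-based offsets overshoot.
flat = flatten_skipping_nulls(values, valid, list_size)
from_flat = to_lists(flat, valid, list_size)

print(from_raw, flat, from_flat)
```

Because the offsets are derived from position alone, every list after a dropped null entry is read from the wrong place in this model.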