GH-20512: [Python] Numpy conversion doesn't account for ListArray offset #15210
Conversation
wjones127 commented Jan 5, 2023 • edited by github-actions bot
- Closes: [Python] Quadratic memory usage of Table.to_pandas with nested data #20512
```cpp
auto values = checked_cast<ListArray*>(arr_sliced.get())->values();
auto expected_values = ArrayFromJSON(int16(), "[1, 2, 3, 4, 5]");
AssertArraysEqual(*expected_values, *values);
```
Should we make this expected behavior or not?
I just commented on the JIRA as well, but in the current design we shouldn't change that. It is `Flatten()` that already implements this, and the values being unsliced ensures that they still match the offsets of a sliced ListArray.
It is certainly confusing, though. I wonder if we should give it a scarier name like `raw_values`.
+1 on renaming to `raw_values`. Or at the very least, we should modify the doc comment; right now it's not clear that it doesn't account for offset and length.
The problem with `Flatten()`, though, is that it removes the null values from the values array, but in this case we want them.
> The problem with Flatten though is that it removes the null values from the values array, but in this case we want them.
Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?
Is this corner case covered by existing tests?
I should add a test.
And you might be right about flatten: it does include the nulls inside the values (though not the nulls of the lists themselves). For some reason I had gotten the impression while debugging that it doesn't.
```python
>>> import pyarrow as pa
>>> arr = pa.array([[1, 2], [3, 4, 5], [6, None], [7, 8]])
>>> arr.flatten()
<pyarrow.lib.Int64Array object at 0x129b2c820>
[
  1,
  2,
  3,
  4,
  5,
  6,
  null,
  7,
  8
]
```
Yes, but your example uses an actual null inside a list, not a "null list". For example, here the first null is not in the flattened output:
```python
In [3]: arr = pa.array([[1, 2], None, [3, None]])

In [4]: arr.flatten()
Out[4]:
<pyarrow.lib.Int64Array object at 0x7f8b1fbcaa40>
[
  1,
  2,
  3,
  null
]
```
And behind that null there can be any data (the default when constructed as above is that there is no data behind it, so the offset doesn't increment for that list element):
```python
In [5]: arr.offsets
Out[5]:
<pyarrow.lib.Int32Array object at 0x7f8b4f0f5a80>
[
  0,
  2,
  2,
  4
]
```
It's a bit tricky to construct manually, but something like:
```python
In [10]: arr = pa.ListArray.from_arrays(
    ...:     pa.array([0, 2, 4, 6]),
    ...:     pa.array([1, 2, 99, 99, 3, None]),
    ...:     mask=pa.array([False, True, False]))

In [11]: arr
Out[11]:
<pyarrow.lib.ListArray object at 0x7f8b1fbcba60>
[
  [
    1,
    2
  ],
  null,
  [
    3,
    null
  ]
]

In [12]: arr.flatten()
Out[12]:
<pyarrow.lib.Int64Array object at 0x7f8b4f065960>
[
  1,
  2,
  3,
  null
]

In [13]: arr.values
Out[13]:
<pyarrow.lib.Int64Array object at 0x7f8b4f065780>
[
  1,
  2,
  99,
  99,
  3,
  null
]

In [14]: arr.offsets
Out[14]:
<pyarrow.lib.Int32Array object at 0x7f8b1fbc9300>
[
  0,
  2,
  4,
  6
]
```
But I am not sure whether, for this case of converting to numpy, you actually need the flattened values (with nulls removed) or not.
The offsets still assume those unused values are present, so it was maybe actually a good call to conclude that the flattened values (with the values behind nulls removed) were not the correct thing to use here.
> Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?
Oh, now I remember, and I think I understand better what you are saying here. For fixed-size lists, the values behind a null entry in the list are removed when we call `Flatten()`. When we then try to reconstruct the lists based on offsets, the offsets produced by `value_offset` are all invalid, since they don't account for the values we dropped in `Flatten()`.
I re-ran the original reproduction and it seems memory usage is no longer quadratic:
Code for test:

Write test file:

```python
import numpy as np
import random
import string
import tracemalloc
import pyarrow as pa
import pyarrow.parquet as pq

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 256_000
filename = 'nested_pandas.parquet'
arr_len = 10

nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
        [{
            'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
            'b': None if i % 100 == 0 else random.choice(range(100)),
            'c': None if i % 10 == 0 else make_random_string(5)
        } for i in range(arr_len)]
    ))

table = pa.table({'c1': nested_col})
# table = pa.table({
#     'c1': pa.array([list(range(random.randint(1, 20))) for _ in range(nrows)])
# })

# Writing to .parquet and loading it into arrow again
pq.write_table(table, filename)
```

Then measure:

```python
import tracemalloc
import pyarrow.parquet as pq

filename = '/Users/willjones/Documents/arrows/arrow/python/nested_pandas.parquet'

tracemalloc.start()
table_from_parquet = pq.read_table(filename)
out = table_from_parquet.to_pandas()
print(tracemalloc.get_traced_memory())
```
python/pyarrow/tests/test_pandas.py (outdated):

```diff
@@ -4513,3 +4513,27 @@ def test_does_not_mutate_timedelta_nested():
     df = table.to_pandas()

     assert df["timedelta_2"][0].to_pytimedelta() == timedelta_2[0]


+def test_list_no_duplicate_base():
```
There is a `TestConvertListTypes` class that groups some list-type-related tests; maybe this can be moved there.
I can move it there.
@jorisvandenbossche @wjones127 I think this might be a release blocker. I am happy to mark the issue as a blocker and add it to the release if it gets reviewed / merged.
Yes, agreed it would be nice to include this one in the release, given the quadratic memory issue. I did a review, and all looks good to me. @wjones127 I am just going to add one more test with the "hidden" null values (the reason we can't use `Flatten`), which I think is currently not covered by the tests (unless it was already covered by pre-existing tests)?
+1, thank you for working on this @wjones127!
Benchmark runs are scheduled for baseline = 705e04b and contender = 2b50694. 2b50694 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.