
GH-20512: [Python] Numpy conversion doesn't account for ListArray offset #15210

Merged 7 commits on Jan 17, 2023
14 changes: 14 additions & 0 deletions cpp/src/arrow/array/array_list_test.cc
@@ -509,6 +509,18 @@ class TestListArray : public ::testing::Test {
ASSERT_RAISES(Invalid, ValidateOffsets(2, {0, 7, 4}, values));
}

void TestSliced() {
auto arr = ArrayFromJSON(list(int16()), "[[1, 2], [3, 4, 5], [6], [7, 8]]");

auto arr_sliced = arr->Slice(0, 2);
auto expected_sliced = ArrayFromJSON(list(int16()), "[[1, 2], [3, 4, 5]]");
AssertArraysEqual(*expected_sliced, *arr_sliced);

auto values = checked_cast<ListArray*>(arr_sliced.get())->values();
auto expected_values = ArrayFromJSON(int16(), "[1, 2, 3, 4, 5]");
AssertArraysEqual(*expected_values, *values);
Member Author:
Should we make this expected behavior or not?

Member:
I just commented on the JIRA as well, but in the current design we shouldn't change that. Flatten() already implements this behavior, and keeping the values unsliced ensures that they still match the offsets of a sliced ListArray.

It is certainly confusing, though. I wonder if we should give it a scarier name like "raw_values".
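
For reference, a minimal sketch of the discrepancy being discussed (pyarrow; outputs assumed from the semantics described in this thread):

>>> import pyarrow as pa
>>> arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
>>> sliced = arr.slice(1, 2)
>>> sliced.values.to_pylist()     # child values ignore the parent's slice
[1, 2, 3, 4, 5, 6, 7, 8]
>>> sliced.flatten().to_pylist()  # Flatten() accounts for the slice
[3, 4, 5, 6]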

Member Author:
+1 on renaming to raw_values. Or at the very least, we should update the doc comment; right now it's not clear that it doesn't account for offset and length.

The problem with Flatten, though, is that it removes the null values from the values array, and in this case we want them.

Member:
> The problem with Flatten, though, is that it removes the null values from the values array, and in this case we want them.

Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

Is this corner case covered by existing tests?

Member Author:
I should add a test.

And you might be right about Flatten: it does include the nulls that are in the values (as opposed to null lists). For some reason I had gotten the impression it doesn't while I was debugging earlier.

>>> import pyarrow as pa
>>> arr = pa.array([[1, 2], [3, 4, 5], [6, None], [7, 8]])
>>> arr.flatten()
<pyarrow.lib.Int64Array object at 0x129b2c820>
[
  1,
  2,
  3,
  4,
  5,
  6,
  null,
  7,
  8
]

Member (@jorisvandenbossche), Jan 9, 2023:

Yes, but your example uses an actual null inside a list, not a "null list". For example, here the null list is not represented in the flattened output:

In [3]: arr = pa.array([[1, 2], None, [3, None]])

In [4]: arr.flatten()
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7f8b1fbcaa40>
[
  1,
  2,
  3,
  null
]

And I suppose there can be arbitrary data behind that null (when constructed as above, the default is that there is no data behind it, so the offset doesn't increment for that list element):

In [5]: arr.offsets
Out[5]: 
<pyarrow.lib.Int32Array object at 0x7f8b4f0f5a80>
[
  0,
  2,
  2,
  4
]

It's a bit tricky to construct manually, but something like:

In [10]: arr = pa.ListArray.from_arrays(pa.array([0, 2, 4, 6]), pa.array([1, 2, 99, 99, 3, None]), mask=pa.array([False, True, False]))

In [11]: arr
Out[11]: 
<pyarrow.lib.ListArray object at 0x7f8b1fbcba60>
[
  [
    1,
    2
  ],
  null,
  [
    3,
    null
  ]
]

In [12]: arr.flatten()
Out[12]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065960>
[
  1,
  2,
  3,
  null
]

In [13]: arr.values
Out[13]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065780>
[
  1,
  2,
  99,
  99,
  3,
  null
]

In [14]: arr.offsets
Out[14]: 
<pyarrow.lib.Int32Array object at 0x7f8b1fbc9300>
[
  0,
  2,
  4,
  6
]

But I am not sure whether, for this case of converting to numpy, you actually need the flattened values (with nulls removed) or not.

The offsets still assume those unused values are present, so it was maybe actually a good call that the "flattened" values (with values behind nulls removed) were not the right method to use here.
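
Continuing with the arrays above, a small sketch of that mismatch (outputs assumed): the stored offsets address .values, not the flattened result:

>>> flat = arr.flatten()               # [1, 2, 3, null]
>>> offs = arr.offsets.to_pylist()     # [0, 2, 4, 6]
>>> flat[offs[2]:offs[3]].to_pylist()  # third list should be [3, None]...
[]
>>> arr.values[offs[2]:offs[3]].to_pylist()  # ...the offsets only fit .values
[3, None]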

Member Author:

> Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

Oh, now I remember, and I think I understand what you are saying better now. For fixed-size lists, the values behind a null entry are removed when we call Flatten(). When we then try to reconstruct the lists based on offsets, the offsets produced by value_offset are all invalid, since they don't account for the values we dropped in Flatten().
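
A minimal sketch of that fixed-size-list pitfall (outputs assumed): a null entry still occupies list_size slots in the raw values, and the implicit offset of list i is i * list_size:

>>> import pyarrow as pa
>>> fsl = pa.array([[1, 2], None, [3, 4]], type=pa.list_(pa.int64(), 2))
>>> fsl.values.to_pylist()     # the two slots behind the null entry remain
[1, 2, None, None, 3, 4]
>>> fsl.flatten().to_pylist()  # Flatten() drops them
[1, 2, 3, 4]
>>> # list i occupies [i * 2, (i + 1) * 2) in .values; after Flatten()
>>> # those implicit offsets no longer line up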

}

void TestCornerCases() {
// ARROW-7985
ASSERT_OK(builder_->AppendNull());
@@ -601,6 +613,8 @@ TYPED_TEST(TestListArray, TestFlattenNonEmptyBackingNulls) {

TYPED_TEST(TestListArray, ValidateOffsets) { this->TestValidateOffsets(); }

TYPED_TEST(TestListArray, TestSliced) { this->TestSliced(); }

TYPED_TEST(TestListArray, CornerCases) { this->TestCornerCases(); }

#ifndef ARROW_LARGE_MEMORY_TESTS
1 change: 1 addition & 0 deletions python/pyarrow/src/arrow/python/arrow_to_pandas.cc
@@ -737,6 +737,7 @@ Status ConvertListsLike(PandasOptions options, const ChunkedArray& data,
// Get column of underlying value arrays
ArrayVector value_arrays;
for (int c = 0; c < data.num_chunks(); c++) {
// Note: values() does not account for the parent array's slice offset
const auto& arr = checked_cast<const ListArrayT&>(*data.chunk(c));
if (arr.value_type()->id() == Type::EXTENSION) {
const auto& arr_ext = checked_cast<const ExtensionArray&>(*arr.values());
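
That one-line comment is the crux of the fix: because values() ignores the parent's slice, the conversion has to address into it through the list offsets. A rough Python sketch of the same idea (referenced_values is an illustrative helper, not the PR's actual C++ code; it assumes .offsets reflects the slice):

import pyarrow as pa

def referenced_values(list_arr):
    # Restrict the raw child values to the range this (possibly sliced)
    # list array references, keeping any values behind nulls
    offsets = list_arr.offsets
    start = offsets[0].as_py()
    end = offsets[len(list_arr)].as_py()
    return list_arr.values[start:end]

arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
referenced_values(arr.slice(1, 2)).to_pylist()  # -> [3, 4, 5, 6]
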
13 changes: 13 additions & 0 deletions python/pyarrow/tests/test_pandas.py
@@ -4513,3 +4513,16 @@ def test_does_not_mutate_timedelta_nested():
df = table.to_pandas()

assert df["timedelta_2"][0].to_pytimedelta() == timedelta_2[0]


def test_list_only_once():
    arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
    chunked_arr = pa.chunked_array([arr.slice(0, 2), arr.slice(2, 2)])

    # Converting this chunked array to numpy must account for each
    # chunk's slice offset into the shared values buffer
    np_arr = chunked_arr.to_numpy()

    expected_base = np.array([[1, 2, 3, 4, 5, 6, 7, 8]])
    np.testing.assert_array_equal(np_arr[0].base, expected_base)
    np.testing.assert_array_equal(
        arr.to_numpy(zero_copy_only=False)[0].base, expected_base)