Convert result of group by agg to pyarrow if input is pyarrow #58129

undermyumbrella1 · 2024-04-03T10:48:16Z

closes BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy #53030
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Root cause:

agg_series always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different

Fix:

Replace all NA values with nan.
Convert the results to the respective pyarrow extension array, using pyarrow library methods.
pyarrow library methods are used instead of maybe_convert_objects, because maybe_convert_objects does not check for NA and forces the dtype to float if NA is present (NA is not a float in pyarrow).

mroeschke

Is this related to a particular issue?

pandas/core/groupby/ops.py

pandas/tests/groupby/aggregate/test_aggregate.py

undermyumbrella1 · 2024-04-05T06:49:36Z

I have added the issue to the PR; the PR is still a work in progress.

undermyumbrella1 · 2024-04-08T17:02:00Z

Hi, I have completed the implementation. May I check why the Linux test is failing with NameError: name 'pa' is not defined, when it works on other OSes?

rhshadrach

Thanks for the PR! This appears to me to be a fairly far-reaching change, and I don't yet feel comfortable with it, given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

While I experiment with this some more, a few questions.

pandas/core/dtypes/cast.py

pandas/core/groupby/ops.py

undermyumbrella1 · 2024-04-12T05:22:34Z

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, the output is expected, as that is how pyarrow represents these data structures: a homogeneous int list gets a list type, while a heterogeneous list falls back to object. Alternatively, what would be the expected dtype in this case?

undermyumbrella1 · 2024-04-29T09:54:45Z

Thank you for the review. I have updated the PR according to the comments.

rhshadrach

Two general remarks about the tests:

Use other pandas methods as little as possible; try to construct what you want directly.
It looks like many of the tests added here can be parametrized; can you give that a shot?

For the first, you can do things like

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="..."),
        "b": pd.array([True, False, True], dtype="..."),
    },
    index=pd.Index([1, 2, 3]),
)

instead of using astype, set_index, and the like.
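Both remarks can be combined in one test. The following is a minimal sketch, assuming pytest and pandas are available; the test name, the UDFs, the expected values, and the NumPy "int64" dtype are illustrative choices, not the tests from this PR.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "func, expected_values",
    [
        (lambda x: x.sum(), [4, 2]),
        (lambda x: len(x), [2, 1]),
    ],
)
def test_agg_udf_values(func, expected_values):
    # Build the frame directly instead of chaining astype/set_index.
    df = pd.DataFrame(
        {
            "a": pd.array([1, 2, 1], dtype="int64"),
            "b": pd.array([1, 2, 3], dtype="int64"),
        }
    )
    result = df.groupby("a")["b"].agg(func)
    assert result.tolist() == expected_values
```

Each parameter tuple becomes its own test case, so a failure points at the specific UDF that misbehaves.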

rhshadrach · 2024-04-29T20:47:57Z

pandas/core/groupby/ops.py

+            if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
+                out.dtype.pyarrow_dtype
+            ):
+                out = npvalues


Is there a test that hits this?

Resolved; the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion test hits this.

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

giving us back Arrow dtypes we don't have EAs for

Can you give an example? this confuses me.

should maybe go in dtypes.cast._maybe_cast_to_extension_array

_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.

from pandas.core.dtypes.cast import maybe_cast_pointwise_result

arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
    arr,
    dtype=pd.ArrowDtype(pa.int64()),
    numeric_only=True,
    same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]

@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.

@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?

pandas/tests/extension/test_arrow.py

pandas/tests/groupby/aggregate/test_aggregate.py

undermyumbrella1 · 2024-05-02T07:03:12Z

Thank you for the review. I have made changes according to the PR comments.

rhshadrach · 2024-05-04T12:22:58Z

pandas/core/groupby/ops.py

+            if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
+                out.dtype.pyarrow_dtype
+            ):
+                out = npvalues


@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

rhshadrach · 2024-05-04T12:52:25Z

pandas/tests/extension/test_arrow.py

+            if pa.types.is_date(pa_dtype):
+                return "date32[day][pyarrow]"
+            elif pa.types.is_time(pa_dtype):
+                return "time64[us][pyarrow]"
+            elif pa.types.is_decimal(pa_dtype):
+                return ArrowDtype(pa.decimal128(4, 3))


On closer look, I think this is a bug being introduced here. This test is using .first(), it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype argument of agg_series. When that is true, we should be passing same_dtype=True to maybe_cast_pointwise_result.
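The requested behavior can be sketched in plain Python. Everything below is a hypothetical stand-in: infer_kind and agg_series_kind are made-up names that model the flag handling with type names as strings, and they are not the pandas functions agg_series or maybe_cast_pointwise_result.

```python
def infer_kind(values, input_kind, same_dtype):
    """Hypothetical stand-in for the casting step (not the pandas function)."""
    if same_dtype:
        # Dtype-preserving reductions such as first() keep the input kind.
        return input_kind
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to infer from, so fall back to the input kind.
        return input_kind
    kinds = {type(v).__name__ for v in non_null}
    return kinds.pop() if len(kinds) == 1 else "object"


def agg_series_kind(values, input_kind, preserve_dtype=False):
    """Hypothetical wrapper that forwards preserve_dtype as same_dtype."""
    return infer_kind(values, input_kind, same_dtype=preserve_dtype)


print(agg_series_kind([True, False], "bool", preserve_dtype=True))
print(agg_series_kind([1, 2], "bool"))
print(agg_series_kind([None, None], "bool"))
print(agg_series_kind([1, "a"], "bool"))
```

The design point is that inference only runs when the caller has not asked for the dtype to be preserved, so a reduction like first() never changes dtype.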

rhshadrach · 2024-05-08T22:06:43Z

pandas/tests/extension/test_arrow.py

@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op):
        expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_()))
        tm.assert_series_equal(result, expected)

+    def test_groupby_agg_extension(self, data_for_grouping):


I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?

rhshadrach · 2024-05-15T20:31:59Z

I think merging main once more should resolve the CI issues.

github-actions · 2024-06-15T00:06:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

rhshadrach · 2024-06-19T20:49:07Z

I plan to push this across the finish line.

undermyumbrella1 · 2024-06-20T02:07:16Z

Ah sorry, I’ll get to it this weekend

mroeschke · 2024-08-09T17:30:11Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Kei added 5 commits April 1, 2024 19:04

Set preserve_dtype flag for bool type only when result is also bool

9faa460

Update implementation to change type to pyarrow only

969d5b1

Change import order

66114f3

Convert numpy array to pandas representation of pyarrow array

b0290ed

Add tests

20c8fa0

undermyumbrella1 requested review from rhshadrach and WillAyd as code owners April 3, 2024 10:48

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

97b3d54

mroeschke requested changes Apr 3, 2024

mroeschke added the Apply (Apply, Aggregate, Transform, Map) and Arrow (pyarrow functionality) labels Apr 3, 2024

Kei added 3 commits April 5, 2024 14:19

Change pyarrow to optional import in agg_series() method

932d737

Seperate tests

82ddeb5

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

d510052

undermyumbrella1 marked this pull request as draft April 5, 2024 07:05

Kei added 5 commits April 8, 2024 20:41

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

62a31d9

Revert to old implementation

a54bf58

Update implementation to use pyarrow array method

64330f0

Update test_aggregate tests

0647711

Move pyarrow import to top of method

affde38

undermyumbrella1 marked this pull request as ready for review April 8, 2024 17:01

rhshadrach reviewed Apr 10, 2024

Kei added 5 commits April 12, 2024 13:36

Update according to pr comments

842f561

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

93b5bf3

Fallback convert to input dtype is output is all nan or empty array

6f35c0e

Strip na values when inferring pyarrow dtype

abd0adf

Update tests to check expected inferred dtype instead of inputy dtype

bebc442

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from 8a95274 to 680e238 Compare April 29, 2024 06:57

Fix tests with nested array

3a8597e

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ed27650 to 3a8597e Compare April 29, 2024 08:29

rhshadrach requested changes Apr 29, 2024

Kei added 2 commits May 2, 2024 13:22

Update according to pr comments

6496b15

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

712c36a

undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ad15c86 to 712c36a Compare May 2, 2024 05:23

rhshadrach requested changes May 4, 2024

Kei and others added 4 commits May 7, 2024 12:53

Preserve_dtype if argument is passed in, else don't preserve

e1ccef6

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

0ce083d

Update tests

a1d73f5

Merge branch 'fix/group_by_agg_pyarrow_bool_numpy_same_type' of github.com:undermyumbrella1/pandas into fix/group_by_agg_pyarrow_bool_numpy_same_type

57845a8

rhshadrach requested changes May 8, 2024

undermyumbrella1 and others added 3 commits May 12, 2024 15:39

Remove redundant tests

fa257b0

Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type

0a9b83f

retrigger pipeline

139319a

github-actions bot added the Stale label Jun 15, 2024

rhshadrach self-assigned this Jun 19, 2024

rhshadrach removed the Stale label Jun 22, 2024

rhshadrach removed their assignment Jun 22, 2024

mroeschke closed this Aug 9, 2024

rhshadrach self-assigned this Aug 10, 2024

rhshadrach mentioned this pull request Aug 25, 2024

BUG: groupby.agg with UDF changing pyarrow dtypes #59601

Open

5 tasks
