Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert result of group by agg to pyarrow if input is pyarrow #58129

Conversation

undermyumbrella1
Copy link
Contributor

@undermyumbrella1 undermyumbrella1 commented Apr 3, 2024

Root cause:

  • agg_series always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be different

Fix:

  • replace all NA with nan
  • convert the `results' to respective pyarrow extension array, using pyarrow library methods
  • pyarrow library methods is used instead of maybe_convert_object, as maybe_convert_object does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to a particular issue?

pandas/core/groupby/ops.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map Arrow pyarrow functionality labels Apr 3, 2024
@undermyumbrella1
Copy link
Contributor Author

I have added the issue in the pr, the pr is still work in progress

@undermyumbrella1 undermyumbrella1 marked this pull request as draft April 5, 2024 07:05
@undermyumbrella1 undermyumbrella1 marked this pull request as ready for review April 8, 2024 17:01
@undermyumbrella1
Copy link
Contributor Author

Hi, I have completed the implementation, may i check why linux test if failing with NameError: name 'pa' is not defined, but it works for other os?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This appears to me to be a fairly far reaching, and I don't yet feel comfortable given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

While I experiment with this some more, a few questions.

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/groupby/ops.py Outdated Show resolved Hide resolved
@undermyumbrella1
Copy link
Contributor Author

undermyumbrella1 commented Apr 12, 2024

df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")

result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]

result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object

Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case?

@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from 8a95274 to 680e238 Compare April 29, 2024 06:57
@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ed27650 to 3a8597e Compare April 29, 2024 08:29
@undermyumbrella1
Copy link
Contributor Author

Thank you for the review, I have updated the pr according to comments.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two general remarks about the tests:

  • Use other pandas methods as little as possible; try to construct what you want directly.
  • It looks like many of the tests added here can be parametrized, can you give that a shot.

For the first, you can do things like

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="..."),
        "b": pd.array([True, False, True], dtype="..."),
    },
    index=pd.Index([1, 2, 3]),
)

instead of using astype, set_index, and the like.

Comment on lines +928 to +931
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
out.dtype.pyarrow_dtype
):
out = npvalues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test that hits this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved, the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion test hits this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giving us back a Arrow dtypes we don't have EAs for

Can you give an example? this confuses me.

should maybe go in dtypes.cast._maybe_cast_to_extension_array

_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pandas.core.dtypes.cast import maybe_cast_pointwise_result

arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
    arr, 
    dtype=pd.ArrowDtype(pa.int64()), 
    numeric_only=True, 
    same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.

@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?

pandas/tests/extension/test_arrow.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
@undermyumbrella1 undermyumbrella1 force-pushed the fix/group_by_agg_pyarrow_bool_numpy_same_type branch from ad15c86 to 712c36a Compare May 2, 2024 05:23
@undermyumbrella1
Copy link
Contributor Author

Thank you foe the review, I have made changes according to the pr comments.

Comment on lines +928 to +931
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct(
out.dtype.pyarrow_dtype
):
out = npvalues
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - I was surprised maybe_cast_pointwise_result was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array in a followup. Any thoughts?

Comment on lines 1134 to 1139
if pa.types.is_date(pa_dtype):
return "date32[day][pyarrow]"
elif pa.types.is_time(pa_dtype):
return "time64[us][pyarrow]"
elif pa.types.is_decimal(pa_dtype):
return ArrowDtype(pa.decimal128(4, 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On closer look, I think this is a bug being introduced here. This test is using .first(), it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype argument of agg_series. When that is true, we should be passing same_dtype=True to maybe_cast_pointwise_result.

Copy link
Contributor Author

@undermyumbrella1 undermyumbrella1 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op):
expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_()))
tm.assert_series_equal(result, expected)

def test_groupby_agg_extension(self, data_for_grouping):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

@rhshadrach
Copy link
Member

I think merging main once more should resolve the CI issues.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 15, 2024
@rhshadrach
Copy link
Member

I plan to push this across the finish line.

@rhshadrach rhshadrach self-assigned this Jun 19, 2024
@undermyumbrella1
Copy link
Contributor Author

undermyumbrella1 commented Jun 20, 2024 via email

@rhshadrach rhshadrach removed the Stale label Jun 22, 2024
@rhshadrach rhshadrach removed their assignment Jun 22, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Arrow pyarrow functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy
4 participants