-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert result of group by agg to pyarrow if input is pyarrow #58129
Convert result of group by agg to pyarrow if input is pyarrow #58129
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to a particular issue?
I have added the issue in the pr, the pr is still work in progress |
Hi, I have completed the implementation, may i check why linux test if failing with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This appears to me to be a fairly far reaching, and I don't yet feel comfortable given that we have to consider many different cases since the user can provide an arbitrary UDF. It seems to me that the logic "convert to pyarrow dtypes when we can" could result in some surprising behaviors. For example:
df = DataFrame({"A": ["c1", "c2", "c3"], "B": [100, 200, 255]})
df["B"] = df["B"].astype("bool[pyarrow]")
gb = df.groupby("A")
result = gb.agg(lambda x: [1, 2, 3])
print(result["B"].dtype)
# list<item: int64>[pyarrow]
result = gb.agg(lambda x: [1, 2, "a"])
print(result["B"].dtype)
# object
While I experiment with this some more, a few questions.
Thank you for the review. For this example, it is expected as that is how pyarrow represents these data structures. E.g homogenous int list and heterogenous object. Alternatively, what would be the expected dtype in this case? |
8a95274
to
680e238
Compare
ed27650
to
3a8597e
Compare
Thank you for the review, I have updated the pr according to comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two general remarks about the tests:
- Use other pandas methods as little as possible; try to construct what you want directly.
- It looks like many of the tests added here can be parametrized, can you give that a shot.
For the first, you can do things like
df = pd.DataFrame(
{
"a": pd.array([1, 2, 3], dtype="..."),
"b": pd.array([True, False, True], dtype="..."),
},
index=pd.Index([1, 2, 3]),
)
instead of using astype
, set_index
, and the like.
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct( | ||
out.dtype.pyarrow_dtype | ||
): | ||
out = npvalues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test that hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved, the test_agg_lambda_pyarrow_struct_to_object_dtype_conversion
test hits this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - I was surprised maybe_cast_pointwise_result
was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array
in a followup. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
giving us back a Arrow dtypes we don't have EAs for
Can you give an example? this confuses me.
should maybe go in dtypes.cast._maybe_cast_to_extension_array
_maybe_cast_to_extension_array is only used in maybe_cast_pointwise_result, so not a huge deal either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.core.dtypes.cast import maybe_cast_pointwise_result
arr = np.array([{"number": 1}])
result = maybe_cast_pointwise_result(
arr,
dtype=pd.ArrowDtype(pa.int64()),
numeric_only=True,
same_dtype=False,
)
print(result)
# Length: 1, dtype: struct<number: int64>[pyarrow]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - sorry for the noise, I was not aware we could support struct dtypes. I think everything is okay here.
@undermyumbrella1 - why go with NumPy object dtype instead of struct dtypes here?
ad15c86
to
712c36a
Compare
Thank you foe the review, I have made changes according to the pr comments. |
if isinstance(out.dtype, ArrowDtype) and pa.types.is_struct( | ||
out.dtype.pyarrow_dtype | ||
): | ||
out = npvalues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel - I was surprised maybe_cast_pointwise_result
was giving us back a Arrow dtypes we don't have EAs for. I'm thinking the logic here to prevent this should maybe go in dtypes.cast._maybe_cast_to_extension_array
in a followup. Any thoughts?
pandas/tests/extension/test_arrow.py
Outdated
if pa.types.is_date(pa_dtype): | ||
return "date32[day][pyarrow]" | ||
elif pa.types.is_time(pa_dtype): | ||
return "time64[us][pyarrow]" | ||
elif pa.types.is_decimal(pa_dtype): | ||
return ArrowDtype(pa.decimal128(4, 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On closer look, I think this is a bug being introduced here. This test is using .first()
, it should be preserving the dtype in all cases. The changes in this PR now ignore the preserve_dtype
argument of agg_series
. When that is true, we should be passing same_dtype=True
to maybe_cast_pointwise_result
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
…b.com:undermyumbrella1/pandas into fix/group_by_agg_pyarrow_bool_numpy_same_type
pandas/tests/extension/test_arrow.py
Outdated
@@ -1125,6 +1125,27 @@ def test_comp_masked_numpy(self, masked_dtype, comparison_op): | |||
expected = pd.Series(exp, dtype=ArrowDtype(pa.bool_())) | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_groupby_agg_extension(self, data_for_grouping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test should behave the same as the one in the base class. If that's the case, this can be removed. Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
I think merging main once more should resolve the CI issues. |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
I plan to push this across the finish line. |
Ah sorry, I’ll get to it this weekend
…On Thu, 20 Jun 2024 at 4:49 AM, Richard Shadrach ***@***.***> wrote:
I plan to push this across the finish line.
—
Reply to this email directly, view it on GitHub
<#58129 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A4UEHW5KQ6INWZS7EBRXLFLZIHVFVAVCNFSM6AAAAABFVB4M4WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZTSOBZGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Root cause:
agg_series
always forces output dtype to be the same as input dtype, but depending on the lambda, the output dtype can be differentFix:
maybe_convert_object
, asmaybe_convert_object
does not check for NA, and forces dtype to float if NA is present (NA is not float in pyarrow),