Convert result of group by agg to pyarrow if input is pyarrow #58129

Closed
Changes from 1 commit
38 commits
9faa460
Set preserve_dtype flag for bool type only when result is also bool
Apr 1, 2024
969d5b1
Update implementation to change type to pyarrow only
Apr 2, 2024
66114f3
Change import order
Apr 2, 2024
b0290ed
Convert numpy array to pandas representation of pyarrow array
Apr 3, 2024
20c8fa0
Add tests
Apr 3, 2024
97b3d54
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 3, 2024
932d737
Change pyarrow to optional import in agg_series() method
Apr 5, 2024
82ddeb5
Separate tests
Apr 5, 2024
d510052
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 5, 2024
62a31d9
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 8, 2024
a54bf58
Revert to old implementation
Apr 8, 2024
64330f0
Update implementation to use pyarrow array method
Apr 8, 2024
0647711
Update test_aggregate tests
Apr 8, 2024
affde38
Move pyarrow import to top of method
Apr 8, 2024
842f561
Update according to pr comments
Apr 12, 2024
93b5bf3
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 20, 2024
6f35c0e
Fallback convert to input dtype if output is all nan or empty array
Apr 20, 2024
abd0adf
Strip na values when inferring pyarrow dtype
Apr 20, 2024
bebc442
Update tests to check expected inferred dtype instead of input dtype
Apr 20, 2024
bb6343b
Override test case for test_arrow.py
Apr 21, 2024
3a3f2a2
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
Apr 21, 2024
6dc40f5
Empty commit to trigger build run
Apr 21, 2024
4ef96f7
In agg series, convert to np values, then cast to pyarrow dtype, acco…
Apr 23, 2024
c6a98c0
Update tests
Apr 23, 2024
9181eaf
Update rst docs
Apr 25, 2024
612d7d0
Update impl to fix tests
Apr 25, 2024
3b6696b
Declare variable in outer scope
Apr 25, 2024
680e238
Update impl to use maybe_cast_pointwise_result instead of maybe_cast…
Apr 29, 2024
3a8597e
Fix tests with nested array
Apr 29, 2024
6496b15
Update according to pr comments
May 2, 2024
712c36a
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
May 2, 2024
e1ccef6
Preserve_dtype if argument is passed in, else don't preserve
May 7, 2024
0ce083d
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
undermyumbrella1 May 7, 2024
a1d73f5
Update tests
May 7, 2024
57845a8
Merge branch 'fix/group_by_agg_pyarrow_bool_numpy_same_type' of githu…
May 7, 2024
fa257b0
Remove redundant tests
undermyumbrella1 May 12, 2024
0a9b83f
Merge branch 'main' into fix/group_by_agg_pyarrow_bool_numpy_same_type
undermyumbrella1 May 12, 2024
139319a
retrigger pipeline
undermyumbrella1 May 12, 2024
Convert numpy array to pandas representation of pyarrow array
Kei committed Apr 3, 2024
commit b0290ed659060969325c6eb10e2c9cfa5011fba2
4 changes: 3 additions & 1 deletion pandas/_libs/lib.pyx
@@ -2546,7 +2546,7 @@ def maybe_convert_objects(ndarray[object] objects,
             if not convert_non_numeric:
                 seen.object_ = True
                 break
-        elif util.is_nan(val):
+        elif util.is_nan(val) or is_matching_na(val, C_NA):
             seen.nan_ = True
             mask[i] = True
             if util.is_complex_object(val):
@@ -2555,6 +2555,8 @@ def maybe_convert_objects(ndarray[object] objects,
                 seen.complex_ = True
                 if not convert_numeric:
                     break
+            elif is_matching_na(val, C_NA):
+                floats[i] = complexes[i] = fnan
             else:
                 floats[i] = complexes[i] = val
         elif util.is_bool_object(val):
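The `lib.pyx` change above appears intended to make `maybe_convert_objects` treat `pd.NA` the same way as a float NaN: the position is masked and the float buffer receives NaN instead of the `pd.NA` object. The following is a minimal pure-Python sketch of that idea only, not the Cython implementation; `floats_with_na_as_nan` is a hypothetical helper name introduced for illustration.

```python
import numpy as np
import pandas as pd


def floats_with_na_as_nan(objects):
    """Hypothetical helper: object array -> float64, mapping pd.NA and NaN to NaN."""
    floats = np.empty(len(objects), dtype="float64")
    mask = np.zeros(len(objects), dtype=bool)
    for i, val in enumerate(objects):
        # Mirrors `util.is_nan(val) or is_matching_na(val, C_NA)` in spirit.
        if val is pd.NA or (isinstance(val, float) and np.isnan(val)):
            floats[i] = np.nan  # pd.NA cannot be stored in a float buffer
            mask[i] = True
        else:
            floats[i] = val
    return floats, mask


values = np.array([1.5, pd.NA, float("nan"), 2], dtype=object)
floats, mask = floats_with_na_as_nan(values)
```

Here `mask` records which positions were missing, so a caller could still distinguish them after the values have been coerced to `float64`.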
27 changes: 18 additions & 9 deletions pandas/core/groupby/ops.py
@@ -918,20 +918,29 @@ def agg_series(
         np.ndarray or ExtensionArray
         """
 
-        result = self._aggregate_series_pure_python(obj, func)
-        npvalues = lib.maybe_convert_objects(result, try_float=False)
-
-        if isinstance(obj._values, ArrowExtensionArray):
-            # convert to pyarrow extension
-            pyarrow_dtype = pa.from_numpy_dtype(npvalues.dtype)
-            pandas_pyarrow_dtype = ArrowDtype(pyarrow_dtype)
-            out = pd_array(npvalues, dtype=pandas_pyarrow_dtype)
-        elif not isinstance(obj._values, np.ndarray) or preserve_dtype:
+        if not isinstance(obj._values, np.ndarray) and not isinstance(
+            obj._values, ArrowExtensionArray
+        ):
             # we can preserve a little bit more aggressively with EA dtype
             # because maybe_cast_pointwise_result will do a try/except
             # with _from_sequence. NB we are assuming here that _from_sequence
             # is sufficiently strict that it casts appropriately.
+            preserve_dtype = True
+
+        result = self._aggregate_series_pure_python(obj, func)
+
+        npvalues = lib.maybe_convert_objects(result, try_float=False)
+        if preserve_dtype:
             out = maybe_cast_pointwise_result(npvalues, obj.dtype, numeric_only=True)
+        elif (
+            isinstance(obj._values, ArrowExtensionArray)
+            and npvalues.dtype != np.dtype("object")
+            and npvalues.dtype != np.dtype("complex128")
+        ):
+            pyarrow_dtype = pa.from_numpy_dtype(npvalues.dtype)
+            pandas_pyarrow_dtype = ArrowDtype(pyarrow_dtype)
+            out = pd_array(npvalues, dtype=pandas_pyarrow_dtype)
+
         else:
             out = npvalues
         return out
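The new `elif` branch in `agg_series` can be read as: when the input was Arrow-backed and the aggregated values have a NumPy dtype other than `object` or `complex128`, map that dtype to a pyarrow type and rebuild the result as an Arrow-backed array. Below is a rough standalone sketch of that branch using public APIs (`pa.from_numpy_dtype`, `pd.ArrowDtype`, `pd.array`). The helper name `to_arrow_if_possible` and its `input_is_arrow` flag are assumptions made for illustration and do not exist in pandas; the optional-import guard is also an addition so the sketch runs without pyarrow.

```python
import numpy as np
import pandas as pd

try:
    import pyarrow as pa
except ImportError:  # pyarrow is an optional dependency of pandas
    pa = None


def to_arrow_if_possible(npvalues, input_is_arrow):
    """Hypothetical helper mirroring the guarded pyarrow conversion in the diff."""
    if (
        pa is not None
        and input_is_arrow
        # Same dtype guard as the diff; these two dtypes are left as NumPy.
        and npvalues.dtype != np.dtype("object")
        and npvalues.dtype != np.dtype("complex128")
    ):
        arrow_dtype = pd.ArrowDtype(pa.from_numpy_dtype(npvalues.dtype))
        return pd.array(npvalues, dtype=arrow_dtype)
    # Fall through: keep the NumPy result unchanged.
    return npvalues


agg_result = np.array([3, 7], dtype="int64")
converted = to_arrow_if_possible(agg_result, input_is_arrow=True)

object_result = np.array(["a", None], dtype=object)
untouched = to_arrow_if_possible(object_result, input_is_arrow=True)
```

With pyarrow installed, `converted` should come back with an Arrow-backed dtype, while the object-dtype result is returned as-is because it fails the dtype guard.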