PERF: future_stack=True with non-MultiIndex columns #58817

Draft: wants to merge 9 commits into base: main
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -473,6 +473,7 @@ Performance improvements
 - Performance improvement in :meth:`RangeIndex.reindex` returning a :class:`RangeIndex` instead of a :class:`Index` when possible. (:issue:`57647`, :issue:`57752`)
 - Performance improvement in :meth:`RangeIndex.take` returning a :class:`RangeIndex` instead of a :class:`Index` when possible. (:issue:`57445`, :issue:`57752`)
 - Performance improvement in :func:`merge` if hash-join can be used (:issue:`57970`)
+- Performance improvement in :meth:`DataFrame.stack` when using ``future_stack=True`` and the DataFrame does not have a :class:`MultiIndex` (:issue:`58391`)
 - Performance improvement in :meth:`to_hdf` avoid unnecessary reopenings of the HDF5 file to speedup data addition to files with a very large number of groups . (:issue:`58248`)
 - Performance improvement in ``DataFrameGroupBy.__len__`` and ``SeriesGroupBy.__len__`` (:issue:`57595`)
 - Performance improvement in indexing operations for string dtypes (:issue:`56997`)
37 changes: 24 additions & 13 deletions pandas/core/reshape/reshape.py
@@ -934,7 +934,20 @@ def stack_v3(frame: DataFrame, level: list[int]) -> Series | DataFrame:
         [k for k in range(frame.columns.nlevels - 1, -1, -1) if k not in set_levels]
     )
 
-    result = stack_reshape(frame, level, set_levels, stack_cols)
+    result: Series | DataFrame
+    if not isinstance(frame.columns, MultiIndex):
+        # GH#58817 Fast path when we're stacking the columns of a non-MultiIndex.
+        # When columns are homogeneous EAs, we pass through object
+        # dtype but this is still slightly faster than the normal path.
+        if len(frame.columns) > 0 and frame._is_homogeneous_type:
+            dtype = frame._mgr.blocks[0].dtype
Comment on lines +942 to +943
Member Author

Not sure if there is a more canonical way to get the dtype in this case.

+        else:
+            dtype = None
+        result = frame._constructor_sliced(
+            frame._values.reshape(-1, order="F"), dtype=dtype
+        )
+    else:
+        result = stack_reshape(frame, level, set_levels, stack_cols)
 
     # Construct the correct MultiIndex by combining the frame's index and
     # stacked columns.
@@ -1016,6 +1029,8 @@ def stack_reshape(
     -------
     The data behind the stacked DataFrame.
     """
+    # Non-MultiIndex columns take a fast path.
+    assert isinstance(frame.columns, MultiIndex)
     # If we need to drop `level` from columns, it needs to be in descending order
     drop_levnums = sorted(level, reverse=True)

@@ -1025,18 +1040,14 @@ def stack_reshape(
         if len(frame.columns) == 1:
             data = frame.copy(deep=False)
         else:
-            if not isinstance(frame.columns, MultiIndex) and not isinstance(idx, tuple):
-                # GH#57750 - if the frame is an Index with tuples, .loc below will fail
-                column_indexer = idx
-            else:
-                # Take the data from frame corresponding to this idx value
-                if len(level) == 1:
-                    idx = (idx,)
-                gen = iter(idx)
-                column_indexer = tuple(
-                    next(gen) if k in set_levels else slice(None)
-                    for k in range(frame.columns.nlevels)
-                )
+            # Take the data from frame corresponding to this idx value
+            if len(level) == 1:
+                idx = (idx,)
+            gen = iter(idx)
+            column_indexer = tuple(
+                next(gen) if k in set_levels else slice(None)
+                for k in range(frame.columns.nlevels)
+            )
             data = frame.loc[:, column_indexer]
 
         if len(level) < frame.columns.nlevels:
10 changes: 9 additions & 1 deletion pandas/tests/extension/base/reshaping.py
@@ -3,6 +3,8 @@
 import numpy as np
 import pytest
 
+from pandas.core.dtypes.dtypes import NumpyEADtype
+
 import pandas as pd
 import pandas._testing as tm
 from pandas.api.extensions import ExtensionArray
@@ -266,7 +268,13 @@ def test_stack(self, data, columns, future_stack):
         expected = expected.astype(object)
 
         if isinstance(expected, pd.Series):
-            assert result.dtype == df.iloc[:, 0].dtype
+            if future_stack and isinstance(data.dtype, NumpyEADtype):
+                # GH#58817 future_stack=True constructs the result specifying the dtype
+                # using the dtype of the input; we thus get the underlying
+                # NumPy dtype as the result instead of the NumpyExtensionArray
+                assert result.dtype == df.iloc[:, 0].to_numpy().dtype
+            else:
+                assert result.dtype == df.iloc[:, 0].dtype
         else:
             assert all(result.dtypes == df.iloc[:, 0].dtype)
