DEPR/BUG: DataFrame.stack including all null rows when stacking multiple levels #53515
Description
Ref: #53094 (comment)
From the link above, DataFrame.stack
with dropna=False
will include column combinations that did not exist in the original DataFrame; the values are necessarily all null:
>>> df = pd.DataFrame(np.arange(6).reshape(2,3), columns=pd.MultiIndex.from_tuples([('A','x'), ('A','y'), ('B','z')], names=['Upper', 'Lower']))
>>> df
Upper A B
Lower x y z
0 0 1 2
1 3 4 5
>>> df.stack(level=[0,1], dropna=False)
Upper Lower
0 A x 0.0
y 1.0
z NaN
B x NaN
y NaN
z 2.0
1 A x 3.0
y 4.0
z NaN
B x NaN
y NaN
z 5.0
dtype: float64
I do not think it would be expected for this operation to essentially invent combinations of the stacked columns that did not exist in the original DataFrame. I propose we deprecate this behavior, and only level combinations that occur in the original DataFrame should be included in the result.
The current implementation of DataFrame.stack does not appear easily modified to enforce this deprecation. The following function (which is a bit ugly and could use quite some refinement) implements stack that would be enforced upon deprecation. The implementation is only for the case of multiple levels (so the input must have a MultiIndex).
Implementation of new_stack
def new_stack(df, levels):
stack_cols = (
df.columns
.droplevel([k for k in range(df.columns.nlevels) if k not in levels])
)
_, taker = np.unique(levels, return_inverse=True)
if len(levels) > 1:
# Arrange columns in the order we want to take them
ordered_stack_cols = stack_cols.reorder_levels(taker)
else:
ordered_stack_cols = stack_cols
buf = []
for idx in stack_cols.unique():
if not isinstance(idx, tuple):
idx = (idx,)
# Take the data from df corresponding to this idx value
gen = iter(idx)
column_indexer = tuple(
next(gen) if k in levels else slice(None)
for k in range(df.columns.nlevels)
)
data = df.loc[:, column_indexer]
# When len(levels) == df.columns.nlevels, we're stacking all columns
# and end up with a Series
if len(levels) < df.columns.nlevels:
data.columns = data.columns.droplevel(levels)
buf.append(data)
result = pd.concat(buf)
# Construct the correct MultiIndex by combining the input's index and
# stacked columns.
if isinstance(df.index, MultiIndex):
index_levels = [level.unique() for level in df.index.levels]
else:
index_levels = [
df.index.get_level_values(k).values for k in range(df.index.nlevels)
]
column_levels = [
ordered_stack_cols.get_level_values(e).unique()
for e in range(ordered_stack_cols.nlevels)
]
if isinstance(df.index, MultiIndex):
index_codes = np.tile(df.index.codes, (1, len(result) // len(df)))
else:
index_codes = np.tile(np.arange(len(df)), (1, len(result) // len(df)))
index_codes = [e for e in index_codes]
if isinstance(stack_cols, MultiIndex):
column_codes = ordered_stack_cols.drop_duplicates().codes
else:
column_codes = [np.arange(stack_cols.nunique())]
column_codes = [np.repeat(codes, len(df)) for codes in column_codes]
index_names = df.index.names
column_names = list(ordered_stack_cols.names)
result.index = pd.MultiIndex(
levels=index_levels + column_levels,
codes=index_codes + column_codes,
names=index_names + column_names,
)
# sort result, but faster than calling sort_index since we know the order we need
len_df = len(df)
n_uniques = len(ordered_stack_cols.unique())
idxs = (
np.tile(len_df * np.arange(n_uniques), len_df)
+ np.repeat(np.arange(len_df), n_uniques)
)
result = result.take(idxs)
return result
It is not as quite more performant on two/three levels with NumPy dtypes
size = 100000
df = pd.DataFrame(
np.arange(3 * size).reshape(size, 3),
columns=pd.MultiIndex.from_tuples(
[('A','x', 'a'), ('A', 'y', 'a'), ('B','z', 'b')],
names=['l1', 'l2', 'l3'],
),
dtype="int64",
)
%timeit new_stack(df, [0, 1])
%timeit df.stack(['l1', 'l2'], sort=False)
# 10.6 ms ± 77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 22.1 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit new_stack(df, [0, 1, 2])
%timeit df.stack(['l1', 'l2', 'l3'], sort=False)
# 5.69 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 34.1 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But And is more performant on small data (for what it's worth); this is size=1000
with the same code above:
%timeit new_stack(df, [0, 1])
%timeit df.stack(['l1', 'l2'], sort=False)
1.56 ms ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
5.01 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit new_stack(df, [0, 1, 2])
%timeit df.stack(['l1', 'l2', 'l3'], sort=False)
908 µs ± 2.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
5.25 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and it more efficient when working with nullable arrays; this is with size=100000
and dtype="Int64"
:
%timeit new_stack(df, [0, 1, 2])
%timeit df.stack(['l1', 'l2', 'l3'], sort=False)
5.9 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
127 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit new_stack(df, [0, 1, 2])
%timeit df.stack(['l1', 'l2', 'l3'], sort=False)
5.93 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
131 ms ± 526 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In addition, it does not coerce dtypes unnecessarily as the current implementation does (ref: #51059, #17886). This implementation would close those two issues. E.g. this is from the top example (with dropna=True
) that results in float64 today:
Upper Lower
0 A x 0
y 1
B z 2
1 A x 3
y 4
B z 5
dtype: int64
In particular, when operating on Int64 dtypes, the current implementation results in object dtype whereas this implementation results in Int64.