Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
data = {
'column_A_1': ['A', 'B', 'A', None, 'D', 'B', 'A'],
'column_A_2': ['G', 'F', 'J', 'J', 'J', 'F', 'G'],
'column_A_3': ['6602', '7059', '9805', '3080', '8625', '5741', '9685'],
'column_A_4': ['A', 'B', 'A', None, 'A', None, 'B'],
'column_A_4': ['X', None, 'Y', None, 'Z', 'X', 'Y'],
'column_B_1': ['1', '2', '3', '4', '5', '6', '7'],
'column_C_1': [0, 2, 5, 9, 8, 3, 7],
'column_C_2': [12, 75, None, 93, 89, 23, 97],
'column_C_3': [789, 102, 425, 895, None, 795, None],
'column_C_3': [15886, 49828, None, 9898, 8085, 9707, 8049]
}
df = pd.DataFrame(data)
pd.pivot_table(df, index=['column_A_1', 'column_A_2', 'column_A_3', 'column_A_4'], columns=['column_B_1'], values=['column_C_1', 'column_C_2', 'column_C_3'], aggfunc={'column_C_1': 'max', 'column_C_2': 'min', 'column_C_3': 'count'}, dropna=False, margins=False, observed=True)
Issue Description
I have a huge dataset with similar structure to the example. I want to pivot the table grouping using the columns A as the index, the values of the columns B as the new columns and aggregate the values of the columns C. I want all columns B values to appear as columns, even if the entire column is NaN. This is because I want to coalesce values from multiple columns into one. Therefore, the parameter dropna should be equal to False. But the DataFrame I get has 336 rows with impossible combinations. For example, the first row A, F, 3080, X has the entire row filled with NaNs since this combination does not exist.
This is a problem because with a small dataset I wouldn't mind. But with a fairly large dataset, numpy returns an error because it has reached the maximum list size. While reading the documentation, I noticed the parameter:
I thought this parameter fixed this issue. Playing around with this parameter, it does not affect the result, it only adds a row. Here is a result of combining these two parameters.
dropna=False, margins=False (Too many rows)
dropna=True, margins=False (Missing Column B values)
dropna=False, margins=True (Same as dropna=False, margins=False?)
dropna=True, margins=True (Same as dropna=True, margins=False?)
I also noticed this parameter:
But it is deprecated, and the default value of True seems to be the value that I need. Forcing this parameter to True does not change the result.
Expected Behavior
I expect with the parameter's combination dropna=False, margins=False and observed=True to get all the rows with plausible combinations (like if I was grouping by) and all the columns with column B values and columns C values.
I don't know if this is a bug or if it is the intended way for the pivot table to work and this is an enhancement.
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.6
python-bits : 64
OS : Linux
OS-release : 5.10.235-227.919.amzn2.x86_64
Version : #1 SMP Sat Apr 5 16:59:05 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
pip : 24.0
Cython : None
sphinx : 7.2.6
IPython : 8.23.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.3
lxml.etree : 5.1.0
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 15.0.2
pyreadstat : None
pytest : 8.1.1
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
xlsxwriter : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None