
BUG: GroupBy's quantile incompatible with pd.NA #42849

Closed
@JP-Ellis

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Firstly, here's a working example using the default dtypes:

import numpy as np
import pandas as pd

# Default dtypes: int64 keys, float64 values with a NaN
df = pd.DataFrame({
    "x": [1, 1],
    "y": [0.2, np.nan]
})
print(f"dtypes:\n{df.dtypes}")
display(df.groupby("x")["y"].quantile(0.5))
dtypes:
x      int64
y    float64
dtype: object
x
1    0.2
Name: y, dtype: float64

In contrast, if I use the pandas nullable dtypes, the code fails:

df = pd.DataFrame({
    "x": [1, 1],
    "y": [0.2, np.nan]
}).astype({"x": pd.Int64Dtype(), "y": pd.Float64Dtype()})
print(f"dtypes:\n{df.dtypes}")
display(df.groupby("x")["y"].quantile(0.5))
dtypes:
x      Int64
y    Float64
dtype: object

TypeError: float() argument must be a string or a number, not 'NAType'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/josh/ipykernel_67869/1525766791.py in <module>
      4 }).astype({"x": pd.Int64Dtype(), "y": pd.Float64Dtype()})
      5 print(f"dtypes:\n{df.dtypes}")
----> 6 display(df.groupby("x")["y"].quantile(0.5))

.../python3.9/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   2450 
   2451         if is_scalar(q):
-> 2452             return self._get_cythonized_result(
   2453                 "group_quantile",
   2454                 aggregate=True,

.../python3.9/site-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, cython_dtype, aggregate, numeric_only, needs_counts, needs_values, needs_2d, needs_nullable, min_count, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
   2899                         )
   2900                         continue
-> 2901                 vals = vals.astype(cython_dtype, copy=False)
   2902                 if needs_2d:
   2903                     vals = vals.reshape((-1, 1))

TypeError: float() argument must be a string or a number, not 'NAType'
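
The failing frame in the traceback is `vals.astype(cython_dtype, copy=False)`. A minimal sketch of how that kind of cast can raise this error, assuming (this is not confirmed by the traceback) that `vals` is an object-dtype NumPy array still holding `pd.NA` at that point:

```python
import numpy as np
import pandas as pd

# Assumption: stand-in for the values reaching the cast, as an object array with pd.NA.
vals = np.array([0.2, pd.NA], dtype=object)

try:
    vals.astype(np.float64, copy=False)
    error = None
except TypeError as exc:
    # NumPy calls float() on each element, and pd.NA does not support float().
    error = exc

print(type(error).__name__, error)
```

The exact wording of the message can differ between Python versions.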

What is odd is that the quantile function works fine if we avoid using groupby:

df["y"].quantile(0.5)
0.2

Problem description

I would have expected that changing from the default float64 dtype to pd.Float64Dtype would produce the same results, with pd.NA values being treated in the same way as NaN.
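
Until the two dtypes behave the same, one possible workaround is to convert the nullable column back to plain float64 before grouping. This is only a sketch: the name `y_float` is introduced here for illustration, and it relies on `Series.to_numpy` with `na_value=np.nan` to map `pd.NA` to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1],
    "y": [0.2, np.nan]
}).astype({"x": pd.Int64Dtype(), "y": pd.Float64Dtype()})

# Convert the nullable column to a plain float64 Series, mapping pd.NA to NaN.
y_float = pd.Series(
    df["y"].to_numpy(dtype="float64", na_value=np.nan),
    index=df.index,
    name="y",
)

# Group the float64 values by the original key column.
result = y_float.groupby(df["x"]).quantile(0.5)
print(result)
```

The result is a float64 Series, so the nullable dtype is lost; whether that matters depends on what is done with the output afterwards.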

Output of pd.show_versions()


commit : c7f7443
python : 3.9.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.6-zen1-1-zen
Version : #1 ZEN SMP PREEMPT Thu, 29 Jul 2021 00:21:08 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.1.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Labels

  • Bug
  • Groupby
  • NA - MaskedArrays: Related to pd.NA and nullable extension arrays
  • Regression: Functionality that used to work in a prior pandas version
  • quantile: quantile method
