ERR: consistent error messages for unsupported reduction operations #59580

Open
@jorisvandenbossche

While working on and reviewing some of the string dtype work, we regularly have to update the error messages matched in the tests (as there is a new dtype), which made it clear that we phrase the same message in many different ways.

Focusing specifically on reductions for a moment, these are some of the variations currently in use (all TypeError unless noted otherwise):

# plain Series
datetime64 type does not support operation 'any'
'DatetimeArray' with dtype datetime64[ns] does not support operation 'sum'
Accumulation cumsum not supported for <class 'pandas.core.arrays.datetimes.DatetimeArray'>
cumprod not supported for Timedelta.
(NotImplementedError) cannot perform cummin with type interval[float64, right]
Cannot perform reduction 'any' with string dtype
cannot perform cummin with type string
# Series groupby
agg function failed [how->mean,dtype->object]
cummin is not supported for object dtype
'quantile' cannot be performed against 'object' dtypes!
datetime64 type does not support operation 'sum'
Period type does not support sum operations
'std' and 'sem' are not valid for PeriodDtype
Cannot use quantile with bool dtype
category type does not support sum operations
category dtype does not support aggregation 'mean'

# Script to reproduce the messages above
import numpy as np

import pandas as pd
from pandas import Index, CategoricalIndex, IntervalIndex

# from conftest.py
indices_dict = {
    "string-object": Index([f"pandas_{i}" for i in range(10)], dtype=object),
    "datetime": pd.date_range("2020-01-01", periods=10),
    "datetime-tz": pd.date_range("2020-01-01", periods=10, tz="US/Pacific"),
    "period": pd.period_range("2020-01-01", periods=10, freq="D"),
    "timedelta": pd.timedelta_range(start="1 day", periods=10, freq="D"),
    "range": pd.RangeIndex(10),
    "int8": Index(np.arange(10), dtype="int8"),
    "int16": Index(np.arange(10), dtype="int16"),
    "int32": Index(np.arange(10), dtype="int32"),
    "int64": Index(np.arange(10), dtype="int64"),
    "uint8": Index(np.arange(10), dtype="uint8"),
    "uint16": Index(np.arange(10), dtype="uint16"),
    "uint32": Index(np.arange(10), dtype="uint32"),
    "uint64": Index(np.arange(10), dtype="uint64"),
    "float32": Index(np.arange(10), dtype="float32"),
    "float64": Index(np.arange(10), dtype="float64"),
    "bool-object": Index([True, False] * 5, dtype=object),
    "bool-dtype": Index([True, False] * 5, dtype=bool),
    "complex64": Index(
        np.arange(10, dtype="complex64") + 1.0j * np.arange(10, dtype="complex64")
    ),
    "complex128": Index(
        np.arange(10, dtype="complex128") + 1.0j * np.arange(10, dtype="complex128")
    ),
    "categorical": CategoricalIndex(list("abcd") * 2),
    "interval": IntervalIndex.from_breaks(np.linspace(0, 100, num=11)),
    "empty": Index([]),
    "nullable_int": Index(np.arange(10), dtype="Int64"),
    "nullable_uint": Index(np.arange(10), dtype="UInt16"),
    "nullable_float": Index(np.arange(10), dtype="Float32"),
    "nullable_bool": Index(np.arange(10).astype(bool), dtype="boolean"),
    "string-python": Index(
        pd.array([f"pandas_{i}" for i in range(10)], dtype="string[python]")
    ),
    "string-pyarrow": Index(
        pd.array([f"pandas_{i}" for i in range(10)], dtype="string[pyarrow]")
    ),
}

# Reductions and accumulations to try on every dtype
ops = [
    "any", "all", "min", "max", "sum", "mean", "median", "prod",
    "std", "var", "sem", "kurt", "skew", "cummin", "cummax", "cumsum",
    "cumprod", "quantile",
]

# Plain Series: print every dtype/op combination that raises
for dtype, data in indices_dict.items():
    for op in ops:
        try:
            getattr(pd.Series(data), op)()
        except Exception as e:
            print(dtype, op, type(e), e)

# Series groupby: same ops, with all rows in a single group
for dtype, data in indices_dict.items():
    for op in ops:
        try:
            getattr(pd.Series(data).groupby([0] * len(data)), op)()
        except Exception as e:
            print(dtype, op, type(e), e)

I think it would be useful to harmonize those error messages, both for us as maintainers/contributors (consistency in the code base, easier to test) and for users (a clear and consistent message).
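To illustrate the "easier to test" point, here is a rough sketch using only the standard `re` module. The messages are copied from the list above. `harmonized_pattern` is a hypothetical helper, and the harmonized wording is only an assumption (the first suggestion listed further down), not an agreed-upon format.

```python
import re

# Messages currently raised for an unsupported 'sum' (copied from the list above).
current_messages = [
    "'DatetimeArray' with dtype datetime64[ns] does not support operation 'sum'",
    "datetime64 type does not support operation 'sum'",
    "Period type does not support sum operations",
    "category type does not support sum operations",
]

# Today: a test spanning these dtypes needs an alternation of unrelated phrasings.
current_pattern = "|".join(
    [
        r"does not support operation 'sum'",
        r"type does not support sum operations",
    ]
)
assert all(re.search(current_pattern, msg) for msg in current_messages)


def harmonized_pattern(op: str) -> str:
    """Hypothetical helper: regex for one assumed wording, matching any dtype."""
    return rf"dtype '.+' does not support operation '{re.escape(op)}'"


# With a single wording, the same pattern would cover every dtype.
hypothetical = "dtype 'datetime64[ns]' does not support operation 'sum'"
assert re.search(harmonized_pattern("sum"), hypothetical)
```

In a real test this pattern would presumably be passed as `match=` to `pytest.raises(TypeError, ...)` and parametrized over the operation.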

For a single message, I would definitely want to specify the dtype (and not the array class), and I think it would be useful to quote the operation (and potentially the dtype) so that it stands out clearly.
I have no strong opinion on the actual wording, though. Some potential suggestions for a single dtype/operation:

  1. dtype 'datetime64[ns]' does not support operation 'sum'
  2. 'datetime64[ns]' dtype does not support operation 'sum'
  3. operation 'sum' is not supported for dtype 'datetime64[ns]'
  4. cannot perform reduction 'sum' with 'datetime64[ns]' dtype
  5. cannot use 'sum' with 'datetime64[ns]' dtype

(All of these could also be written without quotes around the dtype.)
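Whichever wording is chosen, building the message in one shared place would keep it consistent. A minimal sketch, assuming the first suggestion; `unsupported_op_message` and its `quote_dtype` flag are hypothetical names for illustration and not existing pandas API:

```python
def unsupported_op_message(dtype, op: str, quote_dtype: bool = True) -> str:
    """Hypothetical helper: build one wording for any dtype/operation pair."""
    # str() is applied via the f-string, so a dtype object or a plain string both work.
    dtype_str = f"'{dtype}'" if quote_dtype else str(dtype)
    return f"dtype {dtype_str} does not support operation '{op}'"


print(unsupported_op_message("datetime64[ns]", "sum"))
# dtype 'datetime64[ns]' does not support operation 'sum'
print(unsupported_op_message("datetime64[ns]", "sum", quote_dtype=False))
# dtype datetime64[ns] does not support operation 'sum'
```

Each array or groupby code path could then raise `TypeError(unsupported_op_message(dtype, op))` instead of formatting its own string.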

Any preferences?
