Skip to content

BUG/API: sum of a string column with all-NaN or empty #60229

Open
@jorisvandenbossche

Description

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()

Currently, you get this:

>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string, and returning object dtype with 0 is also quite useless I think (but at least not inventing a new string in your data).
But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?

cc @rhshadrach @WillAyd @Dr-Irv

Metadata

Assignees

Labels

BugGroupbyStringsString extension data type and string data

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions