BUG/API: sum of a string column with all-NaN or empty

We decided to allow the `sum` operation for the future string dtype (PR in https://github.com/pandas-dev/pandas/pull/59853/, based on discussion in https://github.com/pandas-dev/pandas/issues/59328). 

But I ran into a strange case in groupby where the end result contains `"0"` in case of an empty or all-NaN group.

Reproducible example:

```python
df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()
```

Currently, you get this:

```python
>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)
```

So the "sum" operation has introduced a `0`. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.

However, with `pd.options.future.infer_string = True` enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

```python
>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str
```

So the integer `0` has been converted to a string.

I think we certainly should not introduce this `"0"` string, and returning object dtype with `0` is also quite useless I think (but at least not inventing a new string in your data). 
But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?

cc @rhshadrach @WillAyd @Dr-Irv 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG/API: sum of a string column with all-NaN or empty #60229

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

BUG/API: sum of a string column with all-NaN or empty #60229

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions