BUG/API: sum of a string column with all-NaN or empty #60229
Description
We decided to allow the sum
operation for the future string dtype (PR in #59853, based on discussion in #59328).
But I ran into a strange case in groupby where the end result contains "0"
in case of an empty or all-NaN group.
Reproducible example:
df = pd.DataFrame(
{
"key": [1, 2, 2, 3, 3, 3],
"col1": [np.nan, 2, np.nan, 4, 5, 6],
"col2": [np.nan, "b", np.nan, "d", "e", "f"],
}
)
result = df.groupby("key").sum()
Currently, you get this:
>>> result
col1 col2
key
1 0.0 0
2 2.0 b
3 15.0 def
>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)
So the "sum" operation has introduced a 0
. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.
However, with pd.options.future.infer_string = True
enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:
>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str
So the integer 0
has been converted to a string.
I think we certainly should not introduce this "0"
string, and returning object dtype with 0
is also quite useless I think (but at least not inventing a new string in your data).
But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?