Description
Currently, when aligning introduces a column that one of the objects lacks, align creates it as a new float64 column filled with NaNs.
In [1]: import pandas as pd
In [2]: a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp('2000'), pd.NaT]})
In [3]: b = pd.DataFrame({"A": [1, 2]})
In [4]: a.align(b)[1].dtypes
Out[4]:
A int64
B float64
dtype: object
I think it'd be more useful for the dtypes of new columns to be the same as the dtype from the other.
# proposed behavior
In [4]: a.align(b)[1].dtypes
Out[4]:
A int64
B datetime64[ns]
dtype: object
The newly created B column has dtype datetime64[ns], the same as a.B.
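Until something like this exists, the proposed result can be approximated by hand. The sketch below is only an illustration: `align_keep_dtypes` is a hypothetical helper, not pandas API. It rebuilds each column that alignment created by reindexing an empty slice of the source column, so pandas picks the NA value itself. Dtypes with no NA value (such as int64) still upcast to float64, as they do today.

```python
import pandas as pd


def align_keep_dtypes(left, right):
    """Hypothetical helper: align, then rebuild newly created columns so
    they carry the dtype of the matching column in the other frame."""
    left_al, right_al = left.align(right)
    for src, other, dst in ((left, right, right_al), (right, left, left_al)):
        # Columns present in `src` but absent from `other` were created in `dst`.
        for col in src.columns.difference(other.columns):
            # Reindexing an empty slice yields an all-NA column and keeps the
            # source dtype whenever that dtype can hold missing values.
            dst[col] = src[col].iloc[:0].reindex(dst.index)
    return left_al, right_al


a = pd.DataFrame({"A": [1, 2], "B": [pd.Timestamp("2000"), pd.NaT]})
b = pd.DataFrame({"A": [1, 2]})

_, b_aligned = align_keep_dtypes(a, b)
print(b_aligned.dtypes)  # B now has the same dtype as a.B
```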
This proposal would make the fill_value keyword a bit more complex.
- The default of np.nan would change to None, which means "the right NA value for the dtype".
- We would maybe need to accept a Mapping so users could specify specific fill values per column.
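One way to picture the two points above is a small resolver that turns a fill_value argument into the value used for a given new column. This is a rough sketch under assumptions: `resolve_fill_value` is a hypothetical name, and the rule that a column missing from the Mapping falls back to the dtype's NA value is a guess at sensible semantics, not something the proposal specifies.

```python
from collections.abc import Mapping

import numpy as np
import pandas as pd


def resolve_fill_value(fill_value, column, dtype):
    """Hypothetical helper: pick the fill value for one newly created column."""
    if isinstance(fill_value, Mapping):
        # Assumption: a column absent from the mapping uses the dtype default.
        fill_value = fill_value.get(column)
    if fill_value is not None:
        return fill_value
    # None means "the right NA value for the dtype".
    if isinstance(dtype, pd.api.extensions.ExtensionDtype):
        return dtype.na_value
    if dtype.kind in "mM":  # timedelta64 / datetime64
        return pd.NaT
    return np.nan


na_for_datetime = resolve_fill_value(None, "B", np.dtype("datetime64[ns]"))
na_for_nullable_int = resolve_fill_value(None, "C", pd.Int64Dtype())
per_column = resolve_fill_value({"A": 0}, "A", np.dtype("int64"))
fallback = resolve_fill_value({"A": 0}, "B", np.dtype("float64"))
```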
I think this would make the workaround in #31679 unnecessary, as we'd have the correct dtype going into the operation.
If we think this is a good idea, it's probably an API breaking change. We might be able to deprecate this cleanly by (ab)using fill_value. We would warn when creating new columns.
if new_columns and fill_value is no_default:
    warnings.warn("Creating new float64 columns filled with NaN. In the future... "
                  "Specify fill_value=None to accept the future behavior now.")
    fill_value = np.nan
Unfortunately, that warning would also be triggered in the background during binops, since they align their operands implicitly. Not sure how to get around that, aside from instructing users to explicitly align first.
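For reference, "explicitly align first" could look like the following with today's API. The frames and the choice of `fill_value=0` are made up for illustration; the point is only that the alignment step, and any fill value it uses, becomes visible in user code instead of happening inside the binop.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})
b = pd.DataFrame({"A": [10, 20]})

# Implicit alignment: B is missing from b, so the result's B is all NaN.
implicit = a + b

# Explicit alignment: the caller picks the fill value for the new column,
# then runs the binop on frames that already share the same labels.
a_aligned, b_aligned = a.align(b, fill_value=0)
explicit = a_aligned + b_aligned
```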