API: value-dependent behaviour in concat with all-NA data #40893

Closed
@jorisvandenbossche

Description

In general, we want to get rid of value-dependent behaviour in concat operations: the resulting dtype of a concat operation should depend only on the input dtypes, and not on the exact content (the exact values) of the inputs.

This has been discussed on several occasions in the past, e.g. in #33607 when adding the general EA interface for concat (there is still one value-dependent special case for Categorical involving integer categories / missing values, encoded in core/dtypes/concat.py::cast_to_common_type), and in #39122 for the case of empty Series/DataFrames.

One other case (which came up recently in e.g. #39574 and #39612) is related to all-NA/NaN objects.

For DataFrames, when there is an all-missing column, its dtype gets ignored when determining the result dtype (which, however, requires inspecting the values of the column). Small example:

>>> df_missing = pd.DataFrame({'a': [np.nan]})
>>> df_dt64 = pd.DataFrame({'a': [pd.Timestamp("2021-01-01")]}, dtype="datetime64[ns]")

>>> pd.concat([df_missing, df_dt64])
           a
0        NaT
0 2021-01-01

>>> pd.concat([df_missing, df_dt64]).dtypes
a    datetime64[ns]
dtype: object

This can be useful, as you can end up with such object/float dtype columns depending on how those "empty" all-NaN DataFrames are created (e.g. when constructing a DataFrame with a given index/columns but without data, or by reindexing the rows of an actual empty DataFrame, or by reindexing the columns of a non-empty DataFrame).
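To make the three creation paths mentioned above concrete, here is a minimal sketch. The variable names are made up for illustration, and the exact dtype of each column (object or float64) may differ between pandas versions, so the snippet prints the dtypes rather than asserting specific ones.

```python
import pandas as pd

# 1. Construct a DataFrame with a given index/columns but without data.
from_constructor = pd.DataFrame(index=[0, 1], columns=["a"])

# 2. Reindex the rows of an actual empty DataFrame.
from_row_reindex = pd.DataFrame(columns=["a"]).reindex([0, 1])

# 3. Reindex the columns of a non-empty DataFrame to a column it lacks.
from_col_reindex = pd.DataFrame({"b": [1, 2]}).reindex(columns=["a"])

# Hypothetical container, only used to loop over the three results.
all_nan_frames = {
    "constructor": from_constructor,
    "row reindex": from_row_reindex,
    "column reindex": from_col_reindex,
}

for name, frame in all_nan_frames.items():
    # Every column "a" is entirely missing; only its dtype varies.
    print(name, frame["a"].dtype, frame["a"].isna().all())
```

In each case the user never chose a dtype for column "a", which is why ignoring it during concat can be convenient.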

However, it does introduce annoying value-dependent behaviour, and it is also not very consistent throughout pandas. For example, Series does not check for this, and the result is object dtype:

>>> pd.concat([df_missing['a'], df_dt64['a']])
0                    NaN
0    2021-01-01 00:00:00
Name: a, dtype: object
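As a sketch of what "not value-dependent" means for the Series case, the following compares the result dtype for an all-NaN float Series with that for a float Series holding an actual value. The variable names are illustrative assumptions; the expectation that both give the same dtype follows from the Series behaviour described above.

```python
import numpy as np
import pandas as pd

ser_dt64 = pd.Series([pd.Timestamp("2021-01-01")], dtype="datetime64[ns]")

# Two float64 Series with the same dtype but different content.
ser_all_nan = pd.Series([np.nan], dtype="float64")
ser_with_value = pd.Series([1.5], dtype="float64")

# Concatenate each with the datetime Series and keep only the result dtype.
dtype_all_nan = pd.concat([ser_all_nan, ser_dt64]).dtype
dtype_with_value = pd.concat([ser_with_value, ser_dt64]).dtype

# If the behaviour is value-independent, both dtypes agree.
same_result = dtype_all_nan == dtype_with_value
print(dtype_all_nan, dtype_with_value, same_result)
```

The DataFrame example earlier in this issue would fail the equivalent check, since the all-NaN column is ignored there.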

Further, this is also not consistent across data types. For example, we don't check for all-NA for the new nullable dtypes.
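A minimal sketch of the nullable-dtype case, assuming an all-NA "Int64" column stands in for "the new nullable dtypes". The frame names are illustrative. The resulting dtype may depend on the pandas version, so the snippet only prints it and makes no claim about the output.

```python
import pandas as pd

# All-NA column using the nullable integer dtype.
df_missing_nullable = pd.DataFrame({"a": pd.array([pd.NA], dtype="Int64")})
df_dt64 = pd.DataFrame({"a": [pd.Timestamp("2021-01-01")]}, dtype="datetime64[ns]")

result = pd.concat([df_missing_nullable, df_dt64])

# Compare this with the float all-NaN example: if no all-NA check is
# applied here, the Int64 dtype takes part in the dtype resolution.
print(result.dtypes)
```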

For ArrayManager, I haven't yet implemented any value-dependent special case (#39612), so in this respect it diverges from the BlockManager behaviour. It would be good to first decide on the desired long-term behaviour.

Labels: API Design, Dtype Conversions (unexpected or buggy dtype conversions), Needs Discussion (requires discussion from core team before further action), Reshaping (concat, merge/join, stack/unstack, explode)
