Regression in DataArrays created from Pandas #10301

@richard-berg

What happened?

Given:

index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()

Now consider:

>>> arr.reindex(index=index2)

In xarray 2023.1.0 this gave a reasonable (if weakly-typed) result.

<xarray.DataArray (index: 3)>
array([1, 1, nan], dtype=object)
Coordinates:
  * index    (index) int64 1 2 4

While upgrading to xarray 2025.3.x + pandas 2.x, my colleagues found it now raises:

TypeError: Cannot interpret 'Int64Dtype()' as a data type

What did you expect to happen?

Ideally, the result would be:

<xarray.DataArray (index: 3)> Size: 27B
PandasExtensionArray(array=<IntegerArray>
[1, 1, <NA>]
Length: 3, dtype: Int64)
Coordinates:
  * index    (index) <U1 12B '1' '2' '4'

Minimal Complete Verifiable Example

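import numpy as np
import pandas as pd
import xarray as xr

index1 = np.array([1, 2, 3])
index2 = np.array([1, 2, 4])
# convert_dtypes() switches the Series to the nullable Int64 extension dtype
srs = pd.Series(index=index1, data=1).convert_dtypes()
arr = srs.to_xarray()
# Label 4 is missing from index1, so reindex must insert a fill value
arr.reindex(index=index2)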

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Anything else we need to know?

The difference is that arr.dtype is now pd.Int64Dtype() rather than np.dtype("object"), as of #8723. While arguably an improvement in typing, the xarray core doesn't seem ready to handle extension dtypes. In this case, core.dtypes.maybe_promote() blindly passes a pandas dtype to np.issubdtype, which cannot interpret it.
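The np.issubdtype failure can be reproduced without xarray at all. The sketch below shows it, followed by one possible guard. issubdtype_or_false is a hypothetical name used only for illustration, not something in xarray, and returning False for extension dtypes is an assumption about what a caller would want.

```python
import numpy as np
import pandas as pd

ext_dtype = pd.Int64Dtype()

# np.issubdtype tries to coerce its first argument to a NumPy dtype,
# which is not possible for a pandas extension dtype.
try:
    np.issubdtype(ext_dtype, np.floating)
    raised = False
except TypeError as err:
    raised = True
    print(err)


def issubdtype_or_false(dtype, kind) -> bool:
    """Hypothetical guard: only ask NumPy about genuine NumPy dtypes."""
    if pd.api.types.is_extension_array_dtype(dtype):
        # Assumption: an extension dtype is never a subtype of a NumPy kind.
        return False
    return bool(np.issubdtype(dtype, kind))
```

A guard like this only moves the failure further along, as the next paragraph describes.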

Patching this immediate issue is more revealing: reindex then fails when duck_array_ops.where(condition, x, y) tries to coerce x & y to a common dtype. The new extension-array code in as_shared_dtype is not at all general: when y is a scalar (the fill_value from the reindex operation), it simply gives up.
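To illustrate what handling a scalar fill value could look like, here is a minimal sketch that leans on pandas itself. where_extension is a hypothetical helper, not the xarray implementation, and mapping a NaN fill value to the dtype's own missing marker is an assumption.

```python
import numpy as np
import pandas as pd


def where_extension(condition, x, fill_value):
    """Hypothetical: keep x where condition is True, else fill_value,
    preserving the extension dtype of x."""
    # Assumption: a NaN-like scalar means "missing", so use the dtype's NA marker.
    if pd.api.types.is_scalar(fill_value) and pd.isna(fill_value):
        fill_value = x.dtype.na_value
    # Series.where understands extension dtypes; .array unwraps the result.
    return pd.Series(x).where(condition, fill_value).array


x = pd.array([1, 1, 1], dtype="Int64")
keep = np.array([True, True, False])
result = where_extension(keep, x, np.nan)
```

Here result keeps the Int64 dtype and holds pd.NA in the last position, which matches the shape of the ideal reindex output shown earlier.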

Once I understood the cause of the reindex issue above, producing more -- and much more worrisome -- failures was trivial:

>>> arr + 5

TypeError: unsupported operand type(s) for +: 'PandasExtensionArray' and 'int'

>>> np.add(arr, 5)

TypeError: 'PandasExtensionArray' object is not callable

>>> arr.fillna(0)

AttributeError: 'int' object has no attribute 'dtype'
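For comparison, the same three operations succeed on a bare pandas nullable array, so the limitation appears to be in the wrapping layer rather than in the pandas type. This sketch uses only pandas and NumPy and makes no claim about how xarray should forward these calls.

```python
import numpy as np
import pandas as pd

# The same values the reindexed array would ideally hold
ea = pd.array([1, 1, None], dtype="Int64")

shifted = ea + 5           # arithmetic keeps the Int64 dtype and propagates <NA>
via_ufunc = np.add(ea, 5)  # NumPy ufuncs are dispatched back to pandas
filled = ea.fillna(0)      # missing values replaced by a scalar
```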

I'd venture to say that the pandas df.to_xarray() / srs.to_xarray() methods have become foot-guns, bordering on unusable, now that pandas 2.x has reimplemented all of its native datatypes on top of ExtensionArray / ExtensionDtype.

The good news is I have a fix. The bad news is it's pretty invasive, needing careful oversight from someone who actually knows what they're doing. (Before this week I'd never used xarray, nor looked at the numpy / pandas source code.)

For now I might recommend excluding ALL numeric dtypes from being promoted to duck arrays, similar to what #9042 did for datetimes. (Basically everything except Categoricals, which seem to be the one extension type with good coverage in the xarray test suite, and which don't support the vast majority of ufuncs regardless.) That would at least allow people to safely continue using to_xarray() on modern versions of pandas, though you'd lose all the speed & type safety that @ilan-gold worked to achieve in 2024.5 & onward.
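Until something like that lands, a user-side workaround is to drop the extension dtype before converting. The sketch below is one way to do it. to_numpy_backed is a hypothetical helper name, and casting nullable integers and floats to float64 (so that missing values become NaN) is a simplifying assumption that gives up integer typing.

```python
import numpy as np
import pandas as pd


def to_numpy_backed(srs: pd.Series) -> pd.Series:
    """Hypothetical helper: replace a nullable numeric dtype with float64."""
    dtype = srs.dtype
    is_nullable_numeric = pd.api.types.is_extension_array_dtype(dtype) and (
        pd.api.types.is_integer_dtype(dtype) or pd.api.types.is_float_dtype(dtype)
    )
    if not is_nullable_numeric:
        # Categoricals and plain NumPy dtypes pass through unchanged.
        return srs
    # float64 can represent missing entries as NaN.
    values = srs.to_numpy(dtype="float64", na_value=np.nan)
    return pd.Series(values, index=srs.index, name=srs.name)


srs = pd.Series(index=np.array([1, 2, 3]), data=1).convert_dtypes()
plain = to_numpy_backed(srs)
```

Calling plain.to_xarray() instead of srs.to_xarray() should then produce a NumPy-backed DataArray, at the cost of the typing benefits mentioned above.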

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.32.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: None

xarray: 2025.3.1
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.15.2
netCDF4: None
pydap: None
h5netcdf: 1.6.1
h5py: 3.9.0
zarr: 3.0.6
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.3.0
distributed: 2025.3.0
matplotlib: 3.10.1
cartopy: None
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2024.9.0
cupy: 13.4.0
pint: None
sparse: 0.16.0
flox: None
numpy_groupies: None
setuptools: 78.1.0
pip: 25.0.1
conda: None
pytest: 8.3.5
mypy: 1.15.0
IPython: 8.35.0
sphinx: None

Labels: bug, needs triage
