
Enable zero-copy to_dataframe #9792

Open

Description

@rabernat

What is your issue?

Calling Dataset.to_dataframe() currently always produces a memory copy of all arrays. This copy is unnecessary in many scenarios. We should make it possible to convert Xarray objects to Pandas objects without a memory copy.

This behavior may depend on Pandas version. As of 2.2, here are the relevant Pandas docs: https://pandas.pydata.org/docs/user_guide/copy_on_write.html

Here's the key point:

Constructors now copy NumPy arrays by default

The Series and DataFrame constructors will now copy NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
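A minimal sketch of the copy semantics described in those docs, using only NumPy and Pandas (the array name here is illustrative):

```python
import numpy as np
import pandas as pd

arr = np.ones(1_000_000)

# copy=False asks pandas to wrap the existing array rather than copy it
s = pd.Series(arr, copy=False)
print(np.shares_memory(s.values, arr))  # -> True

# in-place mutation of the NumPy array is then visible through pandas
arr[0] = 42.0
print(s.iloc[0])  # -> 42.0
```

This visibility of external mutation is exactly why the constructors copy by default; zero-copy conversion trades that safety for memory savings.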

When we construct DataFrames in Xarray, we do it like this:

xarray/xarray/core/dataset.py

Lines 7386 to 7388 in d5f84dd

broadcasted_df = pd.DataFrame(
    dict(zip(non_extension_array_columns, data, strict=True)), index=index
)

Here's a minimal example:

import numpy as np
import pandas as pd
import xarray as xr
ds = xr.DataArray(np.ones(1_000_000), dims=('x',), name="foo").to_dataset()
df = ds.to_dataframe()
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> False

# inspect the memory locations
print(ds.foo.values.__array_interface__)
print(df.foo.values.__array_interface__)

# compare to this
df2 = pd.DataFrame(
    {
        "foo": ds.foo.values,
    },
    copy=False
)
print(np.shares_memory(df2.foo.values, ds.foo.values))  # -> True

Solution

I propose we add a copy keyword option to Dataset.to_dataframe() (and similar for DataArray) which defaults to True (the current copying behavior) but allows users to pass False when they want a zero-copy conversion.
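As a sketch, here is what the proposed keyword might look like (hypothetical; the `copy` keyword does not exist in xarray yet), alongside a manual workaround that works today:

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.DataArray(np.ones(1_000_000), dims=("x",), name="foo").to_dataset()

# Hypothetical future API (not yet implemented):
# df = ds.to_dataframe(copy=False)

# Workaround available today: build the DataFrame manually with copy=False
df = pd.DataFrame({"foo": ds.foo.values}, copy=False)
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> True
```

The workaround bypasses Dataset.to_dataframe() entirely, so it does not handle broadcasting or multi-dimensional indexes the way the real method does; the proposed keyword would cover those cases.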
