Description
Calling Dataset.to_dataframe() currently always produces a memory copy of every array. For large datasets this is wasteful. We should make it possible to convert Xarray objects to Pandas objects without a memory copy.
This behavior may depend on the Pandas version; as of Pandas 2.2, the relevant docs are the copy-on-write user guide: https://pandas.pydata.org/docs/user_guide/copy_on_write.html
Here's the key point:
Constructors now copy NumPy arrays by default
The Series and DataFrame constructors will now copy NumPy arrays by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
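To make the consequence of that default concrete, here is a small illustration (not from the original issue; it assumes the pandas >= 2.0 semantics quoted above). An in-place mutation of the source NumPy array is only visible in the DataFrame when copy=False is passed:

import numpy as np
import pandas as pd

arr = np.zeros(3)

df_copy = pd.DataFrame({"a": arr})                # pandas >= 2.0 copies by default
df_nocopy = pd.DataFrame({"a": arr}, copy=False)  # opt out of the copy

arr[0] = 99.0  # in-place mutation outside of pandas

print(df_copy["a"].values)    # [0. 0. 0.]  -- unaffected, has its own copy
print(df_nocopy["a"].values)  # [99. 0. 0.] -- shares memory with arr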
When we construct DataFrames in Xarray, we do it like this (lines 7386 to 7388 at commit d5f84dd):
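(The embedded snippet doesn't render here. Paraphrased, the referenced lines boil down to a plain DataFrame construction with no copy argument, roughly as in the sketch below; the names are illustrative, not the exact code at that commit.)

import pandas as pd

def _construct_dataframe(columns, data, index):
    # Sketch of the construction at the referenced lines: each column's
    # NumPy array goes straight into the DataFrame constructor with no
    # copy= argument, so pandas >= 2.0 copies every array.
    return pd.DataFrame(dict(zip(columns, data)), index=index)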
Here's a minimal example:

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.DataArray(np.ones(1_000_000), dims=('x',), name="foo").to_dataset()
df = ds.to_dataframe()
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> False

# can see the memory locations
print(ds.foo.values.__array_interface__)
print(df.foo.values.__array_interface__)

# compare to this
df2 = pd.DataFrame(
    {
        "foo": ds.foo.values,
    },
    copy=False,
)
np.shares_memory(df2.foo.values, ds.foo.values)  # -> True
Solution
I propose we add a copy keyword option to Dataset.to_dataframe() (and similar for DataArray) which defaults to True (the current behavior) but allows users to pass copy=False if that's what they want.
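For illustration only, here is a rough sketch of how such a keyword could behave for the simple one-dimensional case from the example above. The helper name and its narrow scope are hypothetical; a real implementation would live inside Dataset._to_dataframe and would still need to handle broadcasting and MultiIndex construction.

import numpy as np
import pandas as pd
import xarray as xr

def to_dataframe_with_copy_kwarg(ds: xr.Dataset, copy: bool = True) -> pd.DataFrame:
    # Hypothetical sketch: forward `copy` to the pandas constructor.
    # Handles only flat, one-dimensional datasets with no coordinates.
    data = {str(name): var.values for name, var in ds.data_vars.items()}
    return pd.DataFrame(data, copy=copy)

ds = xr.DataArray(np.ones(1_000_000), dims=('x',), name="foo").to_dataset()
df = to_dataframe_with_copy_kwarg(ds, copy=False)
print(np.shares_memory(df.foo.values, ds.foo.values))  # -> True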