Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset constructor always coerces 1D data variables with same name as dim to coordinates #8959

Open
TomNicholas opened this issue Apr 19, 2024 · 10 comments

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Apr 19, 2024

What is your issue?

Whilst xarray's data model appears to allow 1D data variables that have the same name as their dimension, it seems to be impossible to actually create this using the Dataset constructor, as they will always be converted to coordinate variables instead.

We can create a 1D data variable with the same name as it's dimension like this:

In [9]: ds = xr.Dataset({'x': 0})

In [10]: ds
Out[10]: 
<xarray.Dataset> Size: 8B
Dimensions:  ()
Data variables:
    x        int64 8B 0

In [11]: ds.expand_dims('x')
Out[11]: 
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Dimensions without coordinates: x
Data variables:
    x        (x) int64 8B 0

so it seems to be a valid part of the data model.

But I can't get to that situation from the Dataset constructor. This should create the same dataset:

In [15]: ds = xr.Dataset(data_vars={'x': ('x', [0])})

In [16]: ds
Out[16]: 
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Coordinates:
  * x        (x) int64 8B 0
Data variables:
    *empty*

But actually it makes x a coordinate variable (and implicitly creates a pandas Index for it). This means that in this case there is no difference between using the data_vars and coords kwargs to the constructor:

ds = xr.Dataset(coords={'x': ('x', [0])})

In [18]: ds
Out[18]: 
<xarray.Dataset> Size: 8B
Dimensions:  (x: 1)
Coordinates:
  * x        (x) int64 8B 0
Data variables:
    *empty*

This all seems weird to me. I would have thought that if a 1D data variable is allowed, we shouldn't coerce to making it a coordinate variable in the constructor. If anything that's actively misleading.

Note that whilst this came up in the context of trying to avoid auto-creation of 1D indexes for coordinate variables, this issue is actually separate. (xref #8872 (comment))

cc @benbovy who probably has thoughts

@dopplershift
Copy link
Contributor

I think that comes explicitly from netCDF User's Guide and the CF Conventions("coordinate varible")

Though reading now, I guess it's saying all coordinate variables must be 1D with the same name as the dim, not necessarily the converse.

@benbovy
Copy link
Member

benbovy commented Apr 19, 2024

Auto-promoting dimension data variables as dimension coordinates when creating a new Dataset has been indeed the expected behavior so far.

I'm not sure what best we should do, though. On one hand, xr.Dataset(data_vars={"x": [0]}) creating a "x" coordinate looks definitely odd to me too. On the other hand, xr.Dataset({"x": [0]}) not creating a dimension coordinate would feel quite disruptive to me.

@TomNicholas
Copy link
Member Author

Thanks for that context @dopplershift !

The way I see it there are two consistent behaviours:

  1. all variables that are 1D with the same name as the dim must be coordinate variables, and both Dataset.__init__ and .expand_dims() should therefore coerce data variables to coordinate variables if necessary,
  2. variables that are 1D with the same name as the dim do not have to be coordinate variables, and we shouldn't coerce anywhere

On the other hand, xr.Dataset({"x": [0]}) not creating a dimension coordinate would feel quite disruptive to me.

Yeah this would break quite a lot of user code...

@TomNicholas
Copy link
Member Author

TomNicholas commented Apr 19, 2024

On the other hand, xr.Dataset({"x": [0]}) not creating a dimension coordinate would feel quite disruptive to me.

Is it possible to imagine a deprecation cycle for this? One which detects this input pattern and raises a warning telling you to use xr.Dataset(coords={"x": [0]}) instead?

@keewis
Copy link
Collaborator

keewis commented Apr 19, 2024

what we could do is split the behavior between keyword-argument data_vars and positional argument number 1:

def __init__(self, vars=None, /, coords=None, attrs=None, *, data_vars=None):
    ...

where vars promotes to coordinates, but data_vars doesn't (and you can't pass vars and data_vars at the same time).

Then, at some point, we deprecate the positional argument or all positionals (not sure how that part should look like exactly, though).

@benbovy
Copy link
Member

benbovy commented Apr 19, 2024

That looks like a nice solution @keewis, except maybe xr.Dataset(data_vars={"x": [0]}) not creating an "x" coordinate, which would be a breaking change (in theory) and which would require another deprecation cycle?

@keewis
Copy link
Collaborator

keewis commented Apr 19, 2024

true. But I guess people passing data_vars explicitly would be less likely to expect creating a coordinate variable from it, so we could shorten the cycle a bit (like, warn for 2-3 releases then switch to the new behavior).

@dcherian
Copy link
Contributor

We decided to start by raising PendingDeprecationWarning to slowly discourage this usage.

@shoyer
Copy link
Member

shoyer commented Apr 24, 2024

what we could do is split the behavior between keyword-argument data_vars and positional argument number 1:

def __init__(self, vars=None, /, coords=None, attrs=None, *, data_vars=None):
    ...

where vars promotes to coordinates, but data_vars doesn't (and you can't pass vars and data_vars at the same time).

I like this idea. It makes the current behavior more explicit and makes it clear how to opt-in to the new behavior.

@TomNicholas
Copy link
Member Author

#8979 implements this suggestion. The same PendingDeprecationWarning is issued if a variable passed to data_vars is auto-promoted to a coordinate. Eventually we would have to break that auto-promotion and remove the vars positional-only argument to complete the deprecation cycle.

@TomNicholas TomNicholas changed the title Dataset constructor always coerces 1D data variables with same as dim to coordinates Dataset constructor always coerces 1D data variables with same name as dim to coordinates Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: To do
Development

No branches or pull requests

6 participants