
open_mfdataset overwrites variables with different values but overlapping coordinates #4077

Open
malmans2 opened this issue May 18, 2020 · 12 comments
Labels: bug, topic-combine combine/concat/merge

@malmans2 (Contributor)

In the example below I'm opening and concatenating two datasets using open_mfdataset. These datasets have variables with different values but overlapping coordinates. I'm concatenating along y, which is 0...4 in one dataset and 0...5 in the other. The y dimension of the resulting dataset is 0...5 which means that open_mfdataset has overwritten some values without showing any error/warning.

Is this the expected default behavior? I would expect to get at least a warning, but maybe I'm misunderstanding the default arguments.

I tried to play with the arguments, but I couldn't figure out which argument I should change to get an error in these scenarios.

MCVE Code Sample

```python
import xarray as xr
import numpy as np

for i in range(2):
    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 5 + i))},
        coords={"x": np.arange(4), "y": np.arange(5 + i)},
    )
    print(ds)
    ds.to_netcdf(f"tmp{i}.nc")
```

```
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3 4
Data variables:
    foo      (x, y) float64 0.1271 0.6117 0.3769 0.1884 ... 0.853 0.5026 0.3762
<xarray.Dataset>
Dimensions:  (x: 4, y: 6)
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3 4 5
Data variables:
    foo      (x, y) float64 0.2841 0.6098 0.7761 0.0673 ... 0.2954 0.7212 0.3954
```

```python
DS = xr.open_mfdataset("tmp*.nc", concat_dim="y", combine="by_coords")
print(DS)
```

```
<xarray.Dataset>
Dimensions:  (x: 4, y: 6)
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3 4 5
Data variables:
    foo      (x, y) float64 dask.array<chunksize=(4, 6), meta=np.ndarray>
```

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.16.0
distributed: 2.16.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 46.4.0.post20200518
pip: 20.1
conda: None
pytest: None
IPython: 7.13.0
sphinx: None

@mathause (Collaborator)

Yes, xr.combine_by_coords only ensures that the coordinates are monotonically increasing. It does not check (a) that they don't start at the same value (your case) or (b) whether the end of ds0 equals the start of ds1 (which may also be undesirable).

The magic happens here:

```python
def _infer_concat_order_from_coords(datasets):
```

In your case it just uses the rightmost array (compare xr.combine_by_coords([ds0, ds1]) and xr.combine_by_coords([ds1, ds0])).

(Note that concat_dim="y" is ignored when using combine_by_coords).
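To make the tie concrete, here is a minimal sketch of that ordering step, assuming (as in `_infer_concat_order_from_coords` linked above) that positions are derived by dense-ranking the *first* value of each dataset's index along the concat dimension:

```python
import pandas as pd

# First value of the y index for each dataset in the example:
# both tmp0.nc and tmp1.nc start at y = 0.
first_items = pd.Series([0, 0])

# Sketch of the ranking step (variable names follow the snippet
# above, not a public API):
rank = first_items.rank(method="dense")
order = rank.astype(int).values - 1

print(list(order))  # [0, 0]
```

Because both datasets are assigned position 0, whichever one is processed last silently replaces the other.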

@malmans2 (Contributor, Author)

Got it, thanks!
Let me know if it is worth adding some checks. I'd be happy to work on it.

@mathause (Collaborator)

Raising an error when the start values are equal is certainly a good idea. What I am less sure about is what to do when the end is equal to the start - maybe a warning?

The second case would be the following:

```python
print(ds0)
print(ds1)
```

```
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 0 1
Data variables:
    foo ...

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo ...
```

and combine_by_coords would lead to:

```python
xr.combine_by_coords([ds0, ds1])
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 1 2
Data variables:
    foo ...
```

For the first case you can probably check if all elements of order are unique:

```python
order = rank.astype(int).values - 1
```

ps: Overlapping indices are not a problem - it is checked that the result is monotonic:

```python
# Check the overall coordinates are monotonically increasing

@mathause
Copy link
Collaborator

The second part could probably be tested just below this if:

```python
if not (indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing):
```

using

```python
if not indexes.is_unique:
    raise ValueError("")
```

(or a warning)
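Putting the two checks side by side, a rough sketch for the end-equals-start case, with plain pandas indexes standing in for the ones xarray builds internally:

```python
import pandas as pd

# Stand-ins for the per-dataset indexes along the concat dimension,
# mimicking the "end of ds0 equals start of ds1" case above:
indexes = [pd.Index([0, 1]), pd.Index([1, 2])]

combined = indexes[0].append(indexes[1:])  # [0, 1, 1, 2]

# The existing check passes: the combined index is monotonic ...
monotonic = combined.is_monotonic_increasing or combined.is_monotonic_decreasing

# ... but the proposed extra check would catch the duplicated value:
if monotonic and not combined.is_unique:
    print("would raise/warn: duplicate coordinate values at the dataset boundary")
```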

@TomNicholas (Member) commented May 19, 2020

Thanks for reporting this @malmans2!

There are actually two issues here: The minor one is that it should never have been possible to specify concat_dim and combine='by_coords' to open_mfdataset simultaneously. You should have got an error already at that point. xr.combine_by_coords doesn't accept a concat_dim argument, so neither should xr.open_mfdataset(..., combine='by_coords').

The more complex issue is that you can get the same overwriting problem in xr.combine_by_coords alone...

That was actually deliberate, xr.combine_by_coords is only checking the first value of each coord is different, to avoid loading big coordinates into memory. (see this line) As the first y value is 0 in both cases it's just saying "we have a match!" and overwriting.

@shoyer we discussed that PR (#2616) extensively, but I can't see an explicit record of discussing that particular line?

But since then @dcherian has done work on the options which vary the strictness of checking - should compat also vary this behaviour?

EDIT: (sorry for repeating what was said above, I wrote this reply last night and sent it today)

@dcherian (Contributor)

What is the expected outcome here? An error?

The only way I can think of to combine these two datasets without losing data is to do combine_nested([ds0, ds1], concat_dim="new_dim").
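For reference, a tiny sketch of that workaround (ds0/ds1 are made-up stand-ins with one overlapping x value):

```python
import xarray as xr

ds0 = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": [0, 1]})
ds1 = xr.Dataset({"foo": ("x", [20.0, 30.0])}, coords={"x": [1, 2]})

# Concatenating along a brand-new dimension keeps both versions of the
# overlapping x=1 value instead of silently discarding one; the x axes
# are outer-joined, with the gaps filled by NaN:
stacked = xr.combine_nested([ds0, ds1], concat_dim="new_dim")
print(dict(stacked.sizes))  # new_dim: 2, x: 3
```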

@shoyer (Member) commented May 19, 2020

> That was actually deliberate, xr.combine_by_coords is only checking the first value of each coord is different, to avoid loading big coordinates into memory. (see this line) As the first y value is 0 in both cases it's just saying "we have a match!" and overwriting.

We already have the coordinates loaded into memory at this point -- each element of indexes is a pandas.Index.

Looking at the first values makes sense for determining the order, but doesn't guarantee that the datasets are safe to concatenate. I think we are missing another safety check: verifying indexes[i][-1] <= indexes[i+1][0] for all indexes in order, in a way that handles ties correctly.

In my opinion, xarray's combine functions like combine_by_coords should never overwrite values unless an unsafe option was explicitly chosen.
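A sketch of that pairwise safety check (`check_no_overlap` is a hypothetical helper, not an existing xarray function, and tie handling is left out):

```python
import pandas as pd

def check_no_overlap(indexes):
    """Hypothetical check: each index must end no later than the next one
    starts (ties at the boundary would need separate handling)."""
    for left, right in zip(indexes[:-1], indexes[1:]):
        if left[-1] > right[0]:
            raise ValueError(f"overlapping indexes: {left[-1]} > {right[0]}")

check_no_overlap([pd.Index([0, 1, 2]), pd.Index([3, 4])])  # passes

try:
    # the case from this issue: both y coordinates start at 0
    check_no_overlap([pd.Index([0, 1, 2, 3, 4]), pd.Index([0, 1, 2, 3, 4, 5])])
except ValueError as err:
    print(err)  # overlapping indexes: 4 > 0
```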

@malmans2 (Contributor, Author)

If indexes[i] = [1, 5] and indexes[i+1] = [2, 3, 4], wouldn't indexes[i][-1] <= indexes[i+1][0] raise an error even if all indexes are different?

What about something like this? I think it would cover all possibilities, but maybe it is too expensive?

```python
if not indexes[0].append(indexes[1:]).is_unique:
    raise ValueError
```
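The concern can be shown with plain pandas indexes (made-up values):

```python
import pandas as pd

left, right = pd.Index([1, 5]), pd.Index([2, 3, 4])

# The pairwise end/start comparison flags these as overlapping ...
print(left[-1] <= right[0])          # False: 5 > 2

# ... even though no value is actually duplicated:
print(left.append(right).is_unique)  # True
```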

@malmans2 (Contributor, Author)

Nevermind, it looks like if the check goes into _infer_concat_order_from_coords it won't affect combine_nested. So indexes[i][-1] <= indexes[i+1][0] should work.

@TomNicholas (Member)

> So indexes[i][-1] <= indexes[i+1][0] should work.

@malmans2 are you interested in submitting a pull request to add this? (If not then that's fine!)

@malmans2 (Contributor, Author) commented May 25, 2020

Yup, happy to do it.

Just one doubt: I think in cases where indexes[i][-1] == indexes[i+1][0], the concatenation should be consistent with the compat argument used for merge (not sure if you all agree on this). I don't know the backend though, so the easiest thing I can think of is to run merge on the boundary elements to trigger the exact same checks:

```python
xr.merge([datasets[i].isel(dim=-1), datasets[i + 1].isel(dim=0)], compat=compat)
```
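Roughly, for two datasets whose boundary values coincide, that check might look like this sketch (ds0/ds1 and the x dimension are made-up stand-ins; in the real code the dimension name and compat value would come from the caller):

```python
import xarray as xr

# Two datasets that touch at x = 1 with *agreeing* values there:
ds0 = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": [0, 1]})
ds1 = xr.Dataset({"foo": ("x", [2.0, 3.0])}, coords={"x": [1, 2]})

# Reuse merge's compat machinery on just the boundary elements; with
# compat="no_conflicts" this passes because foo agrees at x = 1, and
# it would raise MergeError if the boundary values disagreed:
boundary = xr.merge([ds0.isel(x=-1), ds1.isel(x=0)], compat="no_conflicts")
print(float(boundary["foo"]))  # 2.0
```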

@stale (bot) commented Apr 28, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Apr 28, 2022
@dcherian dcherian added topic-combine combine/concat/merge and removed stale labels Apr 28, 2022