Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter defaults for concat, combine, open_mfdataset #8778

Open
dcherian opened this issue Feb 22, 2024 · 4 comments
Open

Stricter defaults for concat, combine, open_mfdataset #8778

dcherian opened this issue Feb 22, 2024 · 4 comments
Labels
enhancement topic-combine combine/concat/merge

Comments

@dcherian
Copy link
Contributor

dcherian commented Feb 22, 2024

Is your feature request related to a problem?

The defaults for concat are excessively permissive: data_vars="all", coords="different", compat="no_conflicts", join="outer". This comment illustrates why this can be hard to predict or understand: a seemingly unrelated option decode_cf controls whether a variable is in data_vars or coords, and can result in wildly different concatenation behaviour.

  1. This always concatenates data_vars along concat_dim even if they did not have that dimension to begin with.
  2. If the same coordinate var exists in different datasets/files, they will be sequentially compared for equality to decide whether they get concatenated.
  3. The outer join (applied along all dimensions that are not concat_dim) can result in very large datasets due to small floating points differences in the indexes, and also questionable behaviour with staggered grid datasets.
  4. "no_conflicts" basically picks the first not-NaN value after aligning all datasets, but is quite slow (we should be using duck_array_ops.nanfirst here I think).

While "convenient" this really just makes the default experience quite bad with hard-to-understand slowdowns.

Describe the solution you'd like

I propose we migrate to data_vars="minimal", coords="minimal", join="exact", compat="override". This should

  1. only concatenate data_vars and coords variables when they already have concat_dim.
  2. For any variables that do not have concat_dim, it will blindly pick them from the first file.
  3. join="exact" will prevent ballooning of dimension sizes due to floating point inequalities.
  4. These options will totally avoid any data reads unless explicitly requested by the user.

Unfortunately, this has a pretty big blast radius so we'd need a long deprecation cycle.

Describe alternatives you've considered

No response

Additional context

xref #4824
xref #1385
xref #8231
xref #5381
xref #2064
xref #2217

@TomNicholas
Copy link
Member

TomNicholas commented Feb 22, 2024

Thanks for raising this @dcherian. I support this, along with some effort to improve the docstrings/documentation/faq to make the default behaviour and potential pitfalls clearer.

What would a deprecation cycle for this look like? We're talking about changing the default values of 4 different keyword arguments, that get used in open_mfdataset, concat, combine_nested and combine_by_coords. Should we rip the bandaid off and change them all at once?

@TomNicholas TomNicholas added the topic-combine combine/concat/merge label Feb 22, 2024
@dcherian
Copy link
Contributor Author

Should we rip the bandaid off and change them all at once?

+1. it would be massively confusing otherwise.

@jsignell
Copy link
Contributor

Regarding deprecation cycle. I think something like:

  • add a setting option that turns on the new defaults so that people can opt in prematurely
  • If user does not specify each of those 4 kwargs raise a warning indicating the behavior will change in the future and suggest they use the setting option to opt in.
  • If user does specify those 4 kwargs -> no warning
  • If user has opted in to the setting option -> no warning

@max-sixty
Copy link
Collaborator

Broadly agree with @jsignell, one dissent:

If user does not specify each of those 4 kwargs raise a warning indicating the behavior will change in the future and suggest they use the setting option to opt in.

...iff the behavior from xarray would be different.

Otherwise this will be really noisy. Requires some work to assess whether the behavior will change in the concat call, but I think worthwhile / the cost of making the change...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement topic-combine combine/concat/merge
Projects
None yet
Development

No branches or pull requests

4 participants