-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve order of variables in combine_by_coords #9070
base: main
Are you sure you want to change the base?
Conversation
I think the reason I originally put the sort call in there was because in the
But if it seems to work without that then I guess it's fine? |
I think this was from times where we had to deal with unsorted dict. At least that was what I understood from a previous comment at that code position. |
Now, in light of this... But somehow it works. |
OK, let me add some more testing to be sure this works in datasets of any order. |
the reason |
Thanks @keewis. AFAICT, the sorting before groupby makes sure that all Datasets with same variables are grouped together regardless of the variable order within each Dataset. Update: And the position in the object list. Removing the sorting will result in more groups but doesn't break the tests. Does that mean we are undertesting?. Or is it just fixed by the subsequent merge? The issue which should be solved by this PR is that this sorting rearranges the input objects and when using |
After reading on itertools and collections I've found that it's possible to change itertools.groupby with a collections.defaultdict implementation to preserve order and with some intermediate performance gain: import xarray as xr
import collections
import itertools
def vars_as_keys(ds):
return tuple(sorted(ds))
def groupby_defaultdict(iter, key=lambda x: x):
idx = collections.defaultdict(list)
for i, obj in enumerate(iter):
idx[key(obj)].append(i)
for k, ix in idx.items():
yield k, (iter[i] for i in ix)
def groupby_itertools(iter, key=lambda x: x):
iter = sorted(iter, key=vars_as_keys)
return itertools.groupby(iter, key=vars_as_keys)
x1 = xr.Dataset({"a": (("y", "x"), [[1]]),
"c": (("y", "x"), [[1]]),
"b": (("y", "x"), [[1]]),},
coords={"y": [0], "x": [0]})
x2 = xr.Dataset({"d": (("y", "x"), [[1]]),
"a": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [0]})
x3 = xr.Dataset({"a": (("y", "x"), [[3]]),
"d": (("y", "x"), [[2]]),},
coords={"y": [0], "x": [1]})
data_objects = [x2, x1, x3] %%timeit
grouped_by_vars = groupby_defaultdict(data_objects, key=vars_as_keys)
274 ns ± 0.934 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) %%timeit
grouped_by_vars = groupby_itertools(data_objects, key=vars_as_keys)
4.98 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) |
It looks like this can be replaced here too: Lines 258 to 261 in 447e5a3
|
whats-new.rst