also apply combine_attrs to the attrs of the variables #4902

Merged (31 commits, May 5, 2021)

Changes shown are from 24 of the 31 commits.

Commits
28b857e
enable the tests
keewis Feb 12, 2021
4ab0013
clear all attrs first
keewis Feb 12, 2021
c53c71a
implement the merging of variable attrs when merging
keewis Feb 12, 2021
df8354b
Merge branch 'master' into variable-combine_attrs
keewis Feb 12, 2021
a0a2265
implement the merging of variable attrs when concatenating
keewis Feb 12, 2021
9af2c3f
add tests for combine_attrs on variables with combine_*
keewis Feb 13, 2021
78f9ef9
modify tests in test_combine which have wrong expectation about attrs
keewis Feb 13, 2021
77d0208
move the attrs preserving into the try / except
keewis Feb 13, 2021
b234ddf
clear variable attrs where necessary
keewis Feb 13, 2021
0cbf1a4
rewrite the main object merge_attrs tests
keewis Feb 13, 2021
71d605a
clear the variable attrs first
keewis Feb 13, 2021
fb1fcb9
add combine_attrs="no_conflict"
keewis Feb 13, 2021
dad9c38
Merge branch 'master' into variable-combine_attrs
keewis Feb 27, 2021
3ee536d
add a entry to whats-new.rst
keewis Feb 27, 2021
f03c5e8
Merge branch 'master' into variable-combine_attrs
keewis Mar 28, 2021
78a3efd
Merge branch 'master' into variable-combine_attrs
keewis Apr 3, 2021
abacdc9
Update doc/whats-new.rst
keewis Apr 19, 2021
b4b64c1
dedent a hidden test
keewis Apr 5, 2021
cd3fd0e
fix whats-new.rst
keewis Apr 19, 2021
8ca5e44
conditionally exclude attrs
keewis Apr 20, 2021
34d6615
use add_attrs=False or construct manually instead of clearing attrs
keewis Apr 20, 2021
8ca3aeb
also merge attrs for indexed variables
keewis Apr 20, 2021
6e06dc4
use pytest.raises instead of raises_regex
keewis Apr 20, 2021
2c5c1be
Merge branch 'master' into variable-combine_attrs
keewis Apr 20, 2021
cc08b53
switch the default for merge's combine_attrs to override
keewis Apr 29, 2021
94c7896
Merge branch 'master' into variable-combine_attrs
keewis Apr 29, 2021
dffda43
use pytest.raises
keewis Apr 29, 2021
db0fc56
update whats-new.rst [skip-ci]
keewis Apr 29, 2021
a1f1dda
fix whats-new.rst [skip-ci]
keewis May 1, 2021
59f0732
Merge branch 'master' into variable-combine_attrs
keewis May 5, 2021
065e757
provide more context for the change of the default value [skip-ci]
keewis May 5, 2021
doc/whats-new.rst (4 changes: 3 additions & 1 deletion)
@@ -22,7 +22,9 @@ v0.17.1 (unreleased)

New Features
~~~~~~~~~~~~

- apply ``combine_attrs`` on data variables and coordinate variables when concatenating
and merging datasets and dataarrays (:pull:`4902`).
By `Justus Magin <https://github.com/keewis>`_.
- Add :py:meth:`Dataset.query` and :py:meth:`DataArray.query` which enable indexing
of datasets and data arrays by evaluating query expressions against the values of the
data variables (:pull:`4984`). By `Alistair Miles <https://github.com/alimanfoo>`_.
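
To illustrate the whats-new entry above, here is a minimal sketch of the new behaviour. The dataset contents and attribute values are made up for the example; it assumes the post-merge behaviour described in the entry.

import xarray as xr

# two datasets whose variable attrs partially agree
ds1 = xr.Dataset({"a": ("x", [1, 2], {"units": "m", "source": "station-1"})})
ds2 = xr.Dataset({"a": ("x", [3, 4], {"units": "m", "source": "station-2"})})

# with this change, combine_attrs also applies to the attrs of "a" itself:
# "units" is kept (identical values), "source" is dropped (conflicting values)
combined = xr.concat([ds1, ds2], dim="x", combine_attrs="drop_conflicts")
print(combined["a"].attrs)  # expected: {'units': 'm'}
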
xarray/core/concat.py (4 changes: 2 additions & 2 deletions)
@@ -508,7 +508,7 @@ def ensure_common_dims(vars):
vars = ensure_common_dims([ds[k].variable for ds in datasets])
except KeyError:
raise ValueError("%r is not present in all datasets." % k)
combined = concat_vars(vars, dim, positions)
combined = concat_vars(vars, dim, positions, combine_attrs=combine_attrs)
assert isinstance(combined, Variable)
result_vars[k] = combined
elif k in result_vars:
@@ -572,7 +572,7 @@ def _dataarray_concat(
positions,
fill_value=fill_value,
join=join,
combine_attrs="drop",
combine_attrs=combine_attrs,
)

merged_attrs = merge_attrs([da.attrs for da in arrays], combine_attrs)
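
As a hedged illustration of the _dataarray_concat change above (example values are invented; this is a sketch of the intended behaviour, not output copied from a test run):

import xarray as xr

x1 = xr.DataArray([1, 2], dims="x", coords={"x": ("x", [0, 1], {"axis": "X"})})
x2 = xr.DataArray([3, 4], dims="x", coords={"x": ("x", [2, 3], {"axis": "X"})})

# the variable-level concat no longer hard-codes combine_attrs="drop", so the
# attrs of the coordinate "x" now follow the strategy passed to xr.concat
out = xr.concat([x1, x2], dim="x", combine_attrs="no_conflicts")
print(out["x"].attrs)  # expected: {'axis': 'X'}
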
xarray/core/merge.py (12 changes: 11 additions & 1 deletion)
@@ -164,6 +164,7 @@ def merge_collected(
grouped: Dict[Hashable, List[MergeElement]],
prioritized: Mapping[Hashable, MergeElement] = None,
compat: str = "minimal",
combine_attrs="override",
) -> Tuple[Dict[Hashable, Variable], Dict[Hashable, pd.Index]]:
"""Merge dicts of variables, while resolving conflicts appropriately.

@@ -222,11 +223,18 @@
% (name, variable.attrs, other_variable.attrs)
)
merged_vars[name] = variable
merged_vars[name].attrs = merge_attrs(
[var.attrs for var, _ in indexed_elements],
combine_attrs=combine_attrs,
)
merged_indexes[name] = index
else:
variables = [variable for variable, _ in elements_list]
try:
merged_vars[name] = unique_variable(name, variables, compat)
merged_vars[name].attrs = merge_attrs(
[var.attrs for var in variables], combine_attrs=combine_attrs
)
except MergeError:
if compat != "minimal":
# we need more than "minimal" compatibility (for which
@@ -613,7 +621,9 @@ def merge_core(
collected = collect_variables_and_indexes(aligned)

prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)
variables, out_indexes = merge_collected(collected, prioritized, compat=compat)
variables, out_indexes = merge_collected(
collected, prioritized, compat=compat, combine_attrs=combine_attrs
)
assert_unique_multiindex_level_names(variables)

dims = calculate_dimensions(variables)
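
For orientation, the combine_attrs strategies that merge_collected now forwards to merge_attrs behave roughly as in the simplified sketch below. This is not xarray's actual implementation: the real merge_attrs lives in xarray/core/merge.py, raises MergeError with more detailed messages, and compares values with an array-aware equivalence check rather than ==.

def merge_attrs_sketch(variable_attrs, combine_attrs="override"):
    """Simplified illustration of the combine_attrs strategies (not xarray's code)."""
    if not variable_attrs:
        return {}
    if combine_attrs == "drop":
        return {}
    if combine_attrs == "override":
        # skip comparing, copy the attrs of the first object
        return dict(variable_attrs[0])
    if combine_attrs == "identical":
        first = dict(variable_attrs[0])
        for attrs in variable_attrs[1:]:
            if dict(attrs) != first:
                raise ValueError("combine_attrs='identical', but attrs differ")
        return first
    if combine_attrs == "no_conflicts":
        result = {}
        for attrs in variable_attrs:
            for key, value in attrs.items():
                # note: xarray uses an array-aware "equivalent" comparison here
                if key in result and result[key] != value:
                    raise ValueError(f"conflicting values for attr {key!r}")
                result[key] = value
        return result
    if combine_attrs == "drop_conflicts":
        result, dropped = {}, set()
        for attrs in variable_attrs:
            for key, value in attrs.items():
                if key in dropped:
                    continue
                if key in result and result[key] != value:
                    del result[key]
                    dropped.add(key)
                else:
                    result[key] = value
        return result
    raise ValueError(f"unknown combine_attrs: {combine_attrs!r}")

# example: the "drop_conflicts" case exercised by the tests further down
print(merge_attrs_sketch([{"a": 1, "b": 2, "c": 3}, {"b": 1, "c": 3, "d": 4}],
                         combine_attrs="drop_conflicts"))
# expected: {'a': 1, 'c': 3, 'd': 4}
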
xarray/core/variable.py (67 changes: 59 additions & 8 deletions)
@@ -1756,7 +1756,14 @@ def reduce(
return Variable(dims, data, attrs=attrs)

@classmethod
def concat(cls, variables, dim="concat_dim", positions=None, shortcut=False):
def concat(
cls,
variables,
dim="concat_dim",
positions=None,
shortcut=False,
combine_attrs="override",
):
"""Concatenate variables along a new or existing dimension.

Parameters
@@ -1779,13 +1786,27 @@ def concat(cls, variables, dim="concat_dim", positions=None, shortcut=False):
This option is used internally to speed-up groupby operations.
If `shortcut` is True, some checks of internal consistency between
arrays to concatenate are skipped.
combine_attrs : {"drop", "identical", "no_conflicts", "drop_conflicts", \
"override"}, default: "override"
String indicating how to combine attrs of the objects being merged:

- "drop": empty attrs on returned Dataset.
- "identical": all attrs must be the same on every object.
- "no_conflicts": attrs from all objects are combined, any that have
the same name must also have the same value.
- "drop_conflicts": attrs from all objects are combined, any that have
the same name but different values are dropped.
- "override": skip comparing and copy attrs from the first dataset to
the result.

Returns
-------
stacked : Variable
Concatenated Variable formed by stacking all the supplied variables
along the given dimension.
"""
from .merge import merge_attrs

if not isinstance(dim, str):
(dim,) = dim.dims

@@ -1810,7 +1831,9 @@ def concat(cls, variables, dim="concat_dim", positions=None, shortcut=False):
dims = (dim,) + first_var.dims
data = duck_array_ops.stack(arrays, axis=axis)

attrs = dict(first_var.attrs)
attrs = merge_attrs(
[var.attrs for var in variables], combine_attrs=combine_attrs
)
encoding = dict(first_var.encoding)
if not shortcut:
for var in variables:
@@ -2565,12 +2588,21 @@ def __setitem__(self, key, value):
raise TypeError("%s values cannot be modified" % type(self).__name__)

@classmethod
def concat(cls, variables, dim="concat_dim", positions=None, shortcut=False):
def concat(
cls,
variables,
dim="concat_dim",
positions=None,
shortcut=False,
combine_attrs="override",
):
"""Specialized version of Variable.concat for IndexVariable objects.

This exists because we want to avoid converting Index objects to NumPy
arrays, if possible.
"""
from .merge import merge_attrs

if not isinstance(dim, str):
(dim,) = dim.dims

@@ -2597,12 +2629,13 @@ def concat(cls, variables, dim="concat_dim", positions=None, shortcut=False):
# keep as str if possible as pandas.Index uses object (converts to numpy array)
data = maybe_coerce_to_str(data, variables)

attrs = dict(first_var.attrs)
attrs = merge_attrs(
[var.attrs for var in variables], combine_attrs=combine_attrs
)
if not shortcut:
for var in variables:
if var.dims != first_var.dims:
raise ValueError("inconsistent dimensions")
utils.remove_incompatible_items(attrs, var.attrs)

return cls(first_var.dims, data, attrs)

@@ -2776,7 +2809,13 @@ def _broadcast_compat_data(self, other):
return self_data, other_data, dims


def concat(variables, dim="concat_dim", positions=None, shortcut=False):
def concat(
variables,
dim="concat_dim",
positions=None,
shortcut=False,
combine_attrs="override",
):
"""Concatenate variables along a new or existing dimension.

Parameters
@@ -2799,6 +2838,18 @@ def concat(variables, dim="concat_dim", positions=None, shortcut=False):
This option is used internally to speed-up groupby operations.
If `shortcut` is True, some checks of internal consistency between
arrays to concatenate are skipped.
combine_attrs : {"drop", "identical", "no_conflicts", "drop_conflicts", \
"override"}, default: "override"
String indicating how to combine attrs of the objects being merged:

- "drop": empty attrs on returned Dataset.
- "identical": all attrs must be the same on every object.
- "no_conflicts": attrs from all objects are combined, any that have
the same name must also have the same value.
- "drop_conflicts": attrs from all objects are combined, any that have
the same name but different values are dropped.
- "override": skip comparing and copy attrs from the first dataset to
the result.

Returns
-------
@@ -2808,9 +2859,9 @@ def concat(variables, dim="concat_dim", positions=None, shortcut=False):
"""
variables = list(variables)
if all(isinstance(v, IndexVariable) for v in variables):
return IndexVariable.concat(variables, dim, positions, shortcut)
return IndexVariable.concat(variables, dim, positions, shortcut, combine_attrs)
else:
return Variable.concat(variables, dim, positions, shortcut)
return Variable.concat(variables, dim, positions, shortcut, combine_attrs)


def assert_unique_multiindex_level_names(variables):
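
A small usage sketch of the new keyword on the public Variable class; the data and attribute values are invented for the example.

import xarray as xr

v1 = xr.Variable("x", [1, 2], attrs={"units": "m", "history": "run-1"})
v2 = xr.Variable("x", [3, 4], attrs={"units": "m", "history": "run-2"})

# the default combine_attrs="override" keeps the old behaviour (copy the first
# variable's attrs); other strategies are now applied to the variable attrs
stacked = xr.Variable.concat([v1, v2], dim="x", combine_attrs="drop_conflicts")
print(stacked.attrs)  # expected: {'units': 'm'}
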
xarray/tests/test_combine.py (154 changes: 149 additions & 5 deletions)
@@ -5,7 +5,14 @@
import numpy as np
import pytest

from xarray import DataArray, Dataset, combine_by_coords, combine_nested, concat
from xarray import (
DataArray,
Dataset,
MergeError,
combine_by_coords,
combine_nested,
concat,
)
from xarray.core import dtypes
from xarray.core.combine import (
_check_shape_tile_ids,
@@ -476,7 +483,8 @@ def test_concat_name_symmetry(self):
assert_identical(x_first, y_first)

def test_concat_one_dim_merge_another(self):
data = create_test_data()
data = create_test_data(add_attrs=False)

data1 = data.copy(deep=True)
data2 = data.copy(deep=True)

@@ -502,7 +510,7 @@ def test_auto_combine_2d(self):
assert_equal(result, expected)

def test_auto_combine_2d_combine_attrs_kwarg(self):
ds = create_test_data
ds = lambda x: create_test_data(x, add_attrs=False)

partway1 = concat([ds(0), ds(3)], dim="dim1")
partway2 = concat([ds(1), ds(4)], dim="dim1")
@@ -675,8 +683,8 @@ def test_combine_by_coords(self):
with pytest.raises(ValueError, match=r"Every dimension needs a coordinate"):
combine_by_coords(objs)

def test_empty_input(self):
assert_identical(Dataset(), combine_by_coords([]))
def test_empty_input(self):
assert_identical(Dataset(), combine_by_coords([]))

@pytest.mark.parametrize(
"join, expected",
@@ -754,6 +762,142 @@ def test_combine_nested_combine_attrs_drop_conflicts(self):
)
assert_identical(expected, actual)

@pytest.mark.parametrize(
"combine_attrs, attrs1, attrs2, expected_attrs, expect_exception",
[
(
"no_conflicts",
{"a": 1, "b": 2},
{"a": 1, "c": 3},
{"a": 1, "b": 2, "c": 3},
False,
),
("no_conflicts", {"a": 1, "b": 2}, {}, {"a": 1, "b": 2}, False),
("no_conflicts", {}, {"a": 1, "c": 3}, {"a": 1, "c": 3}, False),
(
"no_conflicts",
{"a": 1, "b": 2},
{"a": 4, "c": 3},
{"a": 1, "b": 2, "c": 3},
True,
),
("drop", {"a": 1, "b": 2}, {"a": 1, "c": 3}, {}, False),
("identical", {"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 1, "b": 2}, False),
("identical", {"a": 1, "b": 2}, {"a": 1, "c": 3}, {"a": 1, "b": 2}, True),
(
"override",
{"a": 1, "b": 2},
{"a": 4, "b": 5, "c": 3},
{"a": 1, "b": 2},
False,
),
(
"drop_conflicts",
{"a": 1, "b": 2, "c": 3},
{"b": 1, "c": 3, "d": 4},
{"a": 1, "c": 3, "d": 4},
False,
),
],
)
def test_combine_nested_combine_attrs_variables(
self, combine_attrs, attrs1, attrs2, expected_attrs, expect_exception
):
"""check that combine_attrs is used on data variables and coords"""
data1 = Dataset(
{
"a": ("x", [1, 2], attrs1),
"b": ("x", [3, -1], attrs1),
"x": ("x", [0, 1], attrs1),
}
)
data2 = Dataset(
{
"a": ("x", [2, 3], attrs2),
"b": ("x", [-2, 1], attrs2),
"x": ("x", [2, 3], attrs2),
}
)

if expect_exception:
with raises_regex(MergeError, "combine_attrs"):
combine_by_coords([data1, data2], combine_attrs=combine_attrs)
else:
actual = combine_by_coords([data1, data2], combine_attrs=combine_attrs)
expected = Dataset(
{
"a": ("x", [1, 2, 2, 3], expected_attrs),
"b": ("x", [3, -1, -2, 1], expected_attrs),
},
{"x": ("x", [0, 1, 2, 3], expected_attrs)},
)

assert_identical(actual, expected)

@pytest.mark.parametrize(
"combine_attrs, attrs1, attrs2, expected_attrs, expect_exception",
[
(
"no_conflicts",
{"a": 1, "b": 2},
{"a": 1, "c": 3},
{"a": 1, "b": 2, "c": 3},
False,
),
("no_conflicts", {"a": 1, "b": 2}, {}, {"a": 1, "b": 2}, False),
("no_conflicts", {}, {"a": 1, "c": 3}, {"a": 1, "c": 3}, False),
(
"no_conflicts",
{"a": 1, "b": 2},
{"a": 4, "c": 3},
{"a": 1, "b": 2, "c": 3},
True,
),
("drop", {"a": 1, "b": 2}, {"a": 1, "c": 3}, {}, False),
("identical", {"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 1, "b": 2}, False),
("identical", {"a": 1, "b": 2}, {"a": 1, "c": 3}, {"a": 1, "b": 2}, True),
(
"override",
{"a": 1, "b": 2},
{"a": 4, "b": 5, "c": 3},
{"a": 1, "b": 2},
False,
),
(
"drop_conflicts",
{"a": 1, "b": 2, "c": 3},
{"b": 1, "c": 3, "d": 4},
{"a": 1, "c": 3, "d": 4},
False,
),
],
)
def test_combine_by_coords_combine_attrs_variables(
self, combine_attrs, attrs1, attrs2, expected_attrs, expect_exception
):
"""check that combine_attrs is used on data variables and coords"""
data1 = Dataset(
{"x": ("a", [0], attrs1), "y": ("a", [0], attrs1), "a": ("a", [0], attrs1)}
[inline review comment on the line above]

Contributor: should we also check for Dataset.attrs?

keewis (Collaborator, Author), Apr 19, 2021: I had thought we already did that. Turns out we don't, which I didn't realize because that file is in urgent need of a refactor: some test names still reference the removed auto_combine, combine_by_coords and combine_nested are ordered in a way I can't figure out, and I accidentally contributed to the chaos by adding all new tests to TestCombineAuto. Since this seems like a bigger effort, I'll try to send in a PR which cleans this up.
)
data2 = Dataset(
{"x": ("a", [1], attrs2), "y": ("a", [1], attrs2), "a": ("a", [1], attrs2)}
)

if expect_exception:
with raises_regex(MergeError, "combine_attrs"):
combine_by_coords([data1, data2], combine_attrs=combine_attrs)
else:
actual = combine_by_coords([data1, data2], combine_attrs=combine_attrs)
expected = Dataset(
{
"x": ("a", [0, 1], expected_attrs),
"y": ("a", [0, 1], expected_attrs),
"a": ("a", [0, 1], expected_attrs),
}
)

assert_identical(actual, expected)

def test_infer_order_from_coords(self):
data = create_test_data()
objs = [data.isel(dim2=slice(4, 9)), data.isel(dim2=slice(4))]
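
Finally, a condensed version of what the new tests above check, as a runnable sketch. The attribute names are invented, and the expected output assumes combine_attrs="override" copies the attrs of the first dataset, as documented.

import xarray as xr

attrs1 = {"units": "m", "long_name": "height"}
attrs2 = {"units": "m"}

data1 = xr.Dataset({"x": ("a", [0], attrs1), "a": ("a", [0], attrs1)})
data2 = xr.Dataset({"x": ("a", [1], attrs2), "a": ("a", [1], attrs2)})

# the attrs of the data variable "x" and the coordinate "a" are now combined
# with the requested strategy, not just the Dataset-level attrs
merged = xr.combine_by_coords([data1, data2], combine_attrs="override")
print(merged["x"].attrs)  # expected: {'units': 'm', 'long_name': 'height'}
print(merged["a"].attrs)  # expected: {'units': 'm', 'long_name': 'height'}
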