-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Options to binary ops kwargs #1065
Options to binary ops kwargs #1065
Conversation
Let's find a more descriptive name for this option than merely Also please add documentation notes about this new functionality to this section of the docs and the "what's new" page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your tests look like they cover the bases, but they are a bit long. More minimal (e.g., 1D array) test cases would be better because they have a higher signal to noise ratio. Another option would be reuse existing test datasets (e.g., create_data_test()
), since the actual values don't matter too much here. We already have plenty of other tests that verify that align
works properly on higher dimensional data, so there's value in keeping these simple.
@functools.wraps(f) | ||
def func(self, other): | ||
if isinstance(other, (Dataset, groupby.GroupBy)): | ||
return NotImplemented | ||
if hasattr(other, 'indexes'): | ||
self, other = align(self, other, join=join, copy=False) | ||
# if user does not specify join, default to OPTIONS['join'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment just states what the code does, so it would be better to leave it out
@functools.wraps(f) | ||
def func(self, other): | ||
if isinstance(other, (Dataset, groupby.GroupBy)): | ||
return NotImplemented | ||
if hasattr(other, 'indexes'): | ||
self, other = align(self, other, join=join, copy=False) | ||
# if user does not specify join, default to OPTIONS['join'] | ||
how_to_join = join or OPTIONS['join'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use if join is None
instead of implicitly testing join
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would trigger an UnboundLocalError:
if join is None: join = OPTIONS['join']
because join is not defined in the local scope prior to that line.
So the choice is either
1.) if join is None:
how_to_join = OPTIONS['join']
else:
how_to_join = join
or
2.) how_to_join = join or OPTIONS['join']
I opted for (2) for simplicity. Is the style of (1) generally preferred here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do the if
/else
inline, if you prefer:
how_to_join = OPTIONS['join'] if join is None else join
But yes, I do prefer explicitly checking against None
for default values rather than relying on implicit falseness of None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny, your suggestion is the same as my first edit to the code, but then my teammates suggested I should use the other options.
Nomenclature is not my forte. I'll make the default
OPTIONS = {'display_width': 80, 'align_type': "inner"}
and change your suggestion to the following
align_type = OPTIONS['align_type'] if join is None else join
self, other = align(self, other, join=align_type, copy=False)
if there's no objection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also align in other operations, e.g., merge
, for which the default is to do an outer join. So the option name should be specific to computation/arithmetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary_op_align_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright, I'm using arithmetic_join
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think arithmetic_join
or compute_join
is probably the way to go. We will probably also use this when/if we ever add ternary operations, too.
A test method to verify the ability to set binary operation join kwarg | ||
("inner", "outer", "left", "right") via xr.set_options(). | ||
""" | ||
# First we set up a data array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sort of comments (repeating the code) just add noise
# create another data array with the last z slice missing | ||
missing_last_z = arr[:,:,0:-1].copy() | ||
# because the default in OPTIONS is join="inner", we test "outer" first | ||
xr.set_options(join="outer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the context manager for tests instead (e.g, with xr.set_options(join='outer'):
. It cleans itself up if an exception is raised in the middle of the test, which makes it more hygenic. Otherwise, if the next assert fails it would also trigger a bunch of other failing tests when the option doesn't get reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
# because the default in OPTIONS is join="inner", we test "outer" first | ||
xr.set_options(join="outer") | ||
result = missing_last_x + missing_last_z | ||
self.assertTrue(result.shape == arr.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of checking specific properties, construct the full expected result (e.g., by calling align
with join='outer'
on the inputs) and use assertDataArrayEqual
to verify that expect result is the result you get.
self.assertTrue(result.foo.notnull().all()) | ||
self.assertFalse('c' in list(result.foo['x'])) | ||
self.assertFalse(2 in list(result.foo['z'])) | ||
# just for kicks, what happens when the dataarrays have different names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have tests for similar behavior -- see test_dataset_math_auto_align
. This might belong there (more tests are great!), but it doesn't belong here.
It does raise the interesting question of what the behavior should be when join='outer'
for Dataset objects with different data variables. Should the result now have the union of data variable names? If so, this will take a bit more retooling. (This would be easier after #964 is merged)
|
||
def test_binary_op_join_setting(self): | ||
""" | ||
A test method to verify the ability to set binary operation join kwarg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we don't put docstrings on test methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Will remove.
Out of curiosity, how do I test inner-join vs. outer-join using only 1-D arrays? I can certainly simplify the current tests to be 2D + 1D. |
Set coordinates on the 1D -- something like [0,1,2] for the first, [1,2,3]
|
…ed to computation.rst and whats-new.rst
dim = 'x' | ||
align_type = "outer" | ||
coords_l, coords_r = [0, 1, 2], [1, 2, 3] | ||
missing_0 = xr.DataArray(coords_l, [(dim, coords_l)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These names look backwards? missing_0
currently has labels [0,1,2].
experimental = missing_0 + missing_3 | ||
missing_0_aligned, missing_3_aligned =\ | ||
xr.align(missing_0, missing_3, join=align_type) | ||
control = missing_0_aligned + missing_3_aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also explicitly set control = xr.DataArray([np.nan, 2, 4, np.nan], [(dim, [0, 1, 2, 3])])
, which is actually even better than using align and arithmetic here
missing_0 = xr.DataArray(coords_l, [(dim, coords_l)]) | ||
missing_3 = xr.DataArray(coords_r, [(dim, coords_r)]) | ||
with xr.set_options(arithmetic_join=align_type): | ||
experimental = missing_0 + missing_3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop out of the set_options
block after this line.
|
||
.. ipython:: python | ||
|
||
xr.set_options(arithmetic_join="outer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you only show the example using with
?
I'm now wondering if we need add some mechanism to require using the context manager for arithmetic_join
. If users don't use a context manager, very mysterious/nonlocal bugs could appear when this changes the behavior of unrelated code (e.g., in an import statement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our use case we would set arithmetic_join="outer"
from the get-go and run all of our binary calculations like that throughout our simulation. We'd use that as our default, and maybe occasionally switch to other join types. Requiring context manager every time we deviate from "inner" seems onerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the permanent change example and left only the context manager example in there. I'm not sure what the best compromise is. Perhaps we could print some warning message when it's permanently set?
It occurs to me now that the other reason why people avoid global variables is because of concerns around thread safety and distributed computation. Even with the context manager, bad things will happen if you use I need to think about this a little more. Copious warnings in the docs might be enough, or maybe entering a On the other hand, the current implementation is basically identical to how functionality like NumPy's @mrocklin does this seem like the sort of change we can do in xarray without seriously regretting it later? Xarray uses basically the same |
I haven't noticed people getting caught by using It would be good to ensure that this affects the graph and that the actual tasks don't depend on the global state. |
Indeed, we are safe here. |
Just checking in. Is there something else that needs to be added to this PR before merge? |
My concern is how this setting currently does not change how we join variables in Dataset math. Consider the current behavior -- it always does an inner join of the data variables:
Logically, it makes sense for this option to switch this to an outer join instead. |
Alright, if that's desired, I'd just change code in |
@@ -26,6 +26,10 @@ class set_options(object): | |||
""" | |||
def __init__(self, **kwargs): | |||
self.old = OPTIONS.copy() | |||
for key in kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rebase or merge master -- I actually already added in something very similar
https://github.com/pydata/xarray/blob/master/xarray/core/options.py
elif fillna: | ||
# this shortcuts left alignment of variables for fillna | ||
|
||
for k in set(lhs_data_vars) & set(rhs_data_vars): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to maintain the order of data variables: should be lhs
first, then rhs
only variables.
So, probably should use a loop over lhs_data_vars
, followed by rhs_data_vars
.
for k in set(lhs_data_vars) & set(rhs_data_vars): | ||
dest_vars[k] = f(lhs_vars[k], rhs_vars[k]) | ||
if join in ["outer", "left"]: | ||
for k in set(lhs_data_vars) - set(rhs_data_vars): | ||
dest_vars[k] = lhs_vars[k] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was only the right default for fillna
. For other operations, should be NaN
instead (or, more specifically, the result of f(lhs_vars[k], np.nan)
.
self.assertDatasetEqual(actual, expected) | ||
|
||
with xr.set_options(arithmetic_join='outer'): | ||
expected = xr.Dataset({'foo':1, 'bar': 4, 'baz': 3}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted above, should be xr.Dataset({'foo': nan, 'bar': 4, 'baz': nan})
for consistent outer join semantics
if k in rhs_data_vars: | ||
dest_vars[k] = f(lhs_vars[k], rhs_vars[k]) | ||
elif join in ["left", "outer"]: | ||
dest_vars[k] = lhs_vars[k] if fillna else\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per PEP8, use parenthesis to join across multiple lines instead of \
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too much white space here -- by convention, leave only 2 blank lines
missing_0 = xr.DataArray(coords_r, [(dim, coords_r)]) | ||
with xr.set_options(arithmetic_join=align_type): | ||
actual = missing_0 + missing_3 | ||
missing_0_aligned, missing_3_aligned =\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line break after xr.align(
instead of using the explicit \
@@ -1,7 +1,8 @@ | |||
from __future__ import absolute_import | |||
from __future__ import division | |||
from __future__ import print_function | |||
OPTIONS = {'display_width': 80} | |||
OPTIONS = {'display_width': 80, | |||
'arithmetic_join': "inner"} | |||
|
|||
|
|||
class set_options(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update the docstring for this class?
Thank you @chunweiyuan! |
woohoo! |
Currently the default is join="inner". However, there can be applications where the majority of binary operations require join="outer", not "inner".
Addresses #1059
The solution we come up with is adding a default key/value to OPTIONS, and dynamically access/update it in code (_binary_op in both dataarray.py and dataset.py)
We've also added test modules in test_dataarray.py and test_dataset.py. All tests passed.