Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Options to binary ops kwargs #1065

Merged
merged 10 commits into from
Nov 12, 2016

Conversation

chunweiyuan
Copy link
Contributor

@chunweiyuan chunweiyuan commented Oct 27, 2016

Currently the default is join="inner". However, there can be applications where the majority of binary operations require join="outer", not "inner".

Addresses #1059

The solution we come up with is adding a default key/value to OPTIONS, and dynamically access/update it in code (_binary_op in both dataarray.py and dataset.py)

We've also added test modules in test_dataarray.py and test_dataset.py. All tests passed.

@shoyer
Copy link
Member

shoyer commented Oct 28, 2016

Let's find a more descriptive name for this option than merely join. Maybe: compute_join, compute_align, arithmetic_join, math_join, ...?

Also please add documentation notes about this new functionality to this section of the docs and the "what's new" page.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your tests look like they cover the bases, but they are a bit long. More minimal (e.g., 1D array) test cases would be better because they have a higher signal to noise ratio. Another option would be reuse existing test datasets (e.g., create_data_test()), since the actual values don't matter too much here. We already have plenty of other tests that verify that align works properly on higher dimensional data, so there's value in keeping these simple.

@functools.wraps(f)
def func(self, other):
if isinstance(other, (Dataset, groupby.GroupBy)):
return NotImplemented
if hasattr(other, 'indexes'):
self, other = align(self, other, join=join, copy=False)
# if user does not specify join, default to OPTIONS['join']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment just states what the code does, so it would be better to leave it out

@functools.wraps(f)
def func(self, other):
if isinstance(other, (Dataset, groupby.GroupBy)):
return NotImplemented
if hasattr(other, 'indexes'):
self, other = align(self, other, join=join, copy=False)
# if user does not specify join, default to OPTIONS['join']
how_to_join = join or OPTIONS['join']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use if join is None instead of implicitly testing join.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would trigger an UnboundLocalError:
if join is None: join = OPTIONS['join']
because join is not defined in the local scope prior to that line.

So the choice is either
1.) if join is None:
how_to_join = OPTIONS['join']
else:
how_to_join = join
or
2.) how_to_join = join or OPTIONS['join']

I opted for (2) for simplicity. Is the style of (1) generally preferred here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do the if/else inline, if you prefer:
how_to_join = OPTIONS['join'] if join is None else join

But yes, I do prefer explicitly checking against None for default values rather than relying on implicit falseness of None.

Copy link
Contributor Author

@chunweiyuan chunweiyuan Oct 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, your suggestion is the same as my first edit to the code, but then my teammates suggested I should use the other options.

Nomenclature is not my forte. I'll make the default
OPTIONS = {'display_width': 80, 'align_type': "inner"}
and change your suggestion to the following

align_type = OPTIONS['align_type'] if join is None else join
self, other = align(self, other, join=align_type, copy=False)

if there's no objection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also align in other operations, e.g., merge, for which the default is to do an outer join. So the option name should be specific to computation/arithmetic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary_op_align_type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, I'm using arithmetic_join....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think arithmetic_join or compute_join is probably the way to go. We will probably also use this when/if we ever add ternary operations, too.

A test method to verify the ability to set binary operation join kwarg
("inner", "outer", "left", "right") via xr.set_options().
"""
# First we set up a data array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sort of comments (repeating the code) just add noise

# create another data array with the last z slice missing
missing_last_z = arr[:,:,0:-1].copy()
# because the default in OPTIONS is join="inner", we test "outer" first
xr.set_options(join="outer")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the context manager for tests instead (e.g, with xr.set_options(join='outer'):. It cleans itself up if an exception is raised in the middle of the test, which makes it more hygenic. Otherwise, if the next assert fails it would also trigger a bunch of other failing tests when the option doesn't get reset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

# because the default in OPTIONS is join="inner", we test "outer" first
xr.set_options(join="outer")
result = missing_last_x + missing_last_z
self.assertTrue(result.shape == arr.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking specific properties, construct the full expected result (e.g., by calling align with join='outer' on the inputs) and use assertDataArrayEqual to verify that expect result is the result you get.

self.assertTrue(result.foo.notnull().all())
self.assertFalse('c' in list(result.foo['x']))
self.assertFalse(2 in list(result.foo['z']))
# just for kicks, what happens when the dataarrays have different names?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have tests for similar behavior -- see test_dataset_math_auto_align. This might belong there (more tests are great!), but it doesn't belong here.

It does raise the interesting question of what the behavior should be when join='outer' for Dataset objects with different data variables. Should the result now have the union of data variable names? If so, this will take a bit more retooling. (This would be easier after #964 is merged)


def test_binary_op_join_setting(self):
"""
A test method to verify the ability to set binary operation join kwarg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we don't put docstrings on test methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Will remove.

@chunweiyuan
Copy link
Contributor Author

Out of curiosity, how do I test inner-join vs. outer-join using only 1-D arrays? I can certainly simplify the current tests to be 2D + 1D.

@shoyer
Copy link
Member

shoyer commented Oct 28, 2016

Set coordinates on the 1D -- something like [0,1,2] for the first, [1,2,3]
for the second. Then the inner join should have [1,2] vs [0,1,2,3] for
outer join.
On Fri, Oct 28, 2016 at 12:00 PM chunweiyuan notifications@github.com
wrote:

Out of curiosity, how do I test inner-join vs. outer-join using only 1-D
arrays? I can certainly simplify the current tests to be 2D + 1D.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#1065 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABKS1qyX-HigWTUAJnBuCpR4AqNFzyYaks5q4kXigaJpZM4Ki4oh
.

dim = 'x'
align_type = "outer"
coords_l, coords_r = [0, 1, 2], [1, 2, 3]
missing_0 = xr.DataArray(coords_l, [(dim, coords_l)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names look backwards? missing_0 currently has labels [0,1,2].

experimental = missing_0 + missing_3
missing_0_aligned, missing_3_aligned =\
xr.align(missing_0, missing_3, join=align_type)
control = missing_0_aligned + missing_3_aligned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also explicitly set control = xr.DataArray([np.nan, 2, 4, np.nan], [(dim, [0, 1, 2, 3])]), which is actually even better than using align and arithmetic here

missing_0 = xr.DataArray(coords_l, [(dim, coords_l)])
missing_3 = xr.DataArray(coords_r, [(dim, coords_r)])
with xr.set_options(arithmetic_join=align_type):
experimental = missing_0 + missing_3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop out of the set_options block after this line.


.. ipython:: python

xr.set_options(arithmetic_join="outer")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you only show the example using with?

I'm now wondering if we need add some mechanism to require using the context manager for arithmetic_join. If users don't use a context manager, very mysterious/nonlocal bugs could appear when this changes the behavior of unrelated code (e.g., in an import statement).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For our use case we would set arithmetic_join="outer" from the get-go and run all of our binary calculations like that throughout our simulation. We'd use that as our default, and maybe occasionally switch to other join types. Requiring context manager every time we deviate from "inner" seems onerous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the permanent change example and left only the context manager example in there. I'm not sure what the best compromise is. Perhaps we could print some warning message when it's permanently set?

@shoyer
Copy link
Member

shoyer commented Oct 29, 2016

It occurs to me now that the other reason why people avoid global variables is because of concerns around thread safety and distributed computation. Even with the context manager, bad things will happen if you use set_options simultaneously from different threads.

I need to think about this a little more. Copious warnings in the docs might be enough, or maybe entering a set_options context should also enter an explicit threading.Lock() context.

On the other hand, the current implementation is basically identical to how functionality like NumPy's np.errstate() works, and people seem to be pretty happy with that.

@mrocklin does this seem like the sort of change we can do in xarray without seriously regretting it later? Xarray uses basically the same set_options machinery that dask has.

@mrocklin
Copy link
Contributor

I haven't noticed people getting caught by using set_options from multiple threads. It definitely isn't threadsafe but it's usually only used when constructing computations, which is often done from a single thread.

It would be good to ensure that this affects the graph and that the actual tasks don't depend on the global state.

@shoyer
Copy link
Member

shoyer commented Oct 29, 2016

It would be good to ensure that this affects the graph and that the actual tasks don't depend on the global state.

Indeed, we are safe here. align is turned into a bunch of NumPy or dask.array calls, so this option will simply effect how the graph is built.

@chunweiyuan
Copy link
Contributor Author

Just checking in. Is there something else that needs to be added to this PR before merge?

@shoyer
Copy link
Member

shoyer commented Nov 10, 2016

My concern is how this setting currently does not change how we join variables in Dataset math. Consider the current behavior -- it always does an inner join of the data variables:

In [2]: import xarray as xr

In [3]: ds1 = xr.Dataset({'foo': 1, 'bar': 2})

In [4]: ds2 = xr.Dataset({'bar': 2, 'baz': 3})

In [5]: ds1
Out[5]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    bar      int64 2
    foo      int64 1

In [6]: ds2
Out[6]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    bar      int64 2
    baz      int64 3

In [7]: ds1 + ds2
Out[7]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    bar      int64 4

Logically, it makes sense for this option to switch this to an outer join instead.

@chunweiyuan
Copy link
Contributor Author

Alright, if that's desired, I'd just change code in apply_over_both() of dataset._calculate_binary_op to react to different kwarg values, and adding a method in test_dataset.py. Should be able to push before the weekend is over.

@@ -26,6 +26,10 @@ class set_options(object):
"""
def __init__(self, **kwargs):
self.old = OPTIONS.copy()
for key in kwargs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rebase or merge master -- I actually already added in something very similar
https://github.com/pydata/xarray/blob/master/xarray/core/options.py

elif fillna:
# this shortcuts left alignment of variables for fillna

for k in set(lhs_data_vars) & set(rhs_data_vars):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to maintain the order of data variables: should be lhs first, then rhs only variables.

So, probably should use a loop over lhs_data_vars, followed by rhs_data_vars.

for k in set(lhs_data_vars) & set(rhs_data_vars):
dest_vars[k] = f(lhs_vars[k], rhs_vars[k])
if join in ["outer", "left"]:
for k in set(lhs_data_vars) - set(rhs_data_vars):
dest_vars[k] = lhs_vars[k]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only the right default for fillna. For other operations, should be NaN instead (or, more specifically, the result of f(lhs_vars[k], np.nan).

self.assertDatasetEqual(actual, expected)

with xr.set_options(arithmetic_join='outer'):
expected = xr.Dataset({'foo':1, 'bar': 4, 'baz': 3})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, should be xr.Dataset({'foo': nan, 'bar': 4, 'baz': nan}) for consistent outer join semantics

if k in rhs_data_vars:
dest_vars[k] = f(lhs_vars[k], rhs_vars[k])
elif join in ["left", "outer"]:
dest_vars[k] = lhs_vars[k] if fillna else\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per PEP8, use parenthesis to join across multiple lines instead of \





Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much white space here -- by convention, leave only 2 blank lines

missing_0 = xr.DataArray(coords_r, [(dim, coords_r)])
with xr.set_options(arithmetic_join=align_type):
actual = missing_0 + missing_3
missing_0_aligned, missing_3_aligned =\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break after xr.align( instead of using the explicit \

@@ -1,7 +1,8 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
OPTIONS = {'display_width': 80}
OPTIONS = {'display_width': 80,
'arithmetic_join': "inner"}


class set_options(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update the docstring for this class?

@shoyer shoyer merged commit 95bca62 into pydata:master Nov 12, 2016
@shoyer
Copy link
Member

shoyer commented Nov 12, 2016

Thank you @chunweiyuan!

@chunweiyuan
Copy link
Contributor Author

woohoo!

@chunweiyuan chunweiyuan deleted the options_to_binary_ops_kwargs branch November 12, 2016 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants