-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Throwing this out there - happy to be shot down if people are opposed.
Current behavior / griping
Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):
class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
...
def __bool__(self) -> bool:
return bool(self.data_vars)This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.
Currently, the behavior of Dataset.__bool__ is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.
In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see #6122 and #6123.
This has been discussed before - see #4290. This discussion focused on the question "should bool(xr.Dataset({'a': False})) return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.
Proposed behavior
I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.
In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:
In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))
~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
1532 @final
1533 def __nonzero__(self):
-> 1534 raise ValueError(
1535 f"The truth value of a {type(self).__name__} is ambiguous. "
1536 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I'll eat my hat :)
Implementation
This could be as simple as raising an error on ds.__bool__, something like:
class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
...
def __bool__(self) -> bool:
raise ValueError(
"The truth value of a Dataset is ambiguous. Reduce the data "
"variables to a scalar value with any(ds.values()) or "
"all(ds.values())."
)The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:
assert bool(ds)This would need to be changed to:
with pytest.raises(ValueError):
bool(ds)If this sounds good, I can submit a PR with these changes.