bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124
Description
Throwing this out there - happy to be shot down if people are opposed.
Current behavior / griping
Currently, coercing a dataset to a boolean invokes ds.__bool__
, which in turn calls bool(ds.data_vars)
:
class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
...
def __bool__(self) -> bool:
return bool(self.data_vars)
This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.
Currently, the behavior of Dataset.__bool__
is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds)
. As far as I can tell, this is essentially the same as len(ds.data_vars) > 0
.
In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see #6122 and #6123.
This has been discussed before - see #4290. This discussion focused on the question "should bool(xr.Dataset({'a': False}))
return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.
Proposed behavior
I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False}))
raise an error, similar to the implementation in pd.Series
.
In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:
In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))
~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
1532 @final
1533 def __nonzero__(self):
-> 1534 raise ValueError(
1535 f"The truth value of a {type(self).__name__} is ambiguous. "
1536 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds)
in the wild I'll eat my hat :)
Implementation
This could be as simple as raising an error on ds.__bool__
, something like:
class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
...
def __bool__(self) -> bool:
raise ValueError(
"The truth value of a Dataset is ambiguous. Reduce the data "
"variables to a scalar value with any(ds.values()) or "
"all(ds.values())."
)
The only other change that would be needed is an assertion that directly calls bool(ds)
in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:
assert bool(ds)
This would need to be changed to:
with pytest.raises(ValueError):
bool(ds)
If this sounds good, I can submit a PR with these changes.