Skip to content

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

Open
@delgadom

Description

Throwing this out there - happy to be shot down if people are opposed.

Current behavior / griping

Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        return bool(self.data_vars)

This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.

Currently, the behavior of Dataset.__bool__ is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.

In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see #6122 and #6123.

This has been discussed before - see #4290. This discussion focused on the question "should bool(xr.Dataset({'a': False})) return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.

Proposed behavior

I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.

In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:

In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))

~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1532     @final
   1533     def __nonzero__(self):
-> 1534         raise ValueError(
   1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I'll eat my hat :)

Implementation

This could be as simple as raising an error on ds.__bool__, something like:

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        raise ValueError(
            "The truth value of a Dataset is ambiguous. Reduce the data "
            "variables to a scalar value with any(ds.values()) or "
            "all(ds.values())."
        )

The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:

        assert bool(ds)

This would need to be changed to:

        with pytest.raises(ValueError):
            bool(ds)

If this sounds good, I can submit a PR with these changes.

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions