ENH: Enable using a boolean loc in a non-boolean index #52102

Open
1 of 3 tasks
alonme opened this issue Mar 21, 2023 · 15 comments
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@alonme
Contributor

alonme commented Mar 21, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When using .loc with a boolean label on a DataFrame whose index contains both booleans and non-booleans (so the index dtype is not bool), an error is raised:

import pandas as pd

df = pd.DataFrame({'a': [True, False, "Missing"]}).set_index("a")
df.loc[True]

raises

KeyError: 'True: boolean label can not be used without a boolean index'

I am unsure why this check is performed in _validate_key, but the behavior was unexpected to me, especially since lookups work with other combinations of types (int and string, for example).

Feature Description

I am not sure why this is currently unsupported, so I am unsure whether a new API is needed or whether the current .loc API can be used.

Alternative Solutions

df = pd.DataFrame({'a': [True, False, "Missing"]}).set_index("a")
df[df.index == True].loc[True]

This is my current workaround, but I don't believe it is a good one.

Additional Context

No response

@alonme alonme added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 21, 2023
@DeaMariaLeon DeaMariaLeon added Indexing Related to indexing on series/frames, not to indexes themselves API Design and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 21, 2023
@Rylie-W
Contributor

Rylie-W commented Mar 27, 2023

take

@phofl
Member

phofl commented Mar 30, 2023

@jbrockmendel thoughts here? I am not really sure this is a good idea, since Boolean elements are normally treated as a mask.

@jbrockmendel
Member

I'd expect this to work. The desired behavior seems pretty unambiguous.

@phofl
Member

phofl commented Mar 30, 2023

If you provide a list of booleans it gets a bit sketchy?

@jbrockmendel
Member

If you provide a list of booleans it gets a bit sketchy?

Yah. I think in principle we wouldn't allow a mask with .loc.

@Rylie-W
Contributor

Rylie-W commented Mar 30, 2023

If you provide a list of booleans it gets a bit sketchy?

Yah. I think in principle we wouldn't allow a mask with .loc.

So in the future, a list of bool elements would be considered valid for loc only if the index column(s) contain one or more bool elements; otherwise it would raise a KeyError. And it wouldn't check whether the length of the list of booleans matches the length of the row axis.

Do I understand it right?

@phofl
Member

phofl commented Mar 30, 2023

Not allowing a mask would force chained indexing though? How would you filter the columns with a mask then?

@jbrockmendel
Member

Not allowing a mask would force chained indexing though?

Good point. That's a tough one.

Not allowing a mask would force chained indexing though? How would you filter the columns with a mask then?

Wouldn't iloc work?

@phofl
Member

phofl commented Mar 30, 2023

Fair but not ideal I guess for row selection

@alonme
Contributor Author

alonme commented May 7, 2023

Any news on this?

@Rylie-W
Contributor

Rylie-W commented May 9, 2023

I have implemented the feature and am working on fixing and creating the corresponding unit tests.

As others have previously pointed out, there may be ambiguity when the input is a boolean array. One possible solution is to first treat the boolean array as labels to look up. If there are no boolean elements in the index column, we then check whether its length matches that of the DataFrame object. If it matches, it is treated as a mask; otherwise an error is raised. I'll implement this if the maintainers agree with the solution, or a better approach if they have one.
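
For readers following along, the resolution order described above can be sketched in plain Python without pandas. This is only an illustration of the proposal as worded in this comment, not the actual pandas implementation: the function name resolve_boolean_list_key is hypothetical, the index is modelled as a simple list of labels, and the choice of IndexError for a length mismatch is an assumption, since the comment does not say which error would be raised.

```python
from typing import Hashable, List, Sequence


def resolve_boolean_list_key(
    index_labels: Sequence[Hashable], key: Sequence[bool]
) -> List[int]:
    """Hypothetical helper: return the row positions selected by a boolean list.

    Step 1: if the index holds any boolean label, treat `key` as labels.
    Step 2: otherwise treat `key` as a mask, but only if the lengths match.
    """
    has_bool_label = any(isinstance(label, bool) for label in index_labels)

    if has_bool_label:
        positions: List[int] = []
        for wanted in key:
            # Match only genuine bool labels so that 1/0 are not picked up here.
            matches = [
                i
                for i, label in enumerate(index_labels)
                if isinstance(label, bool) and label == wanted
            ]
            if not matches:
                raise KeyError(wanted)
            positions.extend(matches)
        return positions

    # No boolean labels in the index: fall back to mask semantics.
    if len(key) != len(index_labels):
        # Assumption: the error type is not specified in the proposal.
        raise IndexError("boolean key length does not match the axis length")
    return [i for i, flag in enumerate(key) if flag]


# Index with boolean labels: the key is read as labels.
print(resolve_boolean_list_key([True, False, "Missing"], [True]))
# Index without boolean labels: the key is read as a mask.
print(resolve_boolean_list_key(["a", "b", "c"], [True, False, True]))
# Same-length key on an index with boolean labels is still read as labels,
# which is the ambiguity raised earlier in the thread.
print(resolve_boolean_list_key([True, False, "Missing"], [True, False, True]))
```

The last call shows why the maintainers called this case sketchy: a full-length boolean list that a user may have intended as a mask is interpreted as three label lookups under this rule.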

@jbrockmendel
Member

Not allowing a mask would force chained indexing though? How would you filter the columns with a mask then?

If we eventually get laziness in the filtering (xref #52980) then the chained indexing wouldn't be so much of a problem (though I guess it still would be for setitem).

@phofl
Member

phofl commented May 11, 2023

Chained indexing doesn’t work with CoW (copy-on-write), so this would be tricky, but it could be worked around with laziness, I guess.

@Charlie-XIAO
Contributor

Charlie-XIAO commented May 18, 2023

Hi @Rylie-W, as you have mentioned in #53267, the following should work (though you haven't tried it due to installation issues).

if isinstance(key, bool) and not (
    ax.dtype.kind in ["b", "O"]
    or (isinstance(ax, MultiIndex) and ax._levels[0].dtype.kind in ["b", "O"])
):
    raise KeyError(
        f"{key}: boolean label can not be used without a boolean index"
    )

I have confirmed that with the change above we get the following (which IMO is the desired behavior).

>>> import pandas as pd
>>> pd.__version__
'2.1.0.dev0+796.gb2bb68a88c.dirty'
>>> df = pd.DataFrame(
...     {"a": [0, 1, True, False, "Missing"], "b": [1, 2, 3, 4, 5]}
... ).set_index("a")
>>> df
         b
a
0        1
1        2
True     3
False    4
Missing  5
>>> df.loc[1]
      b
a
1     2
True  3
>>> df.loc[True]
      b
a
1     2
True  3

Are you making a PR for this? Otherwise I can help as well.
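
As a side note on the session above: df.loc[1] and df.loc[True] returning the same two rows is consistent with Python's own rules, where bool is a subclass of int, True == 1, and both hash to the same value. Whether this is the exact mechanism inside the pandas object-dtype lookup is an assumption here, not something confirmed in this thread. The sketch below uses only a plain dict, and the helper strict_positions is hypothetical, shown just to contrast a type-aware match.

```python
# Plain-Python illustration (no pandas): True and 1 are the same dict key.
assert isinstance(True, int)
assert True == 1 and hash(True) == hash(1)

index_labels = [0, 1, True, False, "Missing"]

# Group row positions by label using a hash table, as a dict does.
labels = {}
for position, label in enumerate(index_labels):
    labels.setdefault(label, []).append(position)

print(labels)  # {0: [0, 3], 1: [1, 2], 'Missing': [4]}


def strict_positions(seq, key):
    """Hypothetical type-aware lookup: match on both type and value."""
    return [i for i, label in enumerate(seq) if type(label) is type(key) and label == key]


print(strict_positions(index_labels, True))  # [2]
print(strict_positions(index_labels, 1))     # [1]
```

Under hash-based matching, a lookup for True also finds the label 1 (and False also finds 0), whereas a type-aware match would separate them; which of the two is the desired .loc behavior is a design decision for the maintainers.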

@Rylie-W
Contributor

Rylie-W commented May 18, 2023

Are you making a PR for this? Otherwise I can help as well.

Yes, I'm working on that and adding some tests for it. Thanks for the confirmation!
