-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: indexing with missing labels deprecation not applied to MultiIndex #39424
Comments
Hi, thanks for your report. @jbrockmendel I think this is somehow intended? I remember vaguely that we have discussed this in the past |
this is as intended. if you have a label that is not in the index, you must use |
@jreback i think the issue is that the first case doesn't raise, not that the second case does |
To give this even more context (possibly a bug) @jbrockmendel @jreback :
The error message hints that this behaviour is known or at least expected:
I took a look but the |
@attack68 I might be able to help a bit here. In case of a nested tuple as indexer, we dispatch to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_locs.html As you already mentioned this is a bit inconsistent. I try to write the cases we encounter up:
The ways get_locs is implemented this is quite hard to fix with a meaningful error message. Especially the cases where something like |
Hi @phofl: reopening this to discuss the how MultiIndex generates KeyIndex. I have actually changed my mind in the interim for good reason. My original argument was consistency, but I think a better argument for now avoiding KeyError is practical use and no availability to reindex with a MultiIndex. (also take a look here #39775). In summary for dealing with missing key in a single index the documented recommendation is For a MultiIndex, say:
then if you intend to index with say: ["f", :] or [(slice(None), 4), :] then you would get a KeyError, so you specifically have to ensure you don't do that by testing for key presence in the specific You cannot simple use
Thus, I think the behaviour should be considered very carefully for how MultiIndex slicing should operate. |
Let's take a step back first please, because I've got a question what you would expected when reindexing a MultiIndex. The behavior you are describing basically comes down to:
which returns
with levels
So the issue you mentioned above does not really exist here, since "c" is added as an unused level? Would you expect something different here? |
Ok good point, actually I haven't revisited this for a while, and the last time I tried re-indexing a MultiIndex it didn't quite work out. So good to start with the basics. Yes I think that "c" getting added as an unused level value is probably the best way to go. This, of course, differs to a SingleIndex, where "c" would be visible and the missing data would be inserted as NaN. With a single index Keys are either present or not-present and that is the determinant in a KeyError. In a MultiIndex level keys are either present or not-present in each level with the potential that a key containing present-level values across all levels is actually not-present in the MultiIndex, e.g. ("a", 3). Using your example consider the following:
I think there are lots of combinations here worth reviewing but some obvious ones:
My risk averse opinion is that if raising KeyError is to be more useful than returning an empty dataframe it must be consistently applied. If not, the tail risk bad cases are increased. That I wanted to raise a cautionary flag against PRs that add more KeyErrors to MultiIndexes. Is there any kind of developer document on how multiindex .loc is supposed to deal with all the possible combinations? |
I don't think so apart from missing keys should raise a KeyError. I am not sure why you would do the following:
Reindexing already selects the wanted labels, so you don't need loc anymore. You could do Apart from C I would argue that every one of these cases should raise. But we are not that consistent here. This is something we should work on in the future. I think C should raise too probably, but the underlying implementation makes this hard. Would probably need a do over to be able to do this correctly. I'll try to look into this in the coming weeks, maybe we can get this mor consistent with 1.3 |
if And what about if
I'll try and add something to the discussion also, maybe a formal working doc on the ideas and working MultiIndexes may prove helpful in the future, especially if new devs come in, since the mechanics is quite complex for a user let alone a developer I expect! |
@phofl, I apologise for the long message here, but I think having a developer specification in the code/docs somewhere for how this works is needed and the below provides a starting framework (even if this is greatly re-worked). It also collects most of my opinion into a consistent structure. Pandas Indexing with
|
No need to apologize. I agree, this would be useful to have in the docs. Thanks for writing this up. I agree with most of the things you have written. This would make the behavior consistent for MultiIndexes.
I don't think this is the recommended way to do this. You should do Hence doing this allows us to raise when having unused levels without restricting functionality from a user perspective. I think we should raise on unused levels, because reindex is not the only way to create unused levels in a MultiIndex. This can also happen accidentially (see #41362 and associated issues) and I don't think something which is done for performance reasons (see link in #36227) should have impact on the behavior here. Imo we should move something like this (after we agreed on the content of course) to the docs somewhere. I do think this should be visible to users too. |
But
For trying to exemplify this with a MultiIndex, let me use a working case in finance where 3 assets have different trading calendars, and possibly different collected data:
In my proposal you would get the following:
but if you can reindex without necessarily adding any keys:
I see there are downstream issues to this decision - and I'm keeping an open mind! If unused levels are raised it would be good and interesting to finds the downstream implications it has also. |
I think I understand what you are trying to do, but I am still not sure what you are needing loc for. Could you give me a usecase why we would need loc after using reindex and the unused level thing becomes a problem? Sorry if this is too obvious, but I can not see it. Edit: Everything I can think of right now can be done without running into the KeyError, if you really need the loc call after reindexing |
The workarounds discussed focus on reindex, but I think
Granted that's more verbose than a single loc call. Also it's not quite right if ser.index.levels[2] is an IntervalIndex, which is a whole other can of worms. I'd be open to having a method to make this more concise, but am leaning towards preferring it not be |
For big data with multiindexes with a lot of levels the
Maybe this is best way. I really like the flexibility of |
Another case I've run into (@attack68 LMK if this belongs in a different thread) In #27591 we have a case where a level contains a tuple, but incorrectly goes through the get_locs path and returns a empty Series
We can fix this by checking for this case before dispatching to |
Shoot, it gets even worse: what if we have multiple levels like lev2? |
@jbrockmendel here is a pathological example for you:
! I would suggest that since Since the tuple (and only the tuple) is used as the notation by which to identify axes and levels within loc this then provides a complete separation. This is also supported by the fact it is not possible to construct index levels values from lists or sets:
In this regime the following two commands have separate and distinct meaning:
(not entirely sure why you can't use Note the workaround for this pathological example is:
edit: with so much complexity involved in these matters I also think it is advantageous to restrict input format to a very specific set of instructions to help not only with coding but with educating how to use. |
Sounds like you're suggesting we deprecate the behavior for a tuple but retain it for other listlike? |
Besides backwards compatability, is there anything that would be affected by deprecating being able to use tuples in loc to also mean a sequence of labels, as opposed to just defining the axis and levels structure, and specific level value keys? |
The only downside that comes to mind is that we would be deprecating it here but not in a bunch of other places where tuples can be used to indicate sequences (id be on board for deprecating that across the board, but that's a daunting task) |
Ignoring some of the digressive comments in this thread and just analysing the title, of implementing "KeyError" consistently, running my script reveals the following open areas for improvement:
This also shows the same issue for Monotonic MultiIndexes.
|
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
Should this behaviour be consistent?
A recent decision to move to KeyError: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
Expected Output
Both return KeyError?
This can be linked to #32125 also.
The text was updated successfully, but these errors were encountered: