ENH: Support ExtensionArray (and masked EAs specifically) in indexing #39133
Comments
I would like to start discussing this in more detail, especially the third point from the top post:
For me the end goal is that the operations for which we now use the IndexEngine can support masked arrays without any conversion (e.g. no conversion of a nullable integer array to a float array with NaNs or to an object array, as we are doing now). One option for this that I see is to add a [...]. This might be quite invasive, though (especially since missing values in the index are often not that useful anyway, but we still need to support them), so I'm interested to hear your thoughts about this.
This seems reasonable. If the existing IndexEngine/hashtable code can be cleanly updated to handle masks, that'd be great. If that is too messy, then implementing mask-specific hashtable subclasses seems like a fine plan B.
What is the current way to implement an index supporting my custom ExtensionArray? I noticed that when calling [...]
At the moment there isn't one. You might be able to kludge something together by patching [...]
From a brief call today: it seems (though we're uncertain) that it'd be hard to do this at the Hashtable level. We'd probably handle NA values at the Index or IndexEngine level instead.
I've been looking at this and am currently thinking of 4 areas that need attention
This introduces trouble for the other _engine-using methods because, e.g., engine.get_loc would return indices into the non-missing subset, which would need to be adjusted. This may be simpler to handle in something like a MaskedIndexEngine, not sure yet.
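To make the position-adjustment problem concrete, here is a minimal pure-Python sketch. It is not pandas code: the class name `MaskedEngineSketch`, the use of `None` as the missing-value key, and the choice to return only the first match are all assumptions made for illustration. The lookup table is built on the non-missing subset, so a hit has to be translated back to a position in the full array.

```python
class MaskedEngineSketch:
    """Hypothetical engine that hashes only the non-missing values."""

    def __init__(self, values, mask):
        # mask[i] is True where values[i] is missing.
        # Original positions of the non-missing entries, in order.
        self._positions = [i for i, is_na in enumerate(mask) if not is_na]
        # Lookup table keyed on the non-missing subset; the stored
        # location is relative to that subset, not to the full array.
        self._table = {}
        for sub_pos, orig_pos in enumerate(self._positions):
            # Simplification: keep only the first occurrence of a value.
            self._table.setdefault(values[orig_pos], sub_pos)
        # Position of the first missing entry, if there is one.
        self._na_position = next(
            (i for i, is_na in enumerate(mask) if is_na), None
        )

    def get_loc(self, key):
        # Assumption: None stands in for the missing-value key.
        if key is None:
            if self._na_position is None:
                raise KeyError(key)
            return self._na_position
        sub_pos = self._table[key]       # raises KeyError if absent
        return self._positions[sub_pos]  # adjust back to the full array


engine = MaskedEngineSketch([10, 0, 20, 30], [False, True, False, False])
loc_20 = engine.get_loc(20)
loc_na = engine.get_loc(None)
```

In this example 20 sits at location 1 of the non-missing subset but at position 2 of the full array, which is exactly the adjustment described above.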
Is there any chance to get this feature into v1.4?
@Hoeze PRs are welcome. Core can provide review, but features are generally done by interested parties.
I've recently been coming around towards Joris's preference of stuffing EAs into an Index object instead of into ExtensionIndex (xref #43002). But looking at what ExtensionIndex methods can be refactored away I noticed searchsorted
#38103 fixed #38083 which reported a 5x slowdown in [...]
Well, I have always been OK with a proper implementation.
@phofl did the MaskedIndexEngine PR close this?
Not sure. Since we might want support for generic EAs eventually, I left this open.
I'm confused by this because we do support generic EAs.
One area where the general ExtensionArray support is lacking is storing them in the index (right now they get converted to an ndarray when stored in an Index) and having efficient indexing operations (hashtable, index engines).
Several of our long-time extension dtypes have their own subclass (Categorical, Period, Datetime, IntervalIndex), but we need to solve this generally for ExtensionArrays (so it can also work for external EAs), and should also focus on solving it well for the new nullable ExtensionArrays (using masked arrays).
I think there are multiple aspects to this (probably more, but currently thinking of those):
1) Storing ExtensionArrays in an Index object
Support for "just" storing EAs in the Index and supporting its methods (e.g. falling back to an ndarray for the indexing engine) is probably not that hard. There are PRs #34159 (storing EAs in the base Index class) and #37869 (having a specific ExtensionIndex subclass).
I think both approaches are technically not that different (put the required special cases in if blocks in the base class vs. in overridden methods in the subclass), but for me it's mainly a user API design discussion (summarized as "I don't think that end users should see an ExtensionIndex").
So for this part, we should have that API discussion.
2) A protocol for specifying the values (ndarray) used for indexing operations
While for an initial version of support we can use np.asarray(EA) as the values passed to the IndexEngine, we should ideally have a general method in the EA interface to specify which values can be used for indexing.
There is some discussion related to this in #32586 and #33276 (e.g. can we re-use some of the existing _values_for_.. methods? ...), and we can probably continue this aspect over there.
A general method is mostly important for external EAs, because we will probably have special support for our own EAs: the existing Index subclasses already do this, and for the nullable EAs we need to add this (see next section below).
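As a rough illustration of what such a protocol could look like, here is a small pure-Python sketch. The method name `indexing_values`, the helper `engine_values`, and the toy `CaseInsensitiveStrings` class are hypothetical and are not part of the pandas EA interface; plain lists stand in for ndarrays, and `list(array)` stands in for the np.asarray(EA) fallback.

```python
def engine_values(array):
    """Return the values a (hypothetical) index engine would hash.

    If the array opts in by defining ``indexing_values`` (an assumed
    hook name), use that; otherwise fall back to a plain conversion,
    standing in for ``np.asarray(EA)``.
    """
    hook = getattr(array, "indexing_values", None)
    if callable(hook):
        return list(hook())
    return list(array)


class CaseInsensitiveStrings:
    """Toy array whose lookups should ignore case."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        return iter(self._data)

    def indexing_values(self):
        # Values used for hashing/lookup differ from the stored values.
        return [s.lower() for s in self._data]


with_hook = engine_values(CaseInsensitiveStrings(["A", "b"]))
without_hook = engine_values(("x", "y"))
```

The point of the opt-in hook is that an external EA can decide which representation is hashed, while arrays that do nothing still work through the generic fallback.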
3) Support for masked arrays in the indexing operations (IndexEngine, HashTable, etc)
Specifically to have better support for the nullable dtypes (without needing to convert to ndarray), I think we should look into adding support for using masks in the low-level index operations (IndexEngine, HashTable, etc).
Some (not index-related) hashtable methods like HashTable.unique already have optional support for masks.
I think this is technically the most challenging item, and what this work item would entail needs to be worked out in more detail.
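For readers unfamiliar with the idea, a mask-aware unique can be sketched in pure Python as below. This is only an illustration of the technique, not the signature or implementation of HashTable.unique; the function name `unique_with_mask` and its return shape are assumptions.

```python
def unique_with_mask(values, mask):
    """Return (uniques, uniques_mask) in order of first appearance.

    mask[i] is True where values[i] is missing. All missing entries
    collapse into a single masked slot, and the placeholder stored
    under the mask is never compared against real values.
    """
    seen = set()
    uniques = []
    uniques_mask = []
    na_seen = False
    for val, is_na in zip(values, mask):
        if is_na:
            if not na_seen:
                na_seen = True
                uniques.append(val)  # placeholder; hidden by the mask
                uniques_mask.append(True)
        elif val not in seen:
            seen.add(val)
            uniques.append(val)
            uniques_mask.append(False)
    return uniques, uniques_mask


uniques, uniques_mask = unique_with_mask(
    [1, 0, 1, 2, 0], [False, True, False, False, True]
)
```

Note that no sentinel value is needed: a real 0 and a masked slot that happens to hold 0 stay distinct, which is what lets nullable integers avoid the conversion to float or object.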
cc @jbrockmendel @TomAugspurger @jreback