Description
This splits off the discussion on the constructors from #23185, so that we can have a more focused discussion about it here. It also continues the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594
So the topic of this issue: what should the different constructors look like for the internal EAs and for the Index classes based on those EAs (specifically the datetimelike ones)?
Index constructors
I think that for the Index constructors, there is not that much to discuss.
We have:
- default Index(..) (__new__ or __init__): this is quite overloaded for some of the index classes, but that's the way it is now since they are exposed to the user.
- _simple_new: I think we agree that for those (from Tom's comment here REF: Simplify Period/Datetime Array/Index constructors #23093 (review)), it should basically simply get the EA array and potentially a name:

      @classmethod
      def _simple_new(cls, values, name=None):
          # type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
          result = object.__new__(cls)
          result._data = values
          result.name = name
          result._reset_identity()
          return result

- _shallow_copy and _shallow_copy_with_infer might need another look to propose something.
Array constructors
The default Index constructors mix a lot of different things (which is partly what led to the suite of other constructors), and I personally don't think this is something we necessarily need to repeat for the Array constructors.
In the discussion related to
Each Array type might have its own specific constructors (like we have IntervalArray.from_breaks and others), but I think that in the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594, there are 3 clearly defined use cases that are generic across the different dtypes. Constructing from:
- physical values (ordinals + freq for Period, datetime64 + optional tz for Datetime, int ndarray + mask for IntegerArray)
- extension Array (i.e. accept itself)
- array of scalars (e.g. an object ndarray of Period or Timestamp objects)
For this last item, we already have _from_sequence for exactly this as part of the EA interface.
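To make the three use cases concrete, here is a minimal sketch in plain Python. ToyPeriod, ToyPeriodArray and _from_array are hypothetical names made up for illustration (only _from_sequence comes from the EA interface), and this is not how the pandas PeriodArray is implemented:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ToyPeriod:
    """Hypothetical scalar: an integer ordinal plus a frequency string."""
    ordinal: int
    freq: str


class ToyPeriodArray:
    """Toy stand-in for an EA; not the pandas PeriodArray."""

    def __init__(self, ordinals: Sequence[int], freq: str) -> None:
        # Use case 1: physical values only. Anything else is rejected.
        if isinstance(ordinals, ToyPeriodArray):
            raise TypeError("use ToyPeriodArray._from_array for an existing array")
        if not all(isinstance(o, int) for o in ordinals):
            raise TypeError("ordinals must be integers; use _from_sequence for scalars")
        self._ordinals: List[int] = list(ordinals)
        self.freq = freq

    @classmethod
    def _from_array(cls, other: "ToyPeriodArray") -> "ToyPeriodArray":
        # Use case 2: accept an instance of the array type itself
        # (hypothetical name, chosen only for this sketch).
        return cls(other._ordinals, other.freq)

    @classmethod
    def _from_sequence(cls, scalars: Sequence[ToyPeriod]) -> "ToyPeriodArray":
        # Use case 3: a sequence of scalar objects sharing one frequency.
        freqs = {p.freq for p in scalars}
        if len(freqs) != 1:
            raise ValueError("expected exactly one frequency")
        return cls([p.ordinal for p in scalars], freqs.pop())


a = ToyPeriodArray([10, 11, 12], freq="M")
b = ToyPeriodArray._from_array(a)
c = ToyPeriodArray._from_sequence([ToyPeriod(10, "M"), ToyPeriod(11, "M")])
```

Each entry point only has to handle one kind of input, so the caller has to state which kind of data it is holding.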
So one option is to simply accept all three of those in the main Array __init__; another option is to have separate constructors for them. I think this is what the discussion is mainly about?
I see the following advantages of keeping them separate (or at least keep the third item separate):
- Code clarity throughout the Array implementation: To quote Tom from (REF: Simplify Datetimelike constructor dispatching #23140 (comment)):
From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.
- Keep concerns separated inside the constructor -> code clarity in the constructors themselves
- We already decided that we will not rely on the default constructor in the EA interface but rather have the specific _from_sequence and _from_factorized, so we cannot use it anyway in places that need to deal with EAs in general
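As a rough illustration of that last point, the sketch below uses two toy array classes (ToyOffsets and ToyChars, both hypothetical) whose __init__ signatures have nothing in common. A generic helper, here a made-up take function, can still rebuild either one because it only goes through _from_sequence:

```python
from typing import Iterator, List, Sequence


class ToyOffsets:
    """Toy array storing integers as offsets from a base value."""

    def __init__(self, offsets: List[int], base: int) -> None:
        self._offsets, self._base = offsets, base

    @classmethod
    def _from_sequence(cls, scalars: Sequence[int]) -> "ToyOffsets":
        base = min(scalars) if scalars else 0
        return cls([s - base for s in scalars], base)

    def __iter__(self) -> Iterator[int]:
        return (self._base + o for o in self._offsets)


class ToyChars:
    """Toy array storing single characters in one joined string."""

    def __init__(self, joined: str) -> None:
        self._joined = joined

    @classmethod
    def _from_sequence(cls, scalars: Sequence[str]) -> "ToyChars":
        return cls("".join(scalars))

    def __iter__(self) -> Iterator[str]:
        return iter(self._joined)


def take(arr, indices: Sequence[int]):
    """Generic helper: never calls type(arr)(...), only _from_sequence."""
    scalars = list(arr)
    return type(arr)._from_sequence([scalars[i] for i in indices])


ints = take(ToyOffsets([0, 5, 7], base=100), [2, 0])
chars = take(ToyChars("abcd"), [3, 0])
```

The helper would break if it tried to call the default constructor, since it cannot know that one class wants (offsets, base) and the other a single string.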
Also note that this is basically what we have for the new IntegerArray. Its __init__ only accepts an ndarray of integers + mask, and there is a separate function integer_array that provides a more general-purpose constructor (from list, from floats, detecting NaNs as missing values, etc.), which is then used in _from_sequence.
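The same split can be sketched without pandas or NumPy. ToyIntegerArray and toy_integer_array below are hypothetical stand-ins that only mimic the shape of the IntegerArray / integer_array pair described above (plain lists instead of ndarrays); they are not the actual implementation:

```python
import math
from typing import Iterable, List, Optional, Union


class ToyIntegerArray:
    """Toy masked integer array; not pandas' IntegerArray."""

    def __init__(self, values: List[int], mask: List[bool]) -> None:
        # Strict: only already-clean integers plus a same-length boolean mask.
        if len(values) != len(mask):
            raise ValueError("values and mask must have the same length")
        if not all(type(v) is int for v in values):
            raise TypeError("values must be ints; use toy_integer_array to coerce")
        if not all(type(m) is bool for m in mask):
            raise TypeError("mask must contain bools")
        self._values, self._mask = values, mask

    @classmethod
    def _from_sequence(cls, scalars) -> "ToyIntegerArray":
        # The EA-interface entry point delegates to the general-purpose helper.
        return toy_integer_array(scalars)

    def to_list(self) -> List[Optional[int]]:
        return [None if m else v for v, m in zip(self._values, self._mask)]


def toy_integer_array(data: Iterable[Union[int, float, None]]) -> ToyIntegerArray:
    """General-purpose constructor: coerces floats, treats None/NaN as missing."""
    values: List[int] = []
    mask: List[bool] = []
    for item in data:
        if item is None or (isinstance(item, float) and math.isnan(item)):
            values.append(0)  # placeholder; hidden by the mask
            mask.append(True)
            continue
        if not isinstance(item, (int, float)):
            raise TypeError(f"unsupported element {item!r}")
        if isinstance(item, float) and not item.is_integer():
            raise TypeError(f"cannot safely cast {item!r} to an integer")
        values.append(int(item))
        mask.append(False)
    return ToyIntegerArray(values, mask)


arr = toy_integer_array([1, 2.0, None, float("nan")])
```

All the guessing about input types lives in the helper function, while __init__ stays a thin, predictable wrapper around values and mask.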