API: Index and Array constructors design #23212

Closed
@jorisvandenbossche

Splitting off the discussion on the constructors from #23185, to have a more focused discussion about that here. This also continues the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594

So the topic of this issue: what should the different constructors look like for the internal EAs and the Index classes based on those EAs (specifically for the datetimelike ones)?

Index constructors

I think for the Index constructors, there is not that much to discuss.
We have:

  • default Index(..) (__new__ or __init__): this is quite overloaded for some of the index classes, but that's the way it is now since they are exposed to the user.

  • _simple_new: I think we agree (see Tom's comment in REF: Simplify Period/Datetime Array/Index constructors #23093 (review)) that it should simply take the EA array and optionally a name:

    @classmethod
    def _simple_new(cls, values, name=None):
        # type: (Union[ndarray, ExtensionArray], Optional[Any]) -> Index
        result = object.__new__(cls)
        result._data = values
        result.name = name
        result._reset_identity()
        return result
    
  • _shallow_copy and _shallow_copy_with_infer might need another look before proposing something.
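
To illustrate the split between the overloaded public constructor and _simple_new, here is a minimal sketch. ToyIndex is a hypothetical stand-in, not pandas code, and it omits _reset_identity; it only shows that _simple_new attaches already-prepared values without coercing or copying them.

```python
class ToyIndex:
    """Hypothetical stand-in for an Index class; not the pandas implementation."""

    def __new__(cls, data, name=None):
        # Overloaded public constructor: coerce arbitrary user input first.
        values = tuple(data)
        return cls._simple_new(values, name=name)

    @classmethod
    def _simple_new(cls, values, name=None):
        # Fast path: trust that `values` is already the right container.
        result = object.__new__(cls)
        result._data = values
        result.name = name
        return result


idx = ToyIndex([1, 2, 3], name="a")

# Internal code that already holds prepared values can skip the coercion step.
fast = ToyIndex._simple_new(idx._data, name="b")
print(fast._data is idx._data)
```

The point of the design is that internal callers never pay for, or get surprised by, the inference done in the public constructor.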

Array constructors

The default Index constructors mix a lot of different things (which is partly what led to the suite of other constructors), and I personally don't think this is something we necessarily need to repeat for the Array constructors.

In the discussion related to

Each Array type might have its own specific constructors (like we have IntervalArray.from_breaks and others), but I think that in the discussion we were having in https://github.com/pandas-dev/pandas/pull/23140/files#r225218594, there are 3 clearly defined use cases that are generic across the different dtypes. Constructing from:

  1. physical values (ordinals + freq for Period, datetime64 + optional tz for Datetime, int ndarray + mask for IntegerArray)
  2. extension Array (i.e. accept itself)
  3. array of scalars (eg an object ndarray of Period or Timestamp objects)

For this last item, we already have _from_sequence for exactly this as part of the EA interface.

So one option is to simply accept all three of those in the main Array __init__; another option is to have separate constructors for them. I think this is what the discussion is mainly about?
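
As a rough sketch of what keeping the three cases separate could look like, here is a toy example. ToyPeriod, ToyPeriodArray and _from_array are hypothetical names made up for illustration; only _from_sequence is a name taken from the EA interface, and nothing here reflects the actual PeriodArray code.

```python
from collections import namedtuple

# Hypothetical scalar type: an ordinal plus a frequency string.
ToyPeriod = namedtuple("ToyPeriod", ["ordinal", "freq"])


class ToyPeriodArray:
    """Hypothetical array of period ordinals; illustration only."""

    def __init__(self, ordinals, freq):
        # Case 1: physical values only. No inference, no scalar unpacking.
        if not all(isinstance(o, int) for o in ordinals):
            raise TypeError("ordinals must be integers")
        self._ordinals = list(ordinals)
        self.freq = freq

    @classmethod
    def _from_array(cls, other):
        # Case 2: another instance of the same array type.
        return cls(other._ordinals, other.freq)

    @classmethod
    def _from_sequence(cls, scalars):
        # Case 3: a sequence of scalar objects sharing one frequency.
        scalars = list(scalars)
        freqs = {p.freq for p in scalars}
        if len(freqs) != 1:
            raise ValueError("scalars must share exactly one freq")
        return cls([p.ordinal for p in scalars], freqs.pop())


arr = ToyPeriodArray._from_sequence([ToyPeriod(10, "D"), ToyPeriod(11, "D")])
copy = ToyPeriodArray._from_array(arr)
print(copy._ordinals, copy.freq)
```

With this layout, passing scalars to __init__ fails loudly instead of being silently inferred, so each caller has to state which kind of data it holds.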

I see the following advantages of keeping them separate (or at least keep the third item separate):

  • Code clarity throughout the Array implementation: To quote Tom from (REF: Simplify Datetimelike constructor dispatching #23140 (comment)):

    From the WIP PeriodArray PR, I found that having to think carefully about what type of data I had forced some clarity in the code. I liked having to explicitly reach for that _from_periods constructor.

  • Keep concerns separated inside the constructor -> code clarity in the constructors themselves
  • We already decided that we will not rely on the default constructor in the EA interface but rather have specific _from_sequence and _from_factorized constructors, so we cannot use it anyway in places that need to deal with EAs in general

Also note that this is basically what we have for the new IntegerArray. Its __init__ only accepts an ndarray of integers + mask, and there is a separate function integer_array that provides a more general purpose constructor (from list, from floats, detecting NaNs as missing values, etc.), which is then used in _from_sequence.
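
The IntegerArray pattern described above can be sketched roughly as follows. ToyIntegerArray and toy_integer_array are hypothetical, pure-Python stand-ins (lists instead of ndarrays, and a placeholder of 0 for missing slots is my assumption), so this shows the shape of the design rather than the actual implementation.

```python
import math


class ToyIntegerArray:
    """Hypothetical masked integer array; strict __init__, illustration only."""

    def __init__(self, values, mask):
        # Strict constructor: integers plus a boolean mask of equal length.
        if len(values) != len(mask):
            raise ValueError("values and mask must have the same length")
        if not all(isinstance(v, int) for v in values):
            raise TypeError("values must be integers")
        if not all(isinstance(m, bool) for m in mask):
            raise TypeError("mask must be booleans")
        self._data = list(values)
        self._mask = list(mask)

    @classmethod
    def _from_sequence(cls, scalars):
        # The generic EA entry point delegates to the lenient helper.
        return toy_integer_array(scalars)


def toy_integer_array(data):
    """Lenient constructor: accepts lists, integral floats, NaN/None as missing."""
    values, mask = [], []
    for item in data:
        is_nan = isinstance(item, float) and math.isnan(item)
        if item is None or is_nan:
            values.append(0)  # placeholder value hidden by the mask (assumption)
            mask.append(True)
        elif isinstance(item, float) and not item.is_integer():
            raise TypeError(f"cannot safely cast {item!r} to integer")
        else:
            values.append(int(item))
            mask.append(False)
    return ToyIntegerArray(values, mask)


arr = ToyIntegerArray._from_sequence([1, 2.0, float("nan"), None])
print(arr._data, arr._mask)
```

All of the inference lives in the helper, so the __init__ stays small and predictable for internal callers that already have integers and a mask.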

cc @TomAugspurger @jreback @jbrockmendel

Labels: API Design; Closing Candidate (May be closeable, needs more eyeballs); Constructors (Series/DataFrame/Index/pd.array Constructors)
