Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613

Closed
@jorisvandenbossche


Context: the future string dtype for 3.0 (currently enabled with pd.options.future.infer_string = True) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).

As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as pd.StringDtype(storage="pyarrow_numpy") because we wanted to reuse the storage keyword but "pyarrow" is already taken (by the dtype using pd.NA introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.

But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the storage keyword, but we could also add new keywords to distinguish the dtype variants.

That got me thinking and listing some possible options here:

  • Add an extra keyword that distinguishes the NA sentinel (and with that implicitly the type of missing value semantics):
    • Possible names for pd.StringDtype(storage="python"|"pyarrow", <something>):
      • semantics="numpy" (and the other would then be "nullable" or "arrow" or ..?)
      • na_value=np.nan
      • na_marker=np.nan
      • missing=np.nan
      • nullable=False (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)
    • One drawback here is that I don't think users should ever explicitly write pd.StringDtype(storage="pyarrow", na_value=np.nan), as that is not future proof. But defaulting to na_value=np.nan (to avoid having to specify it) is then not backwards compatible with the current pd.StringDtype(storage="pyarrow").
  • Add a new keyword separate from storage to determine the storage/backend that only controls the new variants with NaN.
    • Given we are using storage right now, but speak about "backend" in other places, we could add for example a backend keyword, where StringDtype(storage="python"|"pyarrow") keeps resulting in the dtypes using NA (backwards compatible), while doing StringDtype(backend="python"|"pyarrow") gives you the new dtypes using NaN (and specifying both then obviously errors)
    • Having two keywords that are mutually exclusive but essentially control the same thing is not great API design, but it does avoid having to specify two keywords (or having the confusing names).
    • One question is which keyword name to use. backend has prior use in the "dtype_backend" terminology. Irv suggested nature below.
  • For completeness, we can also still come up with a better storage name than "pyarrow_numpy" and stick to that single existing keyword. Suggestions from the PDEP PR:
    • "pyarrow_nan"
    • "pyarrow_legacy" (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later)
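
To make the first option concrete, here is a minimal sketch in plain Python of how a storage keyword combined with an na_value keyword would span the four dtype variants. It is not pandas code: StringDtypeSketch, the NA stand-in object, and the variant property are hypothetical names invented for this illustration, and math.nan is used in place of np.nan so the sketch has no dependencies.

```python
import math
from dataclasses import dataclass


class _NAType:
    """Hypothetical stand-in for pd.NA, so the sketch needs no pandas install."""

    def __repr__(self) -> str:
        return "<NA>"


NA = _NAType()


@dataclass(frozen=True)
class StringDtypeSketch:
    """Toy model of StringDtype(storage=..., na_value=...); not pandas code."""

    storage: str = "python"
    # Defaulting to NA keeps an existing StringDtype(storage="pyarrow") call
    # meaning the NA variant, which is the backwards-compatibility concern.
    na_value: object = NA

    def __post_init__(self) -> None:
        if self.storage not in ("python", "pyarrow"):
            raise ValueError(f"unknown storage: {self.storage!r}")
        if self.na_value is not NA and not self._is_nan(self.na_value):
            raise ValueError("na_value must be NA or NaN")

    @staticmethod
    def _is_nan(value: object) -> bool:
        return isinstance(value, float) and math.isnan(value)

    @property
    def variant(self) -> str:
        """Label for the (storage, sentinel) combination."""
        sentinel = "NA" if self.na_value is NA else "NaN"
        return f"{self.storage}/{sentinel}"


# Enumerate all four combinations of storage and missing value sentinel.
variants = [
    StringDtypeSketch(storage=s, na_value=v).variant
    for s in ("python", "pyarrow")
    for v in (NA, math.nan)
]
print(variants)
# ['python/NA', 'python/NaN', 'pyarrow/NA', 'pyarrow/NaN']
```

With the default left at NA, getting the NaN variant requires spelling out na_value explicitly, which is the drawback noted in the list above.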

After writing this down, I think my current preference would go to StringDtype(backend="python"|"pyarrow"), as that seems the simplest for most users (it's a bit confusing for those who already explicitly used storage, but most users have never done that)
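
As a rough illustration of the storage/backend proposal, the sketch below models the keyword resolution in plain Python. The function resolve_string_dtype and its return shape are hypothetical and exist only for this example; nothing here is an existing pandas API. What happens when neither keyword is given is not specified in this issue, so the sketch assumes it behaves like storage="python".

```python
from typing import Optional, Tuple

_VALID = ("python", "pyarrow")


def resolve_string_dtype(
    storage: Optional[str] = None, backend: Optional[str] = None
) -> Tuple[str, str]:
    """Return (data backing, missing value sentinel) for a hypothetical call."""
    if storage is not None and backend is not None:
        # The two keywords control the same thing, so passing both is an error.
        raise TypeError("specify either 'storage' or 'backend', not both")

    if backend is not None:
        if backend not in _VALID:
            raise ValueError(f"unknown backend: {backend!r}")
        # New keyword: selects the new variants that use NaN.
        return backend, "NaN"

    # Assumption of this sketch: no keyword behaves like storage="python".
    chosen = "python" if storage is None else storage
    if chosen not in _VALID:
        raise ValueError(f"unknown storage: {chosen!r}")
    # Existing keyword: keeps resulting in the variants that use NA.
    return chosen, "NA"


print(resolve_string_dtype(storage="pyarrow"))  # ('pyarrow', 'NA')
print(resolve_string_dtype(backend="pyarrow"))  # ('pyarrow', 'NaN')
```

The point of the design is visible in the two calls at the end: code that already passes storage keeps its NA semantics unchanged, while the NaN variants are reachable with a single keyword.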

Labels: API Design; Needs Discussion (requires discussion from core team before further action); Strings (string extension data type and string data)
