Description
Context: the future string dtype for 3.0 (currently enabled with `pd.options.future.infer_string = True`) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).
As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as `pd.StringDtype(storage="pyarrow_numpy")` because we wanted to reuse the `storage` keyword but `"pyarrow"` is already taken (by the dtype using `pd.NA` introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.
But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the `storage` keyword, but we could also add new keywords to distinguish the dtype variants.
That got me thinking and listing some possible options here:
- Add an extra keyword that distinguishes the NA sentinel (and with that implicitly the type of missing value semantics):
  - Possible names for `pd.StringDtype(storage="python"|"pyarrow", <something>)`:
    - `semantics="numpy"` (and the other would then be `"nullable"` or `"arrow"` or ..?)
    - `na_value=np.nan`
    - `na_marker=np.nan`
    - `missing=np.nan`
    - `nullable=False` (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)
  - One drawback here is that I don't think users should actually ever explicitly do `pd.StringDtype(storage="pyarrow", na_value=np.nan)`, as that is not future proof. But defaulting to `na_value=np.nan` (to avoid requiring to specify it) is then not backwards compatible with the current `pd.StringDtype(storage="pyarrow")`.
- Add a new keyword separate from `storage` to determine the storage/backend that only controls the new variants with NaN.
  - Given we are using `storage` right now, but speak about "backend" in other places, we could add for example a `backend` keyword, where `StringDtype(storage="python"|"pyarrow")` keeps resulting in the dtypes using NA (backwards compatible), while doing `StringDtype(backend="python"|"pyarrow")` gives you the new dtypes using NaN (and specifying both then obviously errors).
  - It is not great API design to have two keywords that are mutually exclusive but essentially control the same thing, but it does avoid having to specify two keywords (or having the confusing names).
  - One question is which keyword name to use. `backend` has prior use in the "dtypes_backend" terminology. Irv suggested `nature` below.
- For completeness, we can also still come up with a better `storage` name than `"pyarrow_numpy"` and stick to that single existing keyword. Suggestions from the PDEP PR:
  - `"pyarrow_nan"`
  - `"pyarrow_legacy"` (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later)
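
To make the three options above easier to compare, here is a rough pure-Python sketch of how each spelling could map to a (storage, missing value sentinel) pair. None of this is pandas API: the `option_1`/`option_2`/`option_3` functions, the `NA` stand-in for `pd.NA`, and the returned tuple are all made up for illustration, and `na_value` is just one of the candidate keyword names from the first option.

```python
import math

# Hypothetical stand-in for pd.NA, so the sketch runs without pandas installed.
NA = "<NA>"
STORAGES = ("python", "pyarrow")


def _check(storage):
    """Validate a storage/backend name and return it unchanged."""
    if storage not in STORAGES:
        raise ValueError(f"storage must be one of {STORAGES}, got {storage!r}")
    return storage


def option_1(storage, na_value=NA):
    """Extra keyword for the sentinel (na_value is one candidate name).

    Defaulting to NA keeps storage="pyarrow" backwards compatible, at the
    cost of having to spell out the NaN variant explicitly.
    """
    return _check(storage), na_value


def option_2(storage=None, backend=None):
    """Two mutually exclusive keywords: storage -> NA, backend -> NaN."""
    if (storage is None) == (backend is None):
        # The proposal only says that passing both errors; requiring exactly
        # one keyword is an assumption of this sketch.
        raise ValueError("pass exactly one of 'storage' or 'backend'")
    if backend is not None:
        return _check(backend), math.nan
    return _check(storage), NA


# Option 3: a single keyword whose value also encodes the sentinel.
OPTION_3_NAMES = {
    "python": ("python", NA),
    "pyarrow": ("pyarrow", NA),
    "pyarrow_nan": ("pyarrow", math.nan),
}


def option_3(storage):
    """Look up the (storage, sentinel) pair for a single storage name."""
    return OPTION_3_NAMES[storage]


# All three spellings select the same (storage, sentinel) pair.
variants = [
    option_1("pyarrow", na_value=math.nan),
    option_2(backend="pyarrow"),
    option_3("pyarrow_nan"),
]
for storage, sentinel in variants:
    print(storage, sentinel)
```

The sketch only covers the argument handling, so it says nothing about how the resulting dtypes would behave; it is meant to show how much each option asks the user to type and where the backwards compatibility constraint bites.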
After writing this down, I think my current preference would go to `StringDtype(backend="python"|"pyarrow")`, as that seems the simplest for most users (it's a bit confusing for those who already explicitly used `storage`, but most users have never done that)