API: interaction between pd.options.future.infer_string and default string storage (pd.options.mode.string_storage) #54793

Closed
@jorisvandenbossche

Description

From #54533 (comment)

The future default string dtype (#54792) can be enabled with pd.options.future.infer_string = True, and then pandas will use the StringDtype(storage="pyarrow_numpy") dtype in constructors and IO methods.

However, we also have an option to set the default storage for this StringDtype (pd.options.mode.string_storage), which isn't changed by setting the future option, and thus still uses its default value of "python". As a result, when someone specifies the generic "string" dtype (without explicit parametrization), we still default to this python-based string dtype.

Some examples:

>>> pd.options.future.infer_string = True
# this is still its default of "python"
>>> pd.options.mode.string_storage
'python'

# the default inference (not specifying a dtype) gives the new pyarrow-based dtype
>>> ser = pd.Series(["a", "b", None])
>>> ser
0      a
1      b
2    NaN
dtype: string
>>> ser.dtype
string[pyarrow_numpy]

# but when generically requesting a "string" dtype, we still get the python-based dtype
>>> ser = pd.Series(["a", "b", None], dtype="string")
>>> ser
0       a
1       b
2    <NA>
dtype: string
>>> ser.dtype
string[python]

The same applies when using the generic pd.StringDtype() constructor instead of the "string" alias, and in other places where you can specify the data type (e.g. .astype("string")).
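To make the point above concrete, here is a minimal sketch showing that the three generic spellings all resolve their storage from pd.options.mode.string_storage. It pins that option to "python" explicitly through pd.option_context so the result does not depend on the installed pandas version's default, and it deliberately does not touch future.infer_string (which needs pyarrow installed). The variable names are only for illustration.

```python
import pandas as pd

# Pin the storage option explicitly so the example does not rely on the
# version-specific default of mode.string_storage.
with pd.option_context("mode.string_storage", "python"):
    # 1. the generic dtype constructor
    from_constructor = pd.StringDtype()

    # 2. the "string" alias passed to a constructor
    from_alias = pd.Series(["a", "b", None], dtype="string").dtype

    # 3. the "string" alias passed to astype
    from_astype = pd.Series(["a", "b", None], dtype=object).astype("string").dtype

# All three pick up the storage from mode.string_storage.
storages = [from_constructor.storage, from_alias.storage, from_astype.storage]
print(storages)
```

None of these spellings consults future.infer_string, which is why setting only that option leaves them on the python-based dtype.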

When opting in to the future default string dtype, IMO the ideal (and expected) behaviour is that for things like dtype="string", the user also gets the pyarrow-based string dtype, without having to manually set two options (i.e. also set pd.options.mode.string_storage = "pyarrow_numpy", in addition to infer_string).

One "easy" way to change this would be to let pd.options.future.infer_string = True have a side effect of also changing the option value for string_storage. However, that might give unexpected results when, for example, using this option in a context manager (because I don't think we can reliably also restore the string_storage option to its original value when setting infer_string back to False).
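The context-manager concern can be illustrated with a toy model. The sketch below is not pandas' implementation: `options`, `set_option` and `option_context` are hypothetical stand-ins written in plain Python, assuming the side effect fires only when infer_string is switched on and that the context manager restores only the option it was asked to set.

```python
from contextlib import contextmanager

# Hypothetical stand-in for an option registry; NOT pandas' real internals.
options = {"future.infer_string": False, "mode.string_storage": "python"}


def set_option(key, value):
    """Set an option; enabling infer_string also overwrites string_storage."""
    options[key] = value
    if key == "future.infer_string" and value:
        # the proposed "easy" side effect
        options["mode.string_storage"] = "pyarrow_numpy"


@contextmanager
def option_context(key, value):
    """Temporarily set one option, restoring only that option on exit."""
    old = options[key]
    set_option(key, value)
    try:
        yield
    finally:
        set_option(key, old)


with option_context("future.infer_string", True):
    inside = options["mode.string_storage"]

# infer_string is restored, but the side effect on string_storage is not undone.
after = options["mode.string_storage"]
print(inside, after, options["future.infer_string"])
```

In this toy model the storage option stays at "pyarrow_numpy" after the block exits, even though infer_string is back to False. Undoing it would require remembering the previous storage value, and that value may have been changed independently in the meantime.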

Labels: Strings (String extension data type and string data)
