Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: interaction between pd.options.future.infer_string and default string storage (pd.options.mode.string_storage) #54793

Closed
jorisvandenbossche opened this issue Aug 28, 2023 · 2 comments · Fixed by #54794
Labels
Strings String extension data type and string data

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 28, 2023

From #54533 (comment)

The future default string dtype (#54792) can be enabled with pd.options.future.infer_string = True, and then pandas will use the StringDtype(storage="pyarrow_numpy") dtype in constructors and IO methods.

However, we also have an option to set the default storage for this StringDtype (pd.options.mode.string_storage), which isn't changed by setting the future option, and thus still uses its default value of "python". As a result, when someone specifies the generic "string" dtype (without explicit parametrization), we still default to this python-based string dtype.

Some examples:

>>> pd.options.future.infer_string = True
# this is still its default of "python"
>>> pd.options.mode.string_storage
'python'

# the default inference (not specifying a dtype) gives the new pyarrow based dtype
>>> ser = pd.Series(["a", "b", None])
>>> ser
0      a
1      b
2    NaN
dtype: string
>>> ser.dtype
string[pyarrow_numpy]

# but when specifying generically to want a "string" dtype, we still use the python based dtype
>>> ser = pd.Series(["a", "b", None], dtype="string")
>>> ser
0       a
1       b
2    <NA>
dtype: string
>>> ser.dtype
string[python]

The same applies to use the pd.StringDtype() generic dtype constructor instead of the "string" string, and in other places where you can specify the data type (eg .astype("string")).

When opting in to the future default string dtype, IMO the ideal (and expected) behaviour is that for things like dtype="string", the user also gets the pyarrow-based string dtype, without having to manually set two options (i.e. also set pd.options.mode.string_storage = "pyarrow_numpy", in addition to infer_strings).

One "easy" way to change this would be to let pd.options.future.infer_strings = True have a side effect of also changing the option value for string_storage. However, that might give unexpected results when for example using this option in a context manager (because I don't think we can reliably also restore the string_storage option to its original value, when setting infer_strings back to False).

@phofl
Copy link
Member

phofl commented Aug 28, 2023

I think I misunderstood you when we discussed this initially. This makes sense.

The safest way of doing this is when inferring the string storage in the StringDtype constructor.

One open question:

What do we want to do if the users sets pd.options.mode.string_storage explicitly? Should this take precedence?

@jorisvandenbossche
Copy link
Member Author

The safest way of doing this is when inferring the string storage in the StringDtype constructor.

Ah, yes, just checking both options there, that sounds good

What do we want to do if the users sets pd.options.mode.string_storage explicitly? Should this take precedence?

I think the most useful is to let infer_strings take precedence. I don't see a good use case for wanting another default string_storage when using infer_strings. It might only be a bit confusing for users that do that and don't see the expected effect (we could let it warn, but that means we would need to change the actual default for string_storage to None, to know if it was set by the user or not, which might be a bit complicated just for this)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants