DOC: add pandas 3.0 migration guide for the string dtype #61705

Open: wants to merge 2 commits into main

jorisvandenbossche (Member) commented:

This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).

(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)

jorisvandenbossche added this to the 2.3.1 milestone on Jun 25, 2025
jorisvandenbossche added the Docs and Strings (String extension data type and string data) labels on Jun 25, 2025
simonjayhawkins (Member) left a comment:

@jorisvandenbossche I'll post these few now rather than doing too many in a batch, but feel free to wait until I'm done, whichever is more convenient for you.


Historically, pandas has always used the NumPy ``object`` dtype as the default
to store text data. This has two primary drawbacks. First, ``object`` dtype is
not specific to strings: any Python object can be stored in an ``object```-dtype

Suggested change
not specific to strings: any Python object can be stored in an ``object```-dtype
not specific to strings: any Python object can be stored in an ``object``-dtype

not specific to strings: any Python object can be stored in an ``object```-dtype
array, not just strings, and seeing ``object`` as the dtype for a column with
strings is confusing for users. Second, this is not always very efficient (both
performance wise as for memory usage).

Suggested change
performance wise as for memory usage).
performance wise and for memory usage).

not yet been made the default, and uses the ``pd.NA`` scalar to represent
missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,

Suggested change
Pandas 3.0 changes the default dtype for strings to a new string data type,
Pandas 3.0 changes the default inferred dtype for strings to a new string data type,

missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
a variant of the existing optional string data type but using ``NaN`` as the

Suggested change
a variant of the existing optional string data type but using ``NaN`` as the
a variant of the existing optional string data type but using ``np.NaN`` as the


pd.options.future.infer_string = True

This allows to test your code before the final 3.0 release.

Suggested change
This allows to test your code before the final 3.0 release.
This allows you to test your code before the final 3.0 release.

.. - Breaking changes:
.. - dtype is no longer object dtype
.. - None gets coerced to NaN
.. - setitem raises an error for non-string data

the above is not rendered?

.. - None gets coerced to NaN
.. - setitem raises an error for non-string data

Brief intro to the new default string dtype

Suggested change
Brief intro to the new default string dtype
Brief introduction to the new default string dtype

2 NaN
dtype: str

In contrast the the current object dtype, the new string dtype will only store

Suggested change
In contrast the the current object dtype, the new string dtype will only store
In contrast to the current object dtype, the new string dtype will only store

non-string value in it (see below for more details).

Missing values with the new string dtype are always represented as ``NaN``, and
the missing value behaviour is similar as for other default dtypes.

Suggested change
the missing value behaviour is similar as for other default dtypes.
the missing value behavior is similar as for other default dtypes.

do we use US English in the docs?

non-string value in it (see below for more details).

Missing values with the new string dtype are always represented as ``NaN``, and
the missing value behaviour is similar as for other default dtypes.

Suggested change
the missing value behaviour is similar as for other default dtypes.
the missing value behaviour is similar to other default dtypes.

Missing values with the new string dtype are always represented as ``NaN``, and
the missing value behaviour is similar as for other default dtypes.

For the rest, this new string dtype should work the same as how you have been

Suggested change
For the rest, this new string dtype should work the same as how you have been
This new string dtype should work the same as how you have been


>>> ser.dtype == "str"

**How to write compatible code?**

Suggested change
**How to write compatible code?**
**How to write compatible code**

simonjayhawkins (Member) left a comment:

cool. Thanks @jorisvandenbossche

>>> pd.api.types.is_string_dtype(ser.dtype)
True

This will return ``True`` for both the object dtype as for the string dtypes.

Suggested change
This will return ``True`` for both the object dtype as for the string dtypes.
This will return ``True`` for both the object dtype and the string dtypes.
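For readers following this thread, a minimal sketch of the dtype check discussed above. The helper name `holds_text` is hypothetical and only for illustration; the sketch uses the opt-in `"string"` dtype so that it runs without enabling the new default.

```python
import pandas as pd

# The same text stored with object dtype and with the opt-in string dtype.
obj_ser = pd.Series(["a", "b", None], dtype=object)
str_ser = pd.Series(["a", "b", None], dtype="string")
num_ser = pd.Series([1, 2, 3])


def holds_text(ser: pd.Series) -> bool:
    """Hypothetical helper: check the dtype without comparing against ``object``."""
    return bool(pd.api.types.is_string_dtype(ser.dtype))


print(holds_text(obj_ser))  # object dtype counts as a string dtype here
print(holds_text(str_ser))  # so does the dedicated string dtype
print(holds_text(num_ser))  # a numeric dtype does not
```

Passing `ser.dtype` rather than the Series keeps the check purely dtype-based, which is what makes it return the same answer for object and string columns.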

True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of boolean dtype. When using it in a boolean

Suggested change
latter case it will return an array of boolean dtype. When using it in a boolean
latter case it will return an array of Boolean dtype. When using it in a Boolean

To avoid confusion with the pandas nullable boolean type, should this be capitalized, since it is named after George Boole?
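A small sketch of the scalar versus array-like caveat in the quoted paragraph. The excerpt does not name the function, so this assumes it refers to `pd.isna`; the helper `is_missing_scalar` is hypothetical.

```python
import pandas as pd

ser = pd.Series(["a", None], dtype=object)

# On a scalar, pd.isna returns a single bool.
scalar_result = pd.isna(ser[1])

# On an array-like, it returns a boolean Series of the same length.
array_result = pd.isna(ser)


def is_missing_scalar(value) -> bool:
    """Hypothetical guard: only treat scalars as candidates for 'missing'."""
    return pd.api.types.is_scalar(value) and bool(pd.isna(value))


# Using the array result directly in an ``if`` raises, which is the caveat.
try:
    if array_result:
        pass
except ValueError:
    ambiguous = True
else:
    ambiguous = False
```

The guard is one way to keep a check safe in an `if` statement when the input may be either a scalar or a whole column.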

.. code-block:: python

>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser[1] = 2.5

I notice you can do `ser[1] = pd.NA`, so we are accepting this as a missing value. Should we disallow this, or instead encourage it, to make migration to the `pd.NA` variant simpler?

**How to write compatible code?**

You can update your code to ensure you only set string values in such columns,
or otherwise you have explicitly ensure the column has object dtype first. This

Suggested change
or otherwise you have explicitly ensure the column has object dtype first. This
or otherwise you can explicitly ensure the column has object dtype first. This

>>> ser[1] = 2.5

This ``astype("object")`` call will be redundant when using pandas 2.x, but
this way such code can work for all versions.

Suggested change
this way such code can work for all versions.
this code will work for all versions.
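For context, a minimal sketch of the version-independent pattern the quoted paragraph describes: cast to object dtype first, then set the non-string value. It assumes a default integer index so that `ser[1]` addresses the second element.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Redundant where strings are still inferred as object dtype, but needed
# once the new string dtype is the inferred default.
ser = ser.astype("object")

# With object dtype, assigning a non-string value is allowed.
ser[1] = 2.5

print(ser.dtype)
print(ser.tolist()[:2])
```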
