DOC: add pandas 3.0 migration guide for the string dtype #61705

Open

jorisvandenbossche wants to merge 2 commits into pandas-dev:main from jorisvandenbossche:string-dtype-doc-migration-guide

+273 −0

Member

jorisvandenbossche commented Jun 25, 2025

This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).

(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)


          DOC: add pandas 3.0 migration guide for the string dtype

975dea1

jorisvandenbossche added this to the 2.3.1 milestone

jorisvandenbossche added Docs Strings labels


          fixup title underline

db42937

simonjayhawkins reviewed

Member

simonjayhawkins left a comment

@jorisvandenbossche I'll post these few now rather than doing too many in a batch, but feel free to wait until I'm done, whichever is more convenient for you.

doc/source/user_guide/migration-3-strings.rst

+              Historically, pandas has always used the NumPy ``object`` dtype as the default
+              to store text data. This has two primary drawbacks. First, ``object`` dtype is
+              not specific to strings: any Python object can be stored in an ``object```-dtype

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            not specific to strings: any Python object can be stored in an ``object```-dtype
          
            not specific to strings: any Python object can be stored in an ``object``-dtype

doc/source/user_guide/migration-3-strings.rst

+              not specific to strings: any Python object can be stored in an ``object```-dtype
+              array, not just strings, and seeing ``object`` as the dtype for a column with
+              strings is confusing for users. Second, this is not always very efficient (both
+              performance wise as for memory usage).

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            performance wise as for memory usage).
          
            performance wise and for memory usage).

doc/source/user_guide/migration-3-strings.rst

+              not yet been made the default, and uses the ``pd.NA`` scalar to represent
+              missing values.
+              Pandas 3.0 changes the default dtype for strings to a new string data type,

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            Pandas 3.0 changes the default dtype for strings to a new string data type,
          
            Pandas 3.0 changes the default inferred dtype for strings to a new string data type,

doc/source/user_guide/migration-3-strings.rst

+              missing values.
+              Pandas 3.0 changes the default dtype for strings to a new string data type,
+              a variant of the existing optional string data type but using ``NaN`` as the

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            a variant of the existing optional string data type but using ``NaN`` as the
          
            a variant of the existing optional string data type but using ``np.nan`` as the

simonjayhawkins reviewed

doc/source/user_guide/migration-3-strings.rst


+                 pd.options.future.infer_string = True
+              This allows to test your code before the final 3.0 release.

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            This allows to test your code before the final 3.0 release.
          
            This allows you to test your code before the final 3.0 release.

doc/source/user_guide/migration-3-strings.rst

+              .. - Breaking changes:
+              ..    - dtype is no longer object dtype
+              ..    - None gets coerced to NaN
+              ..    - setitem raises an error for non-string data

Member

simonjayhawkins Jun 25, 2025

the above is not rendered?

doc/source/user_guide/migration-3-strings.rst

+              ..    - None gets coerced to NaN
+              ..    - setitem raises an error for non-string data
+              Brief intro to the new default string dtype

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            Brief intro to the new default string dtype
          
            Brief introduction to the new default string dtype

doc/source/user_guide/migration-3-strings.rst

+NaN
+                 dtype: str
+              In contrast the the current object dtype, the new string dtype will only store

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            In contrast the the current object dtype, the new string dtype will only store
          
            In contrast to the current object dtype, the new string dtype will only store

simonjayhawkins reviewed

doc/source/user_guide/migration-3-strings.rst

+              non-string value in it (see below for more details).
+              Missing values with the new string dtype are always represented as ``NaN``, and
+              the missing value behaviour is similar as for other default dtypes.

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            the missing value behaviour is similar as for other default dtypes.
          
            the missing value behavior is similar as for other default dtypes.

do we use US English in the docs?

doc/source/user_guide/migration-3-strings.rst

+              non-string value in it (see below for more details).
+              Missing values with the new string dtype are always represented as ``NaN``, and
+              the missing value behaviour is similar as for other default dtypes.

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            the missing value behaviour is similar as for other default dtypes.
          
            the missing value behaviour is similar to other default dtypes.

doc/source/user_guide/migration-3-strings.rst

+              Missing values with the new string dtype are always represented as ``NaN``, and
+              the missing value behaviour is similar as for other default dtypes.
+              For the rest, this new string dtype should work the same as how you have been

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            For the rest, this new string dtype should work the same as how you have been
          
            This new string dtype should work the same as how you have been

doc/source/user_guide/migration-3-strings.rst


+                 >>> ser.dtype == "str"
+              **How to write compatible code?**

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            **How to write compatible code?**
          
            **How to write compatible code**

simonjayhawkins reviewed

Member

simonjayhawkins left a comment

cool. Thanks @jorisvandenbossche

doc/source/user_guide/migration-3-strings.rst

+                 >>> pd.api.types.is_string_dtype(ser.dtype)
+                 True
+              This will return ``True`` for both the object dtype as for the string dtypes.

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            This will return ``True`` for both the object dtype as for the string dtypes.
          
            This will return ``True`` for both the object dtype and the string dtypes.
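
A short sketch of the version-agnostic check discussed in this hunk. The Series is built without an explicit dtype, so it is object dtype on pandas 2.x and the new string dtype on pandas 3.0; the variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Passing the dtype (not the Series) keeps the check purely dtype-based:
# object dtype on pandas 2.x and the str dtype on pandas 3.0 both pass.
is_text = pd.api.types.is_string_dtype(ser.dtype)
print(ser.dtype, is_text)
```

Because object dtype also passes, this check cannot tell a column of strings apart from a column of arbitrary Python objects.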

doc/source/user_guide/migration-3-strings.rst

+                 True
+              One caveat: this function works both on scalars and on array-likes, and in the
+              latter case it will return an array of boolean dtype. When using it in a boolean

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            latter case it will return an array of boolean dtype. When using it in a boolean
          
            latter case it will return an array of Boolean dtype. When using it in a Boolean

So as not to confuse this with the pandas nullable boolean type, should we capitalize it, since it is named after George Boole?
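
The hunk shown here does not name the function. Assuming it is `pd.isna` (an assumption, based on the scalar and array-like behavior described), the caveat can be sketched as follows:

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Scalar input gives a single True/False.
scalar_missing = pd.isna(ser[2])

# Array-like input gives an element-wise result of bool dtype, so reduce
# it explicitly (any/all) before using it in an `if` statement.
mask = pd.isna(ser)
has_missing = bool(mask.any())

if has_missing:
    print("series contains missing values")
```

Using `mask` directly in an `if` would raise because the truth value of a multi-element Series is ambiguous.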

doc/source/user_guide/migration-3-strings.rst

+              .. code-block:: python
+                 >>> ser = pd.Series(["a", "b", None], dtype="str")
+                 >>> ser[1] = 2.5

Member

simonjayhawkins Jun 25, 2025

I notice you can do ``ser[1] = pd.NA``, so we are accepting this as a missing value. Should we disallow this, or encourage it instead, to make a later migration to the ``pd.NA`` variant simpler?
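
A small sketch of the behavior described in this comment. It only checks that the element reads back as missing through `pd.isna` and makes no claim about which sentinel the column stores internally.

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="str")

# Assigning pd.NA is accepted as a missing value; whichever sentinel the
# column stores internally, pd.isna reports the element as missing.
ser[1] = pd.NA
now_missing = pd.isna(ser[1])
print(now_missing)
```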

doc/source/user_guide/migration-3-strings.rst

+              **How to write compatible code?**
+              You can update your code to ensure you only set string values in such columns,
+              or otherwise you have explicitly ensure the column has object dtype first. This

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            or otherwise you have explicitly ensure the column has object dtype first. This
          
            or otherwise you can explicitly ensure the column has object dtype first. This

doc/source/user_guide/migration-3-strings.rst

+                 >>> ser[1] = 2.5
+              This ``astype("object")`` call will be redundant when using pandas 2.x, but
+              this way such code can work for all versions.

Member

simonjayhawkins Jun 25, 2025

Suggested change

      
            this way such code can work for all versions.
          
            this code will work for all versions.
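
The compatibility pattern under discussion, as a self-contained sketch. The data values mirror the hunk above; the explicit cast is the only addition.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Redundant on pandas 2.x (already object dtype), needed on pandas 3.0
# where the inferred str dtype rejects non-string values.
ser = ser.astype("object")
ser[1] = 2.5
print(ser.dtype)
```

The trade-off is that the column stays object dtype on every version, so it does not benefit from the new string dtype.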

simonjayhawkins mentioned this pull request

WEB: add note to PDEP-10 about delayed timeline for requiring pyarrow #61706

Open
