DOC: add pandas 3.0 migration guide for the string dtype #61705
base: main
@jorisvandenbossche I'll post these few now rather than doing too many in a batch, but feel free to wait until I'm done, whichever is more convenient for you.
|
||
Historically, pandas has always used the NumPy ``object`` dtype as the default | ||
to store text data. This has two primary drawbacks. First, ``object`` dtype is | ||
not specific to strings: any Python object can be stored in an ``object```-dtype |
not specific to strings: any Python object can be stored in an ``object```-dtype | |
not specific to strings: any Python object can be stored in an ``object``-dtype |
not specific to strings: any Python object can be stored in an ``object```-dtype | ||
array, not just strings, and seeing ``object`` as the dtype for a column with | ||
strings is confusing for users. Second, this is not always very efficient (both | ||
performance wise as for memory usage). |
performance wise as for memory usage). | |
performance wise and for memory usage). |
not yet been made the default, and uses the ``pd.NA`` scalar to represent | ||
missing values. | ||
|
||
Pandas 3.0 changes the default dtype for strings to a new string data type, |
Pandas 3.0 changes the default dtype for strings to a new string data type, | |
Pandas 3.0 changes the default inferred dtype for strings to a new string data type, |
missing values. | ||
|
||
Pandas 3.0 changes the default dtype for strings to a new string data type, | ||
a variant of the existing optional string data type but using ``NaN`` as the |
a variant of the existing optional string data type but using ``NaN`` as the | |
a variant of the existing optional string data type but using ``np.NaN`` as the |
|
||
pd.options.future.infer_string = True | ||
|
||
This allows to test your code before the final 3.0 release. |
This allows to test your code before the final 3.0 release. | |
This allows you to test your code before the final 3.0 release. |
.. - Breaking changes: | ||
.. - dtype is no longer object dtype | ||
.. - None gets coerced to NaN | ||
.. - setitem raises an error for non-string data |
the above is not rendered?
.. - None gets coerced to NaN | ||
.. - setitem raises an error for non-string data | ||
|
||
Brief intro to the new default string dtype |
Brief intro to the new default string dtype | |
Brief introduction to the new default string dtype |
2 NaN | ||
dtype: str | ||
|
||
In contrast the the current object dtype, the new string dtype will only store |
In contrast the the current object dtype, the new string dtype will only store | |
In contrast to the current object dtype, the new string dtype will only store |
non-string value in it (see below for more details). | ||
|
||
Missing values with the new string dtype are always represented as ``NaN``, and | ||
the missing value behaviour is similar as for other default dtypes. |
the missing value behaviour is similar as for other default dtypes. | |
the missing value behavior is similar as for other default dtypes. |
Do we use US English in the docs?
non-string value in it (see below for more details). | ||
|
||
Missing values with the new string dtype are always represented as ``NaN``, and | ||
the missing value behaviour is similar as for other default dtypes. |
the missing value behaviour is similar as for other default dtypes. | |
the missing value behaviour is similar to other default dtypes. |
Missing values with the new string dtype are always represented as ``NaN``, and | ||
the missing value behaviour is similar as for other default dtypes. | ||
|
||
For the rest, this new string dtype should work the same as how you have been |
For the rest, this new string dtype should work the same as how you have been | |
This new string dtype should work the same as how you have been |
|
||
>>> ser.dtype == "str" | ||
|
||
**How to write compatible code?** |
**How to write compatible code?** | |
**How to write compatible code** |
cool. Thanks @jorisvandenbossche
>>> pd.api.types.is_string_dtype(ser.dtype) | ||
True | ||
|
||
This will return ``True`` for both the object dtype as for the string dtypes. |
This will return ``True`` for both the object dtype as for the string dtypes. | |
This will return ``True`` for both the object dtype and the string dtypes. |
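As a rough illustration of the dtype check quoted above, here is a minimal sketch. It relies only on `pd.api.types.is_string_dtype` as shown in the diff; the helper name `holds_strings` is hypothetical and not part of the guide.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Hypothetical helper: True for object dtype and for the string dtypes.

    Passing ``ser.dtype`` (rather than the Series itself) means only the
    dtype is inspected, not the individual values.
    """
    return bool(pd.api.types.is_string_dtype(ser.dtype))


text = pd.Series(["a", "b", None])  # object dtype on pandas 2.x, str dtype on 3.0
numbers = pd.Series([1, 2, 3])      # integer dtype

print(holds_strings(text))     # True
print(holds_strings(numbers))  # False
```

Because the check accepts both the object dtype and the string dtypes, the same code path should behave identically before and after the upgrade.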
True | ||
|
||
One caveat: this function works both on scalars and on array-likes, and in the | ||
latter case it will return an array of boolean dtype. When using it in a boolean |
latter case it will return an array of boolean dtype. When using it in a boolean | |
latter case it will return an array of Boolean dtype. When using it in a Boolean |
To avoid confusion with the pandas nullable boolean type, should we capitalize this, since it is named after George Boole?
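The scalar versus array-like caveat in the quoted paragraph can be sketched as follows. The excerpt does not name the function, so this assumes it is `pd.isna`; the helper `is_missing_scalar` is hypothetical and only illustrates one way to guard a Boolean context.

```python
import pandas as pd

# Scalar input: a single bool, safe to use directly in an ``if``.
print(pd.isna(None))  # True
print(pd.isna("a"))   # False

# Array-like input: an elementwise result, not a single bool.
mask = pd.isna(pd.Series(["a", None]))
print(mask.tolist())  # [False, True]


def is_missing_scalar(value) -> bool:
    """Hypothetical helper: only evaluate pd.isna in a Boolean context for scalars."""
    return pd.api.types.is_scalar(value) and bool(pd.isna(value))


print(is_missing_scalar(None))         # True
print(is_missing_scalar(["a", None]))  # False (not a scalar, so never tested)
```

Guarding with `pd.api.types.is_scalar` avoids the ambiguous-truth-value error that an array result would raise inside an `if`.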
.. code-block:: python | ||
|
||
>>> ser = pd.Series(["a", "b", None], dtype="str") | ||
>>> ser[1] = 2.5 |
I notice you can do `ser[1] = pd.NA`, so we are accepting this as a missing value. Should we disallow this, or encourage it instead, to make migration to the `pd.NA` variant simpler?
**How to write compatible code?** | ||
|
||
You can update your code to ensure you only set string values in such columns, | ||
or otherwise you have explicitly ensure the column has object dtype first. This |
or otherwise you have explicitly ensure the column has object dtype first. This | |
or otherwise you can explicitly ensure the column has object dtype first. This |
>>> ser[1] = 2.5 | ||
|
||
This ``astype("object")`` call will be redundant when using pandas 2.x, but | ||
this way such code can work for all versions. |
this way such code can work for all versions. | |
this code will work for all versions. |
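For reference, the `astype("object")` approach described in the quoted text could look like the sketch below. It is meant as an illustration under the assumption that the Series is built with default dtype inference, not as the exact example from the guide.

```python
import pandas as pd

ser = pd.Series(["a", "b", None])

# Cast to object dtype before storing a non-string value.
# On pandas 2.x this cast is redundant (the Series is already object dtype);
# with the new string dtype it avoids the error raised on setitem.
ser = ser.astype("object")
ser[1] = 2.5

print(ser.dtype)  # object
print(ser[1])     # 2.5
```

The explicit cast costs little on older versions and keeps the assignment valid on every version.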
This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).
(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)