PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged
28 commits
fbeb69d
PDEP: Dedicated string data type for pandas 3.0
jorisvandenbossche May 3, 2024
f03f54d
small textual edits and typos
jorisvandenbossche May 3, 2024
561de87
address part of the feedback
jorisvandenbossche May 5, 2024
86f4e51
Update web/pandas/pdeps/00xx-string-dtype.md
jorisvandenbossche May 5, 2024
30c7b43
rename file
jorisvandenbossche May 13, 2024
54a43b3
expand Missing value semantics section
jorisvandenbossche May 13, 2024
5b5835b
expand Naming subsection with storage+na_value proposal
jorisvandenbossche May 13, 2024
9ede2e6
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche May 13, 2024
f5faf4e
update timeline
jorisvandenbossche May 13, 2024
f554909
Apply suggestions from code review
jorisvandenbossche May 13, 2024
ac2d21a
Apply suggestions from code review
jorisvandenbossche May 13, 2024
82027d2
reflow after online edits
jorisvandenbossche May 13, 2024
5b24c24
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche May 13, 2024
f9c55f4
Apply suggestions from code review
jorisvandenbossche May 13, 2024
2c58c4c
Fixup table (#2)
rhshadrach May 14, 2024
0a68504
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche May 20, 2024
8974c5b
next round of updates (small text updates, add capitalized String alias)
jorisvandenbossche May 20, 2024
cca3a7f
use capitalized alias in the overview table
jorisvandenbossche May 20, 2024
d24a80a
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 10, 2024
9c5342a
New revision: keep back compat for 'string', introduce 'str' for the …
jorisvandenbossche Jun 10, 2024
b5663cc
Apply suggestions from code review
jorisvandenbossche Jun 11, 2024
1c4c2d9
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche Jun 12, 2024
c44bfb5
rephrase main points in proposal
jorisvandenbossche Jun 12, 2024
af5ad3c
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 14, 2024
bd52f39
tiny edit
jorisvandenbossche Jun 14, 2024
f8fbc61
mismatched quote
jorisvandenbossche Jun 14, 2024
d78462d
Update 0014-string-dtype.md
phofl Jul 22, 2024
4de20d1
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jul 24, 2024
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche committed May 13, 2024
commit 9ede2e64616ddcc3a4c4b6a74b932675b0b95d03
76 changes: 62 additions & 14 deletions web/pandas/pdeps/0014-string-dtype.md
@@ -244,9 +244,10 @@ sufficient (they don't need to specify the storage), and the explicit

To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
could also delay introducing a default string dtype until there is more clarity
in those other discussions.
the default missing value sentinel? using the new NumPy 2.0 capabilities?
overhauling all our dtypes to use a logical data type system?), we could also
delay introducing a default string dtype until there is more clarity in those
other discussions.

However:

@@ -258,6 +259,11 @@ However:
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

Making this change now for 3.0 will benefit the majority of our users, while
coming at a cost for the part of our users who already started using the
`"string"` dtype (they will have to update their code to keep using the variant
with `pd.NA`, see the "Backward compatibility" section below).

### Why not use the existing StringDtype with `pd.NA`?

Wouldn't adding even more variants of the string dtype make things only more
@@ -294,22 +300,64 @@ discussion.

The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated.
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
change is done as a hard break in a major release, as warning in advance for the
changed inference is deemed too noisy.

To allow testing your code in advance, the
`pd.options.future.infer_string = True` option is available.
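A minimal sketch of how a check like `ser.dtype == object` might be rewritten so that it gives the same answer before and after the change. The helper name `holds_strings` is hypothetical; `pd.api.types.is_string_dtype` is an existing pandas function, and using it here is one possible approach rather than something this proposal prescribes.

```python
import pandas as pd


def holds_strings(ser: pd.Series) -> bool:
    """Return True if `ser` holds string data (hypothetical helper).

    Avoids `ser.dtype == object`, which stops matching once string
    columns are inferred as a dedicated string dtype.
    """
    return pd.api.types.is_string_dtype(ser)


strings = pd.Series(["a", "b", "c"])
numbers = pd.Series([1, 2, 3])

# The printed dtype depends on the pandas version and on the
# `future.infer_string` option; the helper's answer should not.
print(strings.dtype)
print(holds_strings(strings), holds_strings(numbers))
```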

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should all keep working as is. By preserving the current missing value
semantics, this proposal is also backwards compatible on this aspect.

One other backwards incompatible change is present for early adopters of the
existing `StringDtype`. In pandas 3.0, calling `pd.StringDtype()` will start
returning the new default string dtype, while up to now this returned the
experimental string dtype using `pd.NA` introduced in pandas 1.0. Those users
will need to start specifying a keyword in the dtype constructor if they want to
keep using `pd.NA` (but if they just want to have a dedicated string dtype, they
don't need to change their code).
methods) should generally all keep working as is. By preserving the current
missing value semantics, this proposal is also backwards compatible on this
aspect.

### For existing users of `StringDtype`

Users of the existing `StringDtype` will see more backwards incompatible
changes, though. In pandas 3.0, calling `pd.StringDtype()` (or specifying
`dtype="string"`) will start returning the new default string dtype using `NaN`,
while up to now this returned the string dtype using `pd.NA` introduced in
pandas 1.0.

For example, this code snippet returned the NA-variant of `StringDtype` with
pandas 1.x and 2.x:

```python
>>> pd.Series(["a", "b", None], dtype="string")
0 a
1 b
2 <NA>
dtype: string
```

but will start returning the new default NaN-variant of `StringDtype` with
pandas 3.0. This means that the missing value sentinel will change from `pd.NA`
to `NaN`, and that operations will no longer return nullable dtypes but default
numpy dtypes (see the "Missing value semantics" section above).

While this change will be transparent in many cases (e.g. checking for missing
values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
a string predicate method keeps working regardless of the sentinel), this can be
a breaking change if you relied on the exact sentinel or resulting dtype. Since
pandas 1.0, the string dtype has been promoted quite a bit, so we expect
that many users have already started using this dtype, even though it is
officially still labeled as "experimental".
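As an illustration of the distinction above, the following sketch contrasts checks that do not depend on the sentinel with ones that do. The variable names are illustrative, and what the last two lines print depends on the pandas version and the dtype variant in use.

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="string")

# Sentinel-agnostic: isna() reports the missing entry for both variants.
missing = ser.isna()
print(missing.tolist())

# Filtering with a string predicate also works for both variants;
# the row with the missing value is left out of the result.
starts_with_a = ser[ser.str.startswith("a")]
print(starts_with_a.tolist())

# Sentinel-specific: an identity check against pd.NA only holds for the
# NA-variant, so code like this can break after the change.
print(ser.iloc[2] is pd.NA)

# The result dtype of operations differs as well (nullable vs. numpy).
print(ser.str.len().dtype)
```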

To smooth the upgrade experience for those users, we propose to add a
deprecation warning before 3.0 when such a dtype is created, giving them two
options:

- If the user just wants to have a dedicated "string" dtype (or the better
performance when using pyarrow) but is fine with using the default NaN
semantics, they can add `pd.options.future.infer_string = True` to their code
to suppress the warning and already opt-in to the future behaviour of pandas
3.0.
- If the user specifically wants the variant of the string dtype that uses
`pd.NA` (and returns nullable numeric/boolean dtypes in operations), they will
have to update their dtype specification from `"string"` / `pd.StringDtype()`
to `pd.StringDtype(na_value=pd.NA)` to suppress the warning and further keep
their code running as is.
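
The second option could look roughly like the sketch below. The `na_value` keyword is the one proposed in this document; because it is not available in pandas versions that predate this proposal, the hypothetical helper `na_string_dtype` falls back to plain `pd.StringDtype()`, which is already the `pd.NA` variant in those versions.

```python
import pandas as pd


def na_string_dtype() -> pd.StringDtype:
    """Return the pd.NA-backed string dtype (hypothetical helper)."""
    try:
        # Spelling proposed in this document.
        return pd.StringDtype(na_value=pd.NA)
    except TypeError:
        # Assumption: older pandas without the `na_value` keyword, where
        # pd.StringDtype() is the pd.NA variant already.
        return pd.StringDtype()


ser = pd.Series(["a", "bb", None], dtype=na_string_dtype())

# The missing value is pd.NA and operations return nullable dtypes.
print(ser.iloc[2] is pd.NA)
print(ser.str.len().dtype)
```

Centralizing the dtype in one helper keeps the version-specific spelling in a single place, so the rest of the code does not need to change when upgrading.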

## Timeline
