PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged
fbeb69d
PDEP: Dedicated string data type for pandas 3.0
jorisvandenbossche May 3, 2024
f03f54d
small textual edits and typos
jorisvandenbossche May 3, 2024
561de87
address part of the feedback
jorisvandenbossche May 5, 2024
86f4e51
Update web/pandas/pdeps/00xx-string-dtype.md
jorisvandenbossche May 5, 2024
30c7b43
rename file
jorisvandenbossche May 13, 2024
54a43b3
expand Missing value semantics section
jorisvandenbossche May 13, 2024
5b5835b
expand Naming subsection with storage+na_value proposal
jorisvandenbossche May 13, 2024
9ede2e6
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche May 13, 2024
f5faf4e
update timeline
jorisvandenbossche May 13, 2024
f554909
Apply suggestions from code review
jorisvandenbossche May 13, 2024
ac2d21a
Apply suggestions from code review
jorisvandenbossche May 13, 2024
82027d2
reflow after online edits
jorisvandenbossche May 13, 2024
5b24c24
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche May 13, 2024
f9c55f4
Apply suggestions from code review
jorisvandenbossche May 13, 2024
2c58c4c
Fixup table (#2)
rhshadrach May 14, 2024
0a68504
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche May 20, 2024
8974c5b
next round of updates (small text updates, add capitalized String alias)
jorisvandenbossche May 20, 2024
cca3a7f
use capitalized alias in the overview table
jorisvandenbossche May 20, 2024
d24a80a
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 10, 2024
9c5342a
New revision: keep back compat for 'string', introduce 'str' for the …
jorisvandenbossche Jun 10, 2024
b5663cc
Apply suggestions from code review
jorisvandenbossche Jun 11, 2024
1c4c2d9
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche Jun 12, 2024
c44bfb5
rephrase main points in proposal
jorisvandenbossche Jun 12, 2024
af5ad3c
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 14, 2024
bd52f39
tiny edit
jorisvandenbossche Jun 14, 2024
f8fbc61
mismatched quote
jorisvandenbossche Jun 14, 2024
d78462d
Update 0014-string-dtype.md
phofl Jul 22, 2024
4de20d1
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jul 24, 2024
373 changes: 373 additions & 0 deletions web/pandas/pdeps/0014-string-dtype.md
@@ -0,0 +1,373 @@
# PDEP-14: Dedicated string data type for pandas 3.0

- Created: May 3, 2024
- Status: Under discussion
- Discussion: https://github.com/pandas-dev/pandas/pull/58551
- Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by
default in pandas 3.0:

* In pandas 3.0, enable a string dtype (`"str"`) by default, using PyArrow if available
or otherwise a string dtype using numpy object-dtype under the hood as fallback.
* The default string dtype will use missing value semantics (using NaN) consistent
with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0 strings, etc).

## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second: this is not efficient (all string
methods on a Series are eventually calling Python methods on the individual
string objects).

To solve the first issue, a dedicated extension dtype for string data has
already been
[added in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#dedicated-string-data-type).
So far this has been opt-in, requiring users to explicitly request the
dtype (with `dtype="string"` or `dtype=pd.StringDtype()`). The array backing
this string dtype was initially almost the same as the default implementation,
i.e. an `object`-dtype NumPy array of Python strings.

To solve the second issue (performance), pandas contributed to the development
of string kernels in the PyArrow package, and a variant of the string dtype
backed by PyArrow was
[added in pandas 1.3](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type).
This could be specified with the `storage` keyword in the opt-in string dtype
(`pd.StringDtype(storage="pyarrow")`).

Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
However, up to this date, pandas has not yet taken the step to use `pd.NA` by
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared

Contributor

Suggested change
default for any dtype, and thus the `StringDtype` deviates in missing value behaviour compared
default for all dtypes, and thus the `StringDtype` deviates in missing value behaviour compared

to the default data types.

In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
(i.e. infer this type for string data instead of object dtype). To ensure we
could use the variant of `StringDtype` backed by PyArrow instead of Python
objects (for better performance), it proposed to make `pyarrow` a new required
runtime dependency of pandas.

In the meantime, NumPy has also been working on a native variable-width string
data type, which will be available [starting with NumPy
2.0](https://numpy.org/devdocs/release/2.0.0-notes.html#stringdtype-has-been-added-to-numpy).
This can provide a potential alternative to PyArrow for implementing a string
data type in pandas that is not backed by Python objects.

After acceptance of PDEP-10, two aspects of the proposal have been under
reconsideration:

- Based on feedback from users and maintainers from other packages (mostly
around installation complexity and size), it has been considered to relax the
new `pyarrow` requirement to not be a _hard_ runtime dependency. In addition,
NumPy 2.0 could in the future potentially reduce the need to make PyArrow a
required dependency specifically for a dedicated pandas string dtype.
- PDEP-10 did not consider the usage of the experimental `pd.NA` as a
consequence of adopting one of the existing implementations of the
`StringDtype`.

For the second aspect, another variant of the `StringDtype` was
[introduced in pandas 2.1](https://pandas.pydata.org/docs/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings)
that is still backed by PyArrow but follows the default missing values semantics
pandas uses for all other default data types (and using `NaN` as the missing
value sentinel) ([GH-54792](https://github.com/pandas-dev/pandas/issues/54792)).
At the time, the `storage` option for this new variant was called
`"pyarrow_numpy"` to disambiguate from the existing `"pyarrow"` option using
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
subsection below).

This last dtype variant is what users currently (pandas 2.2) get for string data
when enabling the ``future.infer_string`` option (to enable the behaviour which
is intended to become the default in pandas 3.0).

## Proposal

To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

1. For pandas 3.0, a `"str"` string dtype is enabled by default, which will use PyArrow
if installed, and otherwise falls back to an in-house functionally-equivalent
(but slower) version.
2. This default string dtype will follow the same behaviour for missing values
as other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse (with minor code
additions) the existing numpy object-dtype backed StringArray for its
implementation.
4. Installation guidelines are updated to clearly encourage users to install
pyarrow for the default user experience.

Those string dtypes enabled by default will then no longer be considered as
experimental.

### Default inference of a string dtype

By default, pandas will infer this new string dtype instead of object dtype for
string data (when creating pandas objects, such as in constructors or IO
functions).

In pandas 2.2, the existing `future.infer_string` option can be used to opt-in to the future
default behaviour:

```python
>>> import pandas as pd
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string
```

Right now (pandas 2.2), the existing option only enables the PyArrow-based
future dtype. For the remaining 2.x releases, this option will be expanded to
also work when PyArrow is not installed to enable the object-dtype fallback in
that case.

### Missing value semantics

As mentioned in the background section, the original `StringDtype` has always
used the experimental `pd.NA` sentinel for missing values. In addition to using
`pd.NA` as the scalar for a missing value, this essentially means that:

- String columns follow ["NA-semantics"](https://pandas.pydata.org/docs/user_guide/missing_data.html#na-semantics)
for missing values, where `NA` propagates in boolean operations such as
comparisons or predicates.
- Operations on the string column that give a numeric or boolean result use the
nullable Integer/Float/Boolean data types (e.g. `ser.str.len()` returns the
nullable `"Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
dtype (or `float64` in case of missing values)).

However, up to this date, all other default data types still use `NaN` semantics
for missing values. Therefore, this proposal says that a new default string
dtype should also still use the same default missing value semantics and return
default data types when doing operations on the string column, to be consistent
with the other default dtypes at this point.

In practice, this means that the default string dtype will use `NaN` as
the missing value sentinel, and:

- String columns will follow NaN-semantics for missing values, where `NaN` gives
False in boolean operations such as comparisons or predicates.
- Operations on the string column that give a numeric or boolean result will use
the default data types (i.e. numpy `int64`/`float64`/`bool`).

Because the original `StringDtype` implementations already use `pd.NA` and
return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and default data types was needed. The original
variant of `StringDtype` using `pd.NA` will continue to be available for those
who were already using it.
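
The difference between the two sets of semantics can be observed with dtypes that already exist. The sketch below is only an illustration: it uses the opt-in `dtype="string"` variant for the NA-semantics and an explicit `object` dtype as a stand-in for NaN-style behaviour (an assumption made for the example; the new default dtype itself is not used here).

```python
import pandas as pd

# Opt-in StringDtype using pd.NA (available since pandas 1.0).
na_ser = pd.Series(["a", None], dtype="string")
# Object dtype is used here only as a stand-in for NaN-style semantics.
obj_ser = pd.Series(["a", None], dtype=object)

# Numeric results: nullable Int64 versus a default numpy dtype.
na_len = na_ser.str.len()
obj_len = obj_ser.str.len()
print(na_len.dtype, obj_len.dtype)

# Comparisons: the missing value propagates as <NA> versus giving False.
na_eq = na_ser == "a"
obj_eq = obj_ser == "a"
print(na_eq.dtype, obj_eq.tolist())
```

With the `pd.NA` variant the comparison result keeps a missing entry, whereas the NaN-style comparison yields a plain boolean result in which the missing entry is `False`.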

### Object-dtype "fallback" implementation

To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
a "fallback" option in case PyArrow is not installed. The original `StringDtype`
backed by a numpy object-dtype array of Python strings can be mostly reused for
this (adding a new variant of the dtype) and a new `StringArray` subclass only
needs minor changes to follow the above-mentioned missing value semantics
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

For pandas 3.0, this is the most realistic option given this implementation has
already been available for a long time. Beyond 3.0, further improvements such as
using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)) can
still be explored, but at that point that is an implementation detail that
should not have a direct impact on users (except for performance).

For the original variant of `StringDtype` using `pd.NA`, currently the default
storage is `"python"` (the object-dtype based implementation). Also for this
variant, it is proposed to follow the same logic for determining the default
storage, i.e. default to `"pyarrow"` if available, and otherwise
fall back to `"python"`.
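
The "use PyArrow if installed, otherwise fall back" rule can be sketched in plain Python. This is not the pandas implementation; `default_string_storage` is a hypothetical helper written for illustration, and it only checks whether the `pyarrow` package can be found.

```python
from importlib.util import find_spec


def default_string_storage(pyarrow_available=None):
    """Return the storage a default string dtype would pick (illustrative only)."""
    if pyarrow_available is None:
        # find_spec returns None when the package cannot be located.
        pyarrow_available = find_spec("pyarrow") is not None
    return "pyarrow" if pyarrow_available else "python"


# Result depends on the current environment.
print(default_string_storage())
# Simulate an environment without PyArrow.
print(default_string_storage(pyarrow_available=False))
```

The optional argument exists only so that both branches can be exercised without installing or uninstalling PyArrow.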

### Naming

Given the long history of this topic, the naming of the dtypes is a difficult
topic.

In the first place, it should be acknowledged that most users should not need to
use storage-specific options. Users are expected to specify a generic name (such
as `"str"` or `"string"`), and that will give them their default string dtype
(which depends on whether PyArrow is installed or not).

For the generic string alias to specify the dtype, `"string"` is already used
for the `StringDtype` using `pd.NA`. This PDEP proposes to use `"str"` for the
new default `StringDtype` using `NaN`. This ensures backwards compatibility for
code using `dtype="string"`, and was also chosen because `dtype="str"` or
`dtype=str` currently already works to ensure your data is converted to
strings (only using object dtype for the result).

But for testing purposes and advanced use cases that want control over the exact
variant of the `StringDtype`, we need some way to specify this and distinguish
them from the other string dtypes.

Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used for the new variant using `NaN`,
where the `"pyarrow_numpy"` storage was used to disambiguate from the existing
`"pyarrow"` option using `pd.NA`. However, `"pyarrow_numpy"` is a rather confusing
option and doesn't generalize well. Therefore, this PDEP proposes a new naming
scheme as outlined below, and `"pyarrow_numpy"` will be deprecated and removed
before pandas 3.0.

The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
storage of the string data (using pyarrow or python objects), but an additional
`na_value` keyword is introduced to disambiguate the variants using NA semantics
and NaN semantics.

Overview of the different ways to specify a dtype and the resulting concrete
dtype of the data:

| User specification | Concrete dtype | String alias | Note |
|---------------------------------------------|---------------------------------------------------------------|---------------------------------------|----------|
| Unspecified (inference) | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
| `"str"` or `StringDtype(na_value=np.nan)` | `StringDtype(storage="pyarrow"\|"python", na_value=np.nan)` | "str" | (1) |
| `StringDtype("pyarrow", na_value=np.nan)` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "str" | |
| `StringDtype("python", na_value=np.nan)` | `StringDtype(storage="python", na_value=np.nan)` | "str" | |
| `StringDtype("pyarrow")` | `StringDtype(storage="pyarrow", na_value=pd.NA)` | "string[pyarrow]" | |
| `StringDtype("python")` | `StringDtype(storage="python", na_value=pd.NA)` | "string[python]" | |
Comment on lines +235 to +236
Contributor

Can a user specify "string[pyarrow]" and "string[python]" as the dtype as it works in 2.2 ?

Member Author

Yes, nothing changes there (it's mentioned in one of the paragraphs below the table that in theory we could also stop supporting the storage in the string alias for those, but that's out of scope for the PDEP). The fact that they are listed in the "String alias" column is meant to indicate that you can use that string as a string alias when specifying a dtype.

Contributor

I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do Series.dtype. I'm not sure that we are completely bidirectional in all of our dtypes with respect to the strings.

Also, if we are saying that "str" will not indicate the storage type, should we not then deprecate the usage of "string[python]" and "string[pyarrow]" and just make it "string" ? The asymmetry between "str" not specifying storage and "string" doesn't feel right.

Member Author

I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do Series.dtype. I'm not sure that we are completely bidirectional in all of our dtypes with respect to the strings.

I know, and I had been going back and forth in my draft calling this column "String alias" vs "Dtype repr", because both are partly overlapping concepts, but then also "dtype repr" is confusing because there is both __repr__ and __str__ (eg for __str__ (used in df.dtypes) we always show just "string" without the storage suffix).

In the end, this column actually shows the dtype __repr__, but which in practice is the most specific string alias one can use to specify the dtype (i.e. for "string[pyarrow]" you can also use "string" if pyarrow is the default storage, but the table already includes too much content).

Also, if we are saying that "str" will not indicate the storage type, should we not then deprecate the usage of "string[python]" and "string[pyarrow]" and just make it "string" ?

As mentioned in my previous comment, a paragraph is included to say this is left for a separate discussion (this would introduce a backwards incompatible change for existing users, while otherwise there are not any on this front, and as for the exact way to indicate backends in string aliases, I would like to defer any final decision about that to the PDEP on logical types)

Member

Yea I find this really unfortunate but agree with Joris that we shouldn't rock the boat too much on this PDEP, especially since it is just about strings. The larger discussion on resolving those aliases should be had in PDEP-13

Contributor

In the end, this column actually shows the dtype __repr__, but which in practice is the most specific string alias one can use to specify the dtype (i.e. for "string[pyarrow]" you can also use "string" if pyarrow is the default storage, but the table already includes too much content).

Maybe relabel "String alias" to "String alias for dtype input" ?

Part of the confusion is this with 2.2:

>>> pd.Series(["abc"], dtype="string[pyarrow]")
0    abc
dtype: string
>>> pd.Series(["abc"], dtype="string[pyarrow]").dtype
string[pyarrow]
>>> pd.Series(["abc"], dtype="string")
0    abc
dtype: string
>>> pd.Series(["abc"], dtype="string").dtype
string[python]

When you print a Series, it shows the dtype as "string", but if you look at the dtype, it shows "string[python]" or "string[pyarrow]" This is where my confusion is coming from.

Member Author

When you print a Series, it shows the dtype as "string", but if you look at the dtype, it shows "string[python]" or "string[pyarrow]" This is where my confusion is coming from.

Yes, that's the difference between __repr__ and __str__ that I mentioned. Now, this is all existing behaviour, but so for the new dtype I am proposing to not have any difference here (both repr and str would show "str")

Contributor

I believe showing the storage in both __repr__ and __str__ is very important. When you work on large Series, you have two main constraints: storage and computation efficiency. Python-backed and pyarrow-backed strings are very different in these aspects. Whereas pyarrow is the best option most of the time, if you need mutability, python-backed strings could be a better option.

Member Author

@arnaudlegout it's definitely true that there are trade-offs in deciding which option is best (and for now, you will always be able to specify the storage and inspect it through an attribute).

I think the intent at the moment, though, is to consider the default of pyarrow (if installed) as the general default that almost everyone is expected to use, and that we don't incentivize too much to switch to the non-default storage (i.e. that it is more considered as an implementation detail which storage is used, for most users).
At least that's the sense I got from the discussion here. But more real-world feedback once we actually switch the default in 3.0 will probably give better insights in how big the drawbacks are in workflows heavy on mutability, and to what extent we should more clearly document those trade-offs (at some point we will also have to decide whether to keep the object-dtype based fallback long term, for example when making pyarrow required or when having a numpy string dtype based one).

My personal take on showing the storage in the repr/str is that for most users, I expect that they shouldn't have to care about the exact storage (or even be aware of the concept), and therefore I think it is more confusing to show this by default in the df.dtypes or ser repr (which both uses the dtype's __str__).
If we do want to show it in the dtype's __repr__ (to make it more discoverable), I would personally advocate for making the __repr__ more like a class repr instead of a string alias (eg printing something like <pandas.StringDtype(storage=...)> instead of string[storage], to avoid the [] use for this).

What exactly the __repr__ should look like for this new dtype is not included exactly in the PDEP, so feel free to open a new issue to argue for including the storage where we can continue this discussion.

Contributor

done #59342

| `"string"` or `StringDtype()` | `StringDtype(storage="pyarrow"\|"python", na_value=pd.NA)` | "string[pyarrow]" or "string[python]" | (1) |
| `StringDtype("pyarrow_numpy")` | `StringDtype(storage="pyarrow", na_value=np.nan)` | "string[pyarrow_numpy]" | (2) |

Notes:

- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- (2) "pyarrow_numpy" is kept temporarily because this is already in a released
version, but we can deprecate it in 2.x and have it removed for 3.0.
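
The mapping in the overview table can be modelled in a few lines of plain Python. This is a hedged sketch of the table's logic only, not pandas code: `resolve`, `resolve_alias`, and the string labels `"np.nan"` / `"pd.NA"` are hypothetical names introduced for this illustration.

```python
# Illustrative model of the overview table. Plain Python, not the pandas implementation.
NAN, NA = "np.nan", "pd.NA"  # string labels standing in for the two sentinels


def resolve(storage=None, na_value=NA, pyarrow_installed=True):
    """Return (storage, na_value) the way the table describes StringDtype(...)."""
    if storage == "pyarrow_numpy":
        # Note (2): legacy spelling, equivalent to PyArrow storage with NaN.
        return ("pyarrow", NAN)
    if storage is None:
        # Note (1): the default storage depends on PyArrow being installed.
        storage = "pyarrow" if pyarrow_installed else "python"
    return (storage, na_value)


def resolve_alias(alias, pyarrow_installed=True):
    """Map the generic aliases "str" and "string" to a concrete variant."""
    na_value = {"str": NAN, "string": NA}[alias]
    return resolve(na_value=na_value, pyarrow_installed=pyarrow_installed)


print(resolve_alias("str", pyarrow_installed=False))
print(resolve("pyarrow"))
print(resolve("pyarrow_numpy"))
```

The model makes the two independent axes explicit: `storage` decides how the data is held, and `na_value` decides which missing value semantics apply.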

For the new default string dtype, only the `"str"` alias can be used to
specify the dtype as a string, i.e. pandas would not provide a way to make the
underlying storage (pyarrow or python) explicit through the string alias. This
string alias is only a convenience shortcut and for most users `"str"` is
sufficient (they don't need to specify the storage), and the explicit
`pd.StringDtype(storage=..., na_value=np.nan)` is still available for more
fine-grained control.

Also for the existing variant using `pd.NA`, specifying the storage through the
string alias could be deprecated, but that is left for a separate decision.

## Alternatives

### Why not delay introducing a default string dtype?

To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?
overhauling all our dtypes to use a logical data type system?), introducing a
default string dtype could also be delayed until there is more clarity in those
other discussions. Specifically, it would avoid temporarily switching to use
`NaN` for the string dtype, while in a future version we might switch back
to `pd.NA` by default.

However:

1. Delaying has a cost: it further postpones introducing a dedicated string
dtype that has significant benefits for users, both in usability and (for the
part of the user base that has PyArrow installed) in performance.
2. In case pandas eventually transitions to use `pd.NA` as the default missing value
sentinel, a migration path for _all_ pandas data types will be needed, and thus
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

Making this change now for 3.0 will benefit the majority of users, and the PDEP
author believes this is worth the cost of the added complexity around "yet
another dtype" (also for other data types we already have multiple variants).

### Why not use the existing StringDtype with `pd.NA`?

Wouldn't adding even more variants of the string dtype make things only more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
experience is _less_ confusing, and the new string dtype fits better with the
other default data types.

If the new default string data type would use `pd.NA`, then after some
operations, a user can easily end up with a DataFrame that mixes columns using
`NaN` semantics and columns using `NA` semantics (and thus a DataFrame that
could have columns with two different int64, two different float64, two different
bool, etc dtypes). This would lead to a very confusing default experience.

With the proposed new variant of the StringDtype, this will ensure that for the
_default_ experience, a user will see only 1 kind of integer dtype, only
1 kind of bool dtype, etc. For now, a user should only get columns using `pd.NA`
when explicitly opting into this.
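
The mixing concern can already be reproduced with the opt-in `pd.NA` variant. The following is a small sketch with made-up column names, assuming a pandas version in which `dtype="string"` selects the `pd.NA` variant:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "n": [1, 2],  # default numpy-backed integer column
        "s": pd.Series(["a", "bb"], dtype="string"),  # opt-in pd.NA variant
    }
)
# A derived column inherits the nullable flavour from the string column.
df["s_len"] = df["s"].str.len()

# Two different integer dtypes now live side by side in one DataFrame.
print(df.dtypes)
```

Here `n` and `s_len` both hold integers, yet they have different dtypes and different missing value behaviour, which is the situation the proposal wants to avoid for the default experience.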

### Naming alternatives

An initial version of this PDEP proposed to use the `"string"` alias and the
default `pd.StringDtype()` class constructor for the new default dtype.
However, that caused a lot of discussion around backwards compatibility for
existing users of the `StringDtype` using `pd.NA`.

During the discussion, several alternatives were brought up, including both
alternative keyword names and using a different constructor. In the end,
this PDEP proposes to use a different string alias (`"str"`) but to keep
using the existing `pd.StringDtype` (with the existing `storage` keyword but
with an additional `na_value` keyword) for now to keep the changes as
minimal as possible, leaving a larger overhaul of the dtype system (potentially
including different constructor functions or namespace) for a future discussion.
See [GH-58613](https://github.com/pandas-dev/pandas/issues/58613) for the full
discussion.

One consequence is that when using the class constructor for the default dtype,
it has to be used with non-default arguments, i.e. a user needs to specify
`pd.StringDtype(na_value=np.nan)` to get the default dtype using `NaN`.
Therefore, the pandas documentation will focus on the usage of `dtype="str"`.

## Backward compatibility

The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
change is done as a hard break in a major release, as warning in advance for the
changed inference is deemed too noisy.

To allow testing code in advance, the
`pd.options.future.infer_string = True` option is available for users.

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should generally all keep working as is.
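
As an illustration of the kind of update this implies, the sketch below contrasts a check that depends on `object` dtype with one that should hold under either inference. Using `pd.api.types.is_string_dtype` is one possible replacement suggested here, not something the PDEP prescribes.

```python
import pandas as pd

# Object dtype when strings are not inferred, the new string dtype otherwise.
ser = pd.Series(["a", "b"])

# Fragile: only true while strings are stored as object dtype.
was_object = ser.dtype == object

# More robust: asks whether the column holds strings, whatever the dtype.
is_text = pd.api.types.is_string_dtype(ser)
print(was_object, is_text)

# The .str accessor works regardless of which dtype was inferred.
upper = ser.str.upper().tolist()
print(upper)
```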

By preserving the current missing value semantics, this proposal is also mostly
backwards compatible on this aspect. When storing strings in object dtype, pandas
however did allow using `None` as the missing value indicator as well (and in
certain cases such as the `shift` method, pandas even introduced this itself).
For all the cases where currently `None` was used as the missing value sentinel,
this will change to use `NaN` consistently.
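
Code that tests for missing values with `is None` is sensitive to this change. A minimal sketch of a sentinel-agnostic check, using `shift` on an object-dtype Series as the example (which sentinel is introduced may depend on the dtype and pandas version):

```python
import pandas as pd

ser = pd.Series(["a", "b"], dtype=object).shift(1)
first = ser.iloc[0]  # None or NaN, depending on dtype and pandas version

# `first is None` depends on which sentinel was used; pd.isna covers both.
first_is_missing = pd.isna(first)
mask = ser.isna().tolist()
print(first_is_missing, mask)
```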

### For existing users of `StringDtype`

Existing code that already opted in to use the `StringDtype` using `pd.NA`
should generally keep working as is. The latest version of this PDEP preserves
the behaviour of `dtype="string"` or `dtype=pd.StringDtype()` to mean the
`pd.NA` variant of the dtype.

It does propose to change the default storage to `"pyarrow"` (if available) for
the opt-in `pd.NA` variant as well, but this should not have much user-visible
impact.

## Timeline

The future PyArrow-backed string dtype was already made available behind a feature
flag in pandas 2.1 (enabled by `pd.options.future.infer_string = True`).

The variant using numpy object-dtype can also be backported to the 2.2.x branch
to allow easier testing. It is proposed to release this as 2.3.0 (created from
the 2.2.x branch, given that the main branch already includes many other changes
targeted for 3.0), together with the changes to the naming scheme.

The 2.3.0 release would then have all future string functionality available
(both the pyarrow and object-dtype based variants of the default string dtype).

For pandas 3.0, this `future.infer_string` flag becomes enabled by default.

## PDEP-14 History

- 3 May 2024: Initial version