PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged
28 commits
fbeb69d
PDEP: Dedicated string data type for pandas 3.0
jorisvandenbossche May 3, 2024
f03f54d
small textual edits and typos
jorisvandenbossche May 3, 2024
561de87
address part of the feedback
jorisvandenbossche May 5, 2024
86f4e51
Update web/pandas/pdeps/00xx-string-dtype.md
jorisvandenbossche May 5, 2024
30c7b43
rename file
jorisvandenbossche May 13, 2024
54a43b3
expand Missing value semantics section
jorisvandenbossche May 13, 2024
5b5835b
expand Naming subsection with storage+na_value proposal
jorisvandenbossche May 13, 2024
9ede2e6
Expand Backward compatibility section + add proposal for deprecation
jorisvandenbossche May 13, 2024
f5faf4e
update timeline
jorisvandenbossche May 13, 2024
f554909
Apply suggestions from code review
jorisvandenbossche May 13, 2024
ac2d21a
Apply suggestions from code review
jorisvandenbossche May 13, 2024
82027d2
reflow after online edits
jorisvandenbossche May 13, 2024
5b24c24
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche May 13, 2024
f9c55f4
Apply suggestions from code review
jorisvandenbossche May 13, 2024
2c58c4c
Fixup table (#2)
rhshadrach May 14, 2024
0a68504
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche May 20, 2024
8974c5b
next round of updates (small text updates, add capitalized String alias)
jorisvandenbossche May 20, 2024
cca3a7f
use capitalized alias in the overview table
jorisvandenbossche May 20, 2024
d24a80a
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 10, 2024
9c5342a
New revision: keep back compat for 'string', introduce 'str' for the …
jorisvandenbossche Jun 10, 2024
b5663cc
Apply suggestions from code review
jorisvandenbossche Jun 11, 2024
1c4c2d9
Update web/pandas/pdeps/0014-string-dtype.md
jorisvandenbossche Jun 12, 2024
c44bfb5
rephrase main points in proposal
jorisvandenbossche Jun 12, 2024
af5ad3c
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jun 14, 2024
bd52f39
tiny edit
jorisvandenbossche Jun 14, 2024
f8fbc61
mismatched quote
jorisvandenbossche Jun 14, 2024
d78462d
Update 0014-string-dtype.md
phofl Jul 22, 2024
4de20d1
Merge remote-tracking branch 'upstream/main' into pdep-string-dtype
jorisvandenbossche Jul 24, 2024
Apply suggestions from code review
Co-authored-by: Irv Lustig <irv@princeton.com>
jorisvandenbossche and Dr-Irv authored May 13, 2024
commit f554909e95e055745227e945e31dfc5fabc1c0bf
55 changes: 28 additions & 27 deletions web/pandas/pdeps/0014-string-dtype.md
@@ -19,7 +19,7 @@ default in pandas 3.0:
This will give users a long-awaited proper string dtype for 3.0, while 1) not
(yet) making PyArrow a _hard_ dependency, but only a dependency used by default,
and 2) leaving room for future improvements (different missing value semantics,
using NumPy 2.0, etc).
using NumPy 2.0 strings, etc).

## Background

@@ -74,7 +74,7 @@ reconsideration:
runtime dependency. In addition, NumPy 2.0 could in the future potentially
reduce the need to make PyArrow a required dependency specifically for a
dedicated pandas string dtype.
- The PDEP did not consider the usage of the experimental `pd.NA` as a
- PDEP-10 did not consider the usage of the experimental `pd.NA` as a
consequence of adopting one of the existing implementations of the
`StringDtype`.

@@ -88,23 +88,23 @@ At the time, the `storage` option for this new variant was called
`pd.NA` (but this PDEP proposes a better naming scheme, see the "Naming"
subsection below).

This last dtype variant is what you currently (pandas 2.2) get for string data
This last dtype variant is what users currently (pandas 2.2) get for string data
when enabling the ``future.infer_string`` option (to enable the behaviour which
is intended to become the default in pandas 3.0).

## Proposal

To be able to move forward with a string data type in pandas 3.0, this PDEP proposes:

1. For pandas 3.0, we enable a "string" dtype by default, which will use PyArrow
1. For pandas 3.0, a "string" dtype is enabled by default, which will use PyArrow
if installed, and otherwise falls back to an in-house functionally-equivalent
(but slower) version.
2. This default "string" dtype will follow the same behaviour for missing values
as our other default data types, and use `NaN` as the missing value sentinel.
as other default data types, and use `NaN` as the missing value sentinel.
3. The version that is not backed by PyArrow can reuse (with minor code
additions) the existing numpy object-dtype backed StringArray for its
implementation.
4. We update installation guidelines to clearly encourage users to install
4. Installation guidelines are updated to clearly encourage users to install
pyarrow for the default user experience.

Those string dtypes enabled by default will then no longer be considered as
@@ -145,7 +145,7 @@ that:
nullable `'Int64"` / `pd.Int64Dtype()` dtype instead of the numpy `int64`
dtype (or `float64` in case of missing values)).

However, up to this date, all other default data types still use NaN semantics
However, up to this date, all other default data types still use `NaN` semantics
for missing values. Therefore, this proposal says that a new default string
dtype should also still use the same default missing value semantics and return
default data types when doing operations on the string column, to be consistent
@@ -176,9 +176,10 @@ needs minor changes to follow the above-mentioned missing value semantics
([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).

For pandas 3.0, this is the most realistic option given this implementation has
already been available for a long time. Beyond 3.0, we can still explore further
already been available for a long time. Beyond 3.0, further
improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)),
or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552))
can still be explored,
but at that point that is an implementation detail that should not have a
direct impact on users (except for performance).

@@ -187,7 +188,7 @@ direct impact on users (except for performance).
Given the long history of this topic, the naming of the dtypes is a difficult
topic.

In the first place, we need to acknowledge that most users should not need to
In the first place, it should be acknowledged that most users should not need to
use storage-specific options. Users are expected to specify `pd.StringDtype()`
or `"string"`, and that will give them their default string dtype (which
depends on whether PyArrow is installed or not).
@@ -201,8 +202,8 @@ Currently (pandas 2.2), `StringDtype(storage="pyarrow_numpy")` is used, where
the `"pyarrow_numpy"` storage was used to disambiguate from the existing
`"pyarrow"` option using `pd.NA`. However, "pyarrow_numpy" is a rather
confusing option and doesn't generalize well. Therefore, this PDEP proposes
a new naming scheme as outlined below, and we will deprecate and remove
"pyarrow_numpy" before pandas 3.0.
a new naming scheme as outlined below, and
"pyarrow_numpy" will be deprecated and removed before pandas 3.0.

The `storage` keyword of `StringDtype` is kept to disambiguate the underlying
WillAyd marked this conversation as resolved.
storage of the string data (using pyarrow or python objects), but an additional
@@ -227,12 +228,12 @@ Notes:

- (1) You get "pyarrow" or "python" depending on pyarrow being installed.
- (2) Those three rows are backwards incompatible (i.e. they work now but give
you the NA-variant), see the "Backward compatibility" section below.
the NA-variant), see the "Backward compatibility" section below.
- (3) "pyarrow_numpy" is kept temporarily because this is already in a released
version, but we can deprecate it in 2.2.x and have it removed for 3.0.
Comment (Contributor):

I think we'd have to deprecate in 2.3, not a 2.2.x?

jorisvandenbossche marked this conversation as resolved.

For the new default string dtype, only the `"string"` alias can be used to
specify the dtype as a string, i.e. we would not provide a way to make the
specify the dtype as a string, i.e. a way would not be provided to make the
underlying storage (pyarrow or python) explicit through the string alias. This
string alias is only a convenience shortcut and for most users `"string"` is
sufficient (they don't need to specify the storage), and the explicit
@@ -245,23 +246,23 @@ sufficient (they don't need to specify the storage), and the explicit
To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
jorisvandenbossche marked this conversation as resolved.
the default missing value sentinel? using the new NumPy 2.0 capabilities?
overhauling all our dtypes to use a logical data type system?), we could also
delay introducing a default string dtype until there is more clarity in those
overhauling all our dtypes to use a logical data type system?),
introducing a default string dtype could also be delayed until there is more clarity in those
other discussions.

However:

1. Delaying has a cost: it further postpones introducing a dedicated string
dtype that has massive benefits for our users, both in usability as (for the
dtype that has massive benefits for users, both in usability as (for the
significant part of the user base that has PyArrow installed) in performance.
Comment (Contributor):

I don't know if you can say "significant" yet. I would delete that word.

Reply (Member Author):

> I don't know if you can say "significant" yet. I would delete that word.

Deleted it.

2. In case we eventually transition to use `pd.NA` as the default missing value
sentinel, we will need a migration path for _all_ our data types, and thus
2. In case pandas eventually transitions to use `pd.NA` as the default missing value
sentinel, a migration path for _all_ our data types will be needed, and thus
jorisvandenbossche marked this conversation as resolved.
the challenges around this will not be unique to the string dtype and
therefore not a reason to delay this.

Making this change now for 3.0 will benefit the majority of our users, while
Making this change now for 3.0 will benefit the majority of users, while
coming at a cost for a part of the users who already started using the
`"string"` dtype (they will have to update their code to continue to the variant
`"string"` or `pd.StringDtype()` dtype (they will have to update their code to continue to the variant
using `pd.NA`, see the "Backward compatibility" section below).

### Why not use the existing StringDtype with `pd.NA`?
@@ -302,10 +303,10 @@ The most visible backwards incompatible change will be that columns with string
data will no longer have an `object` dtype. Therefore, code that assumes
`object` dtype (such as `ser.dtype == object`) will need to be updated. This
change is done as a hard break in a major release, as warning in advance for the
changed inference is deemed to noisy.
changed inference is deemed too noisy.

To allow testing your code in advance, the
`pd.options.future.infer_string = True` option is available.
To allow testing code in advance, the
`pd.options.future.infer_string = True` option is available for users.
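
As an illustration of the kind of update this hunk asks for (a sketch, not text from the PDEP): code that tests `ser.dtype == object` depends on the current inference, while `pd.api.types.is_string_dtype` is expected to hold both before and after the change. The variable names are hypothetical, and the example deliberately contains no missing values.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"])

# Fragile: True only while string data is still inferred as object dtype.
assumes_object = ser.dtype == object

# More robust: True for object-dtype strings and for a dedicated string dtype.
holds_strings = pd.api.types.is_string_dtype(ser)

print(ser.dtype, assumes_object, holds_strings)
```

Under the new inference (for example with the option above enabled), the first check is expected to become `False`, while the second should stay `True`.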

Otherwise, the actual string-specific functionality (such as the `.str` accessor
methods) should generally all keep working as is. By preserving the current
@@ -339,12 +340,12 @@ numpy dtypes (see the "Missing value semantics" section above).
While this change will be transparent in many cases (e.g. checking for missing
values with `isna()`/`dropna()`/`fillna()` or filtering rows with the result of
a string predicate method keeps working regardless of the sentinel), this can be
a breaking change if you relied on the exact sentinel or resulting dtype. Since
a breaking change if users relied on the exact sentinel or resulting dtype. Since
pandas 1.0, the string dtype has been promoted quite a bit, and so we expect
that many users already have started using this dtype, even though officially
still labeled as "experimental".
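
A minimal sketch of the "transparent" case described in this paragraph (not text from the PDEP; variable names are hypothetical): checks written with `isna()` and `fillna()` give the same answers for an object-dtype column and for the existing `"string"` dtype, without ever inspecting which sentinel is stored.

```python
import pandas as pd

# The same data as object dtype and as the existing "string" dtype.
# Which missing value sentinel each one stores is deliberately not inspected.
as_object = pd.Series(["a", None], dtype=object)
as_string = pd.Series(["a", None], dtype="string")

# Sentinel-agnostic checks give the same answers for both columns.
mask_object = as_object.isna().tolist()
mask_string = as_string.isna().tolist()

filled_object = as_object.fillna("").tolist()
filled_string = as_string.fillna("").tolist()

print(mask_object, mask_string, filled_object, filled_string)
```

By contrast, an identity check such as `value is pd.NA` relies on the exact sentinel, which is the kind of code the paragraph flags as potentially breaking.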

To smooth the upgrade experience for those users, we propose to add a
To smooth the upgrade experience for those users, it is proposed to add a
deprecation warning before 3.0 when such dtype is created, giving them two
options:

@@ -368,7 +369,7 @@ Some small enhancements or fixes might still be needed and can continue to be
backported to pandas 2.2.x.
Comment (Contributor):

I think those fixes should be in a 2.3

Reply (Member Author):

While it seems we haven't had any fixes yet in 2.2.x, we merged several fixes for the future default string dtype mode in 2.1.x (after the initial 2.1.0 release). I would think we can continue doing that for fixes, but can also just leave out this sentence if there is disagreement.

(I think the general rule applies here: whether a fix should be backported is discussed per PR, depending on how critical the fix is. So this maybe doesn't require explicit mention.)

Reply (Member Author):

> I think those fixes should be in a 2.3

Removed this sentence.


The variant using numpy object-dtype can also be backported to the 2.2.x branch
to allow easier testing. We would propose to release this as 2.3.0 (created from
to allow easier testing. It is proposed to release this as 2.3.0 (created from
the 2.2.x branch, given that the main branch already includes many other changes
targeted for 3.0), together with the deprecation warning when creating a dtype
from `"string"` / `pd.StringDtype()`.