Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-14: Dedicated string data type for pandas 3.0 #58551
PDEP-14: Dedicated string data type for pandas 3.0 #58551
Changes from 20 commits
fbeb69d
f03f54d
561de87
86f4e51
30c7b43
54a43b3
5b5835b
9ede2e6
f5faf4e
f554909
ac2d21a
82027d2
5b24c24
f9c55f4
2c58c4c
0a68504
8974c5b
cca3a7f
d24a80a
9c5342a
b5663cc
1c4c2d9
c44bfb5
af5ad3c
bd52f39
f8fbc61
d78462d
4de20d1
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a user specify
"string[pyarrow]"
and"string[python]"
as thedtype
as it works in 2.2 ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nothing changes there (it's mentioned in one of the paragraphs below the table that in theory we could also stop supporting the storage in the string alias for those, but that's out of scope for the PDEP). The fact that they are listed in the "String alias" column is meant to indicate that you can use that string as a string alias when specifying a dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am confused by the "string alias" aspect. There are string aliases for specifying the dtype, but also string aliases for representing the dtype (i.e., what you see if you do
Series.dtype
. I'm not sure that we are completely bidirectional in all of our dtypes with respect to the strings.Also, if we are saying that
"str"
will not indicate the storage type, should we not then deprecate the usage of"string[python]"
and"string[pyarrow]"
and just make it"string"
? The asymmetry between"str"
not specifying storage and"string"
doesn't feel right.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, and I had been going back and forth in my draft calling this columm "String alias" vs "Dtype repr", because both are partly overlapping concepts, but then also "dtype repr" is confusing because there is both
__repr__
and__str__
(eg for__str__
(used indf.dtypes
) we always just "string" without the storage suffix).In the end, this column actually shows the dtype
__repr__
, but which in practice is the most specific string alias one can use to specify the dtype (i.e. for "string[pyarrow]" you can also use "string" if pyarrow is the default storage, but the table already includes too much content).As mentioned in my previous comment, a paragraph is included to say this is left for a separate discussion (this would introduce a backwards incompatible change for existing users, while otherwise are not any on this front, and the exact way to indicate backends in string aliases, I would like to defer any final decision about that to the PDEP on logical types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I find this really unfortunate but agree with Joris that we shouldn't rock the boat too much on this PDEP, especially since it is just about strings. The larger discussion on resolving those aliases should be had in PDEP-13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe relabel "String alias" to "String alias for dtype input" ?
Part of the confusion is this with 2.2:
When you print a
Series
, it shows thedtype
as"string"
, but if you look at thedtype
, it shows"string[python]"
or"string[pyarrow]"
This is where my confusion is coming from.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the difference between
__repr__
and__str__
that I mentioned. Now, this is all existing behaviour, but so for the new dtype I am proposing to not have any difference here (both repr and str would show"str"
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe showing the storage in both
__repr__
and__str__
is very important. When you work of large Series, you have two main constraints: storage and computation efficiency. python backed and pyarrow backed are much different in these aspects. Whereas pyarrow is the best option most of the time, if you need mutability, python backed strings could be a better option.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arnaudlegout it's definitely true that there are trade-offs in deciding which option is best (and for now, you will always be able to specify the storage and inspect it through an attribute).
I think the intent at the moment, though, is to consider the default of pyarrow (if installed) as the general default that almost everyone is expected to use, and that we don't incentivize too much to switch to the non-default storage (i.e. that it is more considered as an implementation detail which storage is used, for most users).
At least that's the sense I got from the discussion here. But more real-world feedback once we actually switch the default in 3.0 will probably give better insights in how big the drawbacks are in workflows heavy on mutability, and to what extent we should more clearly document those trade-offs (at some point we will also have to decide whether to keep the object-dtype based fallback long term, for example when making pyarrow required or when having a numpy string dtype based one).
My personal take on showing the storage in the repr/str is that for most users, I expect that they shouldn't have to care about the exact storage (or even be aware of the concept), and therefore I think it is more confusing to show this by default in the
df.dtypes
orser
repr (which both uses the dtype's__str__
).If we do want to show it in the dtype's
__repr__
(to make it more discoverable), I would personally advocate for making the__repr__
more like a class repr instead of a string alias (eg printing something like<pandas.StringDtype(storage=...)>
instead ofstring[storage]
, to avoid the[]
use for this).What exactly the
__repr__
should look like for this new dtype is not included exactly in the PDEP, so feel free to open a new issue to argue for including the storage where we can continue this discussion.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done #59342