PDEP-18: Nullable Object Dtype #61599
Conversation
“Object” analogous to “Float64”?

At least to me the PDEP will be easier to read (and comment on) if you limit the line width to 80 or similar. The idea sounds good; it'd be good if you can provide information on how using a boolean mask compares to having pd.NA embedded as a sentinel value.
Sure.

Using a sentinel as opposed to a mask is an implementation detail that I can expand on. I'm assuming that this would still be a separate dtype from the traditional numpy dtype, a pandas nullable dtype? We have the string array backed by a masked object array that I was effectively proposing reusing/refactoring as a base class. There is also another option, which is maybe what you are proposing: making a breaking change to the existing numpy object dtype to handle pd.NA differently? This is perhaps what is in the rejected ideas section and that needs clarification?
That's the obvious choice, but IIRC the capitalization was considered confusing/non-intuitive by some when discussed with respect to the string dtype.
Naming the nullable object dtype "Object" aligns well with pandas’ approach to evolving its dtypes (like "Int64" for integers, "Float64" for floats and "Boolean" for the nullable boolean).

Also bear in mind that, being effectively a tweak to the pd.NA variant of the Python-backed string dtype, the repr could be just "object", just as "string[pyarrow]" is shown as just "string". This effectively indicates the logical type, and instead of using the dtype string alias we could recommend constructing the nullable array for testing/evaluation using the Dtype object, following the patterns that there seemed to be some consensus on in PDEP-13.

There may be an advantage to using "object" as the repr instead of "Object", as this could potentially simplify the transition to nullable types by default in the future. So I would think that using the more explicit "object_nullable" for now could be better than introducing the capitalized form, if it was agreed that the repr is just "object", avoiding the subtleties of capitalization.
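To illustrate the precedent mentioned above, a small sketch (not part of the proposal): the pyarrow-backed string dtype already reprs as plain "string" in a Series, hiding the storage detail.

```python
import pandas as pd

# Requires pyarrow. The dtype's name is "string" regardless of storage,
# so the Series repr hides the backend; "object" could work the same way.
s = pd.Series(["a", pd.NA], dtype="string[pyarrow]")
print(s)
# 0       a
# 1    <NA>
# dtype: string
```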
Also note that in the PDEP it was written "tentatively named object_nullable".
No matter which of the two approaches above is considered, I think the arguments for using a mask as opposed to a sentinel are probably the same. Using a boolean mask is generally seen as the preferred design in pandas for extension arrays: it provides uniform missing-value handling, clearer data semantics, and is more in line with how extension types like "Int64" and "boolean" have been developed, not to mention the nullable string array, which the nullable object implementation is intended to re-use. This separation is one reason why the design of a nullable object dtype would benefit from a dedicated missingness mask rather than trying to “magically” interpret pd.NA embedded in an otherwise generic Python object array.

IIUC pd.NA was designed as the representation of missing values in nullable arrays and was never intended to be an explicit sentinel value. I think many of the issues with pd.NA in object arrays arise from users thinking that the pd.NA object itself is a missing value and not a representation of a missing value. Of course a Python object array can hold any object, so we can't, or maybe don't, stop users putting the pd.NA object in the traditional numpy object array.

To compare the two approaches:

Embedded pd.NA:

Separate Boolean Mask:
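A minimal sketch of the two storage layouts being compared; the variable names are illustrative only, not API from the PDEP.

```python
import numpy as np
import pandas as pd

# Embedded sentinel: pd.NA is just another element of the object array,
# so every operation must inspect values to discover which are missing.
embedded = np.array([1, "a", pd.NA], dtype=object)

# Separate boolean mask: the payload and the missingness are stored side
# by side, mirroring how existing masked arrays (e.g. "Int64") work.
values = np.array([1, "a", None], dtype=object)  # payload in masked slot is arbitrary
mask = np.array([False, False, True])            # True marks a missing entry
```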
data and data that is better represented by a nullable
container supporting missing values.

This proposal is driven by frequent community discussions
are there links for any of these? Off the top of my head I don't recall any users asking for this in particular.
I plan to list the issues in the OP before officially opening the discussion period, but yes this does need references. Thanks for highlighting this.
dtype to object_nullable using a constructor or an
explicit method (e.g., `pd.array(old_array,
dtype="object_nullable")`) using the existing API.
- Operations on existing pandas nullable dtypes that
examples? the only one that comes to mind is concat.
the str.split issue came to mind when I wrote this.
legacy object dtype does not provide a robust and
consistent solution, aligning with the design of other
extension arrays.
- Not having a nullable object dtype could potentially
am I right in thinking this is the underlying motivation?
I tried not to oversell the concept until I get a temperature check from the other core devs. I'm happy to remove this.
Polars has a nullable object type, and some contributors seem to see Polars as a threat, or maybe just prefer us to match Polars behavior. That could be a motivation if one is needed.
Ideally I would want to avoid object dtype where possible; PDEP-10 does that for many new datatypes. I opened this as it was looking like we will be rejecting PDEP-10, and if we have object dtype I think we should have a nullable one.
A few API questions it'd be helpful to see addressed explicitly:
I think @jbrockmendel's questions are very good and worth having explicit in the proposal. Personally I think … While it may be counterintuitive at first based on past behavior, I think it's the simplest to implement, to explain, and to understand.
Just posting high-level concerns:
Sorry if I missed it, but is the plan to implement the nullable object backed by numpy arrays only? I don't think we have an object dtype based on Arrow as Polars does, right? Back to the discussion about naming, I think …
what I've said in the initial draft is:

So I'm assuming that, to ease the transition, all these will be treated as missing, updating the mask appropriately and represented as pd.NA. So my assumption is that we do silently replace. Do you disagree with this, or perhaps prefer warnings for the assignment?
Seems reasonable. I assumed we would match the behavior of other nullable extension arrays for consistency. I'll audit this before opening the official discussion period. Thanks for highlighting this.
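For reference, a small example of the existing masked-array behavior being assumed here: recognised missing markers are silently coerced to pd.NA on assignment.

```python
import numpy as np
import pandas as pd

# "Int64" is an existing masked extension array; None and np.nan are
# coerced to pd.NA on setitem rather than stored as-is.
arr = pd.array([1, 2, 3], dtype="Int64")
arr[0] = None
arr[1] = np.nan
print(arr)
# <IntegerArray>
# [<NA>, <NA>, 3]
# Length: 3, dtype: Int64
```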
Oh one more:
Yes, I wanted to discuss this in PDEP-16, as the type mapping in the first commit showed that the traditional numpy object dtype would be retained. It is a few days away from a full year since that was opened, and the PDEP is incomplete with no discussion. Sitting on a draft PDEP is harming the project. If we could get PDEP-16 moving then I would probably not have opened this. I think being part of the bigger discussion is crucial if object_nullable was to become a default in the future. I did not explicitly state that in this initial draft, even though I made the comment that could be interpreted as the motivation for this PDEP. As I said in #61599 (comment), I'm happy to remove that. The motivation is just to create a dtype consistent with the other pandas nullable dtypes: we have Int, Float, Boolean etc. but do not have one for object.
Under the current proposal these would be missing in the mask (represented as the pd.NA scalar). If we instead do what @datapythonista suggests in #61599 (comment) then this would yield np.nan and would be a sentinel value.
to be compatible with PyArrow and Polars?
Sorry, I had some typos in that statement. I meant to say "While pandas still considers …"

Well, there's no compatibility with those two, since they don't have a type to store arbitrary PyObjects. I think it's a potential "strength" that pandas can provide a "nullable object dtype" to store PyObjects with null semantics. My concern is pandas still conflating … So, for example, …
Behind a flag. Not necessarily making any commitment here to changes to the existing pandas API; the idea is to allow evaluation of the concept, as we did with the other pandas nullable dtypes at first. The StringDtype had been available for a long time before being made a pandas default type. I expect object_nullable to be experimental for a long time, to iron out any consistency issues with the other pd.NA variants of the pandas nullable dtypes, and consistency issues with the legacy object dtype around the handling of values that are considered missing, as @jbrockmendel highlighted.
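As a sketch of what "behind a flag" has looked like before: the string dtype transition shipped an opt-in option long before the behavior became a default. No such option exists yet for object_nullable; the flag below is the string-dtype precedent, not the proposed API.

```python
import pandas as pd

# Existing opt-in flag from the string dtype transition (requires pyarrow);
# an object_nullable flag would presumably follow the same pattern.
pd.set_option("future.infer_string", True)
print(pd.Series(["a", "b"]).dtype)  # inferred as the string dtype when enabled
```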
Ah I see, I interpreted that as we could have an Arrow array containing the pointers to the Python objects.

The nullable string array is a nullable object array with constraints. The nullable object array would not be a string array. The constraints are its strengths, as we don't have the np.nan, None, pd.NaT issue?
only if …
+1, and hopefully that can give us inspiration to decouple the base …
That, and a general observation that users continue to store arbitrary objects (numpy arrays, pandas objects, custom classes) in pandas objects, so providing nullability for that is, I guess, a plus.
Hopefully not. The intention was an array with exactly the same NA semantics as the other pd.NA variants of the pandas nullable types, i.e. the original nullable string dtype, Int, Float, Bool etc. There should be no differences in my opinion.

The issues regarding handling of np.nan are perhaps similar to those for the pandas nullable float type, if we allow it. If we don't, it's not an issue, but the behavior of object and object_nullable would be different. In an ideal world we would perhaps want as much backwards compatibility as possible.

If we don't have pyarrow as a required dependency, then I would expect object-backed variants of the new dtypes. These would presumably be based on a nullable object array with constraints. For these cases the constraints would be their strength. I've not included that as a motivation either at this point. If the discussion here outlasts the PDEP-10/PDEP-15 discussions, which I expect it to given the timescales in PDEP-1, that could be added depending on the outcome. I'm assuming PDEP-10 is rejected. If PyArrow is made a required dependency for 3.0 then maybe we could shift focus to a nullable object array backed by pyarrow pointers to Python objects instead.
I see that discussing accessors is a glaring omission from my draft.
Interesting idea. The proposed nullable object dtype with a list accessor would be well suited to the return type of str.split(expand=False) for the pd.NA variant of the pandas nullable string dtype.
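For context, a sketch of the current behavior being referred to: str.split(expand=False) on the nullable string dtype falls back to the legacy object dtype to hold the resulting lists.

```python
import pandas as pd

# The split result holds Python lists, so pandas falls back to object
# dtype; a nullable object dtype with a list accessor would fit here.
s = pd.Series(["a b", pd.NA], dtype="string")
print(s.str.split())
# 0    [a, b]
# 1      <NA>
# dtype: object
```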
That's also an interesting idea. Designing a nullable object array from scratch instead of trying to match all the functionality of the current numpy object array!
For context, I do think we're requiring PyArrow in pandas 3.0, and using PyArrow types by default. My comments here are based on it. Unfortunately we need to wait an extra month to find out. But spending the next two years reinventing the nullable types that PyArrow already gives us for free seems a very poor investment of our time. And even if we do that, I'd do it by continuing to release 2.x versions. We'll continue this discussion in the appropriate threads; I just wanted to clarify that we are probably discussing this new PyObject type from very different points of view, as PDEP-15 is keeping the direction of the project very uncertain.
Where has this been discussed? My understanding is that the PyArrow types are just experimental and the pandas nullable types will be the default in the future (as per PDEP-16 when it's done).
This can't be discussed until there is clarity about PDEP-10 and PDEP-15 in my opinion. And it's been like 3 years of PyArrow types being experimental anyway, so I think it's something worth discussing and considering if we do require PyArrow, no?
I read the above as using PyArrow types by default in 3.0. That can't happen unless we postpone the 3.0 release and, as per our deprecation policy, have at least 2 minor releases in 2.x with the appropriate warnings for the breaking changes.
I don't think I've seen that seriously considered/discussed anywhere. There's a path to making it feasible to use pyarrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane.
I was initially shocked by the statement. I thought I had missed some significant discussion. Glad I misread it.
Where has this been discussed?
The "current proposal" option here is inconsistent with the Floating/Integer behavior. But it is consistent with the setitem/constructor behavior. Is everyone OK with that? |
I'm not aware that it has been. I'm just claiming that a path exists. A long, difficult path.
With some major changes to the makeup of the active contributors in the past few years, it may be that a high proportion of the current active contributors favor the ArrowDtype over the numpy-backed pandas nullable arrays. Maybe we need to have this discussion sooner rather than later? I've always assumed that the pandas nullable types would be the future defaults and not challenged that. But if that is not the direction the majority of the voting members now want, moving forward is going to be more difficult.
Sorry Simon, I ended up hijacking the discussion here. I created #61618 to discuss PyArrow default types, so we can keep this PR / PDEP focused on its goal.
Thanks @datapythonista for opening #61618. I just seek clarity and will surely consider the proposal, especially as PDEP-16 has failed to involve the community and other core developers on an "agreement" made 2 years ago.
Yes, I made some assumptions in the proposal about the future direction of pandas, PDEP-15 and PDEP-16 in particular. And I'm not at all hostile to the content of PDEP-16, only frustrated at not moving to a vote to date so that the PDEP is accepted and the item is on our "roadmap". I've had discussions with @WillAyd when reviewing the pandas cookbook related to the future direction of pandas, which is not clear without approved PDEPs and roadmap items. The cookbook promotes the pandas nullable dtypes as pandas' preferred approach, but I'm sure Will would have preferred to use the ArrowDtypes in the examples. Will, be sure to correct me if I'm in any way misinterpreting/misrepresenting you. So with respect to PDEP-16, it is clarity that I seek, to reduce the potential harm to the project of core devs pulling in two different directions. This proposal was intended to align with what I assumed were the goals of PDEP-16, i.e. a transition path to pandas nullable types by default.
Kind of...I ultimately just wanted to teach people a way that they could easily and consistently express the types of their data in pandas and operate on it. Completely ignoring the numpy versus extension versus arrow types unfortunately was rather challenging, because the choice of any of those can have drastic implications in terms of performance, usability, and subsequent type usage. I personally find that unfortunate and would prefer we get out of caring about the physical implementation of a type. PDEP-14 was my aim to address that, but I have since abandoned it.
FWIW we had a lengthy discussion about this in PDEP-16 already with respect to floating point data types, and the idea was that, even though it's less "correct," given our history assigning np.nan should still assign a missing value. See #58988 (comment) - I think that same logic should be held here.

Overall on the proposal I think there's some merit to doing this. My main hesitation is how we guide users into using new types like this. A good example would be users that truly have list-like or dict-like data; we have the potential as a library to better serve those types, especially if we follow through on PDEP-10. Of course, today people are using object type to store those values and this wouldn't change that, but I feel like we are sending them down a winding path if we tell them to use Object going forward, get them used to the semantics there, and then try to port them to a List or Dict type in the future.
My intention was that the nullable object array is experimental, and I agree that this transition path is a bad idea. That was not an intention of the proposed dtype or a suggested path, but I understand the concern that if the dtype was available some users would use it for that purpose. My assumption is that it would be a base class for object-fallback nested datatypes if PyArrow remained an optional dependency. My preference is to avoid object dtype where possible in the first instance, adding the new datatypes in PDEP-10 and then using "object_nullable" for what's left. So I envisioned a different transition path: object to new datatypes, followed by remaining object to "object_nullable".

There seems to be some interest, though, in a PyArrow-backed nullable object array, and I had not considered that as part of the proposal, only from the perspective that, with the missing value indicator being pd.NA, "object_nullable" would be more compatible with the ArrowDtypes and avoid mixed missing value semantics for Arrow users with some object data columns.
@jbrockmendel this is a more interesting example. The other 4 fall squarely in the domain of missing value handling. This one could be considered more conceptual and could drive the motivation for the need (or not) of a nullable object array.

Many discussions revolve around the use of object for types that aren't supported natively in the columnar context, hence presumably the suggestions to provide a much simpler API that just allows mapping of the objects to a supported type using a UDF. However, this example effectively shows a heterogeneous array of supported data types, often a result of selecting a row, or of other operations or transformations using the axis argument. Now I'm not sure how that's handled today with the ArrowDtype system. I think this example potentially supports the current proposal that the nullable object type would need to match the behavior of the legacy object dtype as much as possible, and I think that would also require all the accessors available on the numpy object array in order to selectively process heterogeneous arrays of supported types.
```python
import pandas as pd

df = pd.DataFrame(
    {
        "Int64": pd.Series([1, 2, pd.NA], dtype="Int64"),
        "string": pd.Series(["a", "b", pd.NA], dtype="string"),
    }
)
df.iloc[2]
# Int64     <NA>
# string    <NA>
# Name: 2, dtype: object
```

should pandas be putting pd.NA in an object array here?
This is a good example, and also something where I'd pitch that a logical type system with a "Missing" type would serve us better than trying to stuff this into an object type.
In the past when the idea of a missing dtype has come up it has been in the context of …
The case @simonjayhawkins just showed, where you can have an array of all missing values. I wouldn't expect a user to create that directly, but it can certainly happen throughout a sequence of operations. It's easier to have a dedicated missing type, inspect its metadata as it is passed along, and branch accordingly, rather than having an object-dtype and then having to inspect each element to determine if it is all missing.
Uhh, if the motivating use case is the example simon posted of `df.iloc[2]` above …
Ignoring PDEP-16 for now, I think I was expecting that the current behavior using numpy missing value semantics would in this case be np.nan for the Int64 value, np.nan/None for the string (to match legacy string, which was object), and pd.NaT if the columns had been temporal. This being the assumption based on the return type being a traditional numpy array and so honoring the typical missing values used in object. If we had a pandas nullable object dtype then pd.NA would be fine? But pd.NA in object doesn't work so well.
and IIRC I have in some cases been telling users not to do it when they submit a bug report. So I was surprised that pandas can produce pd.NA in object type. So for example #60049, when the user expects the behavior of a nullable boolean by inserting a pd.NA value, I've tended to tell them that their expectations and usage are incorrect.
the first case in the issue linked in the PDEP, #32931, is:

```python
import pandas as pd

pd.Series([1, pd.NA], dtype=object) >= 1
# 0     True
# 1    False
# dtype: bool
```

so as a temperature check: do others feel that putting pd.NA in an object array is inappropriate usage? Perhaps we shouldn't allow it? Or should we handle it correctly?
I think there is very little value in trying to change this; the semantics are really unclear and overloaded in a variety of contexts. It's just a gap in the existing type system.
#61182 is a case where the issue may have been incorrectly closed.
Not really; the existing type system didn't suffer from any of these issues until pd.NA was introduced. And users expect pd.NA to be a pandas missing value. I'm not prepared to dismiss all the issues regarding pd.NA in object type so easily. There are issues now dating back over 5 years, and hoping they will all go away when we adopt nullable types by default someday does not, to me, seem like a good plan.
I'm seeing numpy object dtype for row selection with Arrow types. This surely can't be right for the ArrowDtype system? What's the plan if we go for Arrow by default?
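A sketch reproducing the observation (assumes pyarrow is installed; the column names are illustrative): selecting a row across ArrowDtype columns currently yields a numpy object-dtype Series.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "a": pd.Series([1, 2], dtype=pd.ArrowDtype(pa.int64())),
        "b": pd.Series(["x", "y"], dtype=pd.ArrowDtype(pa.string())),
    }
)
print(df.iloc[0].dtype)  # object, not an Arrow-backed dtype
```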
as per PDEP-1
but comments are surely welcome in the meantime.