PDEP-18: Nullable Object Dtype #61599
Conversation
“Object” analogous to “Float64”?

At least to me the PDEP will be easier to read (and comment on) if you limit the line width to 80 or similar. The idea sounds good; it'd be good if you can provide information on how using a boolean mask compares to having pd.NA embedded as a sentinel value.
Sure.

Using a sentinel as opposed to a mask is an implementation detail that I can expand on. I'm assuming that this would still be a separate dtype from the traditional numpy dtype, a pandas nullable dtype? We have the string array backed by a masked object array that I was effectively proposing reusing/refactoring as a base class. There is also another option, which is maybe what you are proposing: making a breaking change to the existing numpy object dtype to handle pd.NA differently? This is perhaps what is in the rejected ideas section and that needs clarification?
That's the obvious choice, but IIRC the capitalization was considered confusing/non-intuitive by some when discussed with respect to the string dtype.
Naming the nullable object dtype "Object" aligns well with pandas’ approach to evolving its dtypes (like "Int64" for integers, "Float64" for floats and "Boolean" for the nullable boolean).

Also bear in mind that, being effectively a tweak to the pd.NA variant of the Python-backed string dtype, the repr could be just "object", just as "string[pyarrow]" is shown as just "string". This effectively indicates the logical type, and instead of using the dtype string alias we could recommend constructing the nullable array for testing/evaluation using the Dtype object, following the patterns that there seemed to be some consensus on in PDEP-13.

There may be an advantage to using "object" as the repr instead of "Object", as this could potentially simplify the transition to nullable types by default in the future. So I would think that using the more explicit "object_nullable" for now could be better than introducing the capitalized form, if it was agreed that the repr is just "object", avoiding the subtleties of capitalization.
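To illustrate the precedent mentioned above, a small sketch (not part of the proposal): the pyarrow-backed string dtype already reprs as plain "string" in a Series, hiding the storage detail.

```python
import pandas as pd

# Requires pyarrow. The dtype's name is "string" regardless of storage,
# so the Series repr hides the backend; "object" could work the same way.
s = pd.Series(["a", pd.NA], dtype="string[pyarrow]")
print(s)
# 0       a
# 1    <NA>
# dtype: string
```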
Also note that in the PDEP it was written "tentatively named object_nullable".
No matter which of the two approaches above is considered, I think the arguments for using a mask as opposed to a sentinel are probably the same. Using a boolean mask is generally seen as the preferred design in pandas for extension arrays: it provides uniform missing-value handling, clearer data semantics, and is more in line with how extension types like "Int64" and "boolean" have been developed, not to mention the nullable string array, which the nullable object implementation is intended to re-use. This separation is one reason why the design of a nullable object dtype would benefit from a dedicated missingness mask rather than trying to “magically” interpret pd.NA embedded in an otherwise generic Python object array.

IIUC pd.NA was designed as the representation of missing values in nullable arrays and was never intended to be an explicit sentinel value. I think many of the issues with pd.NA in object arrays arise from users thinking that the pd.NA object itself is a missing value and not a representation of a missing value. Of course a Python object array can hold any object, so we can't, or maybe don't, stop users putting the pd.NA object in the traditional numpy object array.

To compare the two approaches:

Embedded pd.NA:

Separate Boolean Mask:
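A minimal sketch of the two storage layouts being compared; the variable names are illustrative only, not API from the PDEP.

```python
import numpy as np
import pandas as pd

# Embedded sentinel: pd.NA is just another element of the object array,
# so every operation must inspect values to discover which are missing.
embedded = np.array([1, "a", pd.NA], dtype=object)

# Separate boolean mask: the payload and the missingness are stored side
# by side, mirroring how existing masked arrays (e.g. "Int64") work.
values = np.array([1, "a", None], dtype=object)  # payload in masked slot is arbitrary
mask = np.array([False, False, True])            # True marks a missing entry
```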
data and data that is better represented by a nullable
container supporting missing values.

This proposal is driven by frequent community discussions
are there links for any of these? Off the top of my head I don't recall any users asking for this in particular.
I plan to list the issues in the OP before officially opening the discussion period, but yes this does need references. Thanks for highlighting this.
dtype to object_nullable using a constructor or an
explicit method (e.g., `pd.array(old_array,
dtype="object_nullable")`) using the existing API.
- Operations on existing pandas nullable dtypes that
examples? the only one that comes to mind is concat.
the str.split issue came to mind when I wrote this.
legacy object dtype does not provide a robust and
consistent solution, aligning with the design of other
extension arrays.
- Not having a nullable object dtype could potentially
am I right in thinking this is the underlying motivation?
I tried not to oversell the concept until I get a temperature check from the other core devs. I'm happy to remove this.
Polars has a nullable object type, and some contributors seem to see Polars as a threat, or maybe just prefer us to match Polars behavior. That could be a motivation if one is needed.
Ideally I would want to avoid object dtype where possible; PDEP-10 does that for many new datatypes. I opened this as it was looking like we will be rejecting PDEP-10, and if we have object dtype I think we should have a nullable one.
A few API questions it'd be helpful to see addressed explicitly:
I think @jbrockmendel's questions are very good and worth having explicit in the proposal. Personally I think … While it may be counterintuitive at first based on past behavior, I think it's the simplest to implement, to explain, and to understand.
Just posting high-level concerns:
Sorry if I missed it, but is the plan to implement the nullable object backed by numpy arrays only? I don't think we have an object dtype based on Arrow as Polars does, right? Back to the discussion about naming, I think …
what I've said in the initial draft is:

So I'm assuming that, to ease the transition, all these will be treated as missing, updating the mask appropriately and represented as pd.NA. So my assumption is that we do silently replace. Do you disagree with this, or perhaps prefer warnings for the assignment?
Seems reasonable. I assumed we would match the behavior of other nullable extension arrays for consistency. I'll audit this before opening the official discussion period. Thanks for highlighting this.
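For reference, a small example of the existing masked-array behavior being assumed here: recognised missing markers are silently coerced to pd.NA on assignment.

```python
import numpy as np
import pandas as pd

# "Int64" is an existing masked extension array; None and np.nan are
# coerced to pd.NA on setitem rather than stored as-is.
arr = pd.array([1, 2, 3], dtype="Int64")
arr[0] = None
arr[1] = np.nan
print(arr)
# <IntegerArray>
# [<NA>, <NA>, 3]
# Length: 3, dtype: Int64
```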
Oh one more:
Yes, I wanted to discuss this in PDEP-16, as the type mapping in the first commit showed that the traditional numpy object dtype would be retained. It is a few days away from a full year since that was opened, and the PDEP is incomplete with no discussion. Sitting on a draft PDEP is harming the project. If we could get PDEP-16 moving then I would probably not have opened this. I think being part of the bigger discussion is crucial if object_nullable was to become a default in the future. I did not explicitly state that in this initial draft, even though I made the comment that could be interpreted as the motivation for this PDEP. As I said in #61599 (comment), I'm happy to remove that. The motivation is just to create a dtype consistent with the other pandas nullable dtypes: we have Int, Float, Boolean etc. but do not have one for object.
Under the current proposal these would be missing in the mask (represented as the pd.NA scalar). If we instead do what @datapythonista suggests in #61599 (comment) then this would yield np.nan and would be a sentinel value.
to be compatible with PyArrow and Polars?
Sorry, I had some typos in that statement. I meant to say "While pandas still considers …"

Well, there's no compatibility with those two, since they don't have a type to store arbitrary PyObjects. I think it's a potential "strength" that pandas can provide a "nullable object dtype" to store PyObjects with null semantics. My concern is pandas still conflating … So, for example, …
Behind a flag. Not necessarily making any commitment here to changes to the existing pandas API; the idea is to allow evaluation of the concept, as we did with the other pandas nullable dtypes at first. The StringDtype had been available for a long time before being made a pandas default type. I expect object_nullable to be experimental for a long time, to iron out any consistency issues with the other pd.NA variants of the pandas nullable dtypes, and consistency issues with the legacy object dtype around the handling of values that are considered missing, as @jbrockmendel highlighted.
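As a sketch of what "behind a flag" has looked like before: the string dtype transition shipped an opt-in option long before the behavior became a default. No such option exists yet for object_nullable; the flag below is the string-dtype precedent, not the proposed API.

```python
import pandas as pd

# Existing opt-in flag from the string dtype transition (requires pyarrow);
# an object_nullable flag would presumably follow the same pattern.
pd.set_option("future.infer_string", True)
print(pd.Series(["a", "b"]).dtype)  # inferred as the string dtype when enabled
```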
Ah I see, I interpreted that as we could have an Arrow array containing the pointers to the Python objects.

The nullable string array is a nullable object array with constraints. The nullable object array would not be a string array. The constraints are its strengths, as we don't have the np.nan, None, pd.NaT issue?
only if …
+1, and hopefully that can give us inspiration to decouple the base …
That, and a general observation that users continue to store arbitrary objects (numpy arrays, pandas objects, custom classes) in pandas objects, so providing nullability for that is, I guess, a plus.
Hopefully not. The intention was an array with exactly the same NA semantics as the other pd.NA variants of the pandas nullable types, i.e. the original nullable string dtype, Int, Float, Bool etc. There should be no differences in my opinion.

The issues regarding handling of np.nan are perhaps similar to those for the pandas nullable float type, if we allow it. If we don't, it's not an issue, but the behavior of object and object_nullable would be different. In an ideal world we would perhaps want as much backwards compatibility as possible.

If we don't have pyarrow as a required dependency, then I would expect object-backed variants of the new dtypes. These would presumably be based on a nullable object array with constraints. For these cases the constraints would be their strength. I've not included that as a motivation either at this point. If the discussion here outlasts the PDEP-10/PDEP-15 discussions, which I expect it to given the timescales in PDEP-1, that could be added depending on the outcome. I'm assuming PDEP-10 is rejected. If PyArrow is made a required dependency for 3.0 then maybe we could shift focus to a nullable object array backed by pyarrow pointers to Python objects instead.
I see that discussing accessors is a glaring omission from my draft.
Interesting idea. The proposed nullable object dtype with a list accessor would be well suited to the return type of str.split(expand=False) for the pd.NA variant of the pandas nullable string dtype.
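For context, a sketch of the current behavior being referred to: str.split(expand=False) on the nullable string dtype falls back to the legacy object dtype to hold the resulting lists.

```python
import pandas as pd

# The split result holds Python lists, so pandas falls back to object
# dtype; a nullable object dtype with a list accessor would fit here.
s = pd.Series(["a b", pd.NA], dtype="string")
print(s.str.split())
# 0    [a, b]
# 1      <NA>
# dtype: object
```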
That's also an interesting idea. Designing a nullable object array from scratch instead of trying to match all the functionality of the current numpy object array!
For context, I do think we're requiring PyArrow in pandas 3.0, and using PyArrow types by default. My comments here are based on it. Unfortunately we need to wait an extra month to find out. But spending the next two years reinventing the nullable types that PyArrow already gives us for free seems a very poor investment of our time. And even if we do that, I'd do it by continuing to release 2.x versions. We'll continue this discussion in the appropriate threads; I just wanted to clarify that we are probably discussing this new PyObject type from very different points of view, as PDEP-15 is keeping the direction of the project very uncertain.
Where has this been discussed? My understanding is that the PyArrow types are just experimental and the pandas nullable types will be the default in the future (as per PDEP-16 when it's done).
This can't be discussed until there is clarity about PDEP-10 and PDEP-15 in my opinion. And it's been like 3 years of PyArrow types being experimental anyway, so I think it's something worth discussing and considering if we do require PyArrow, no?
I read the above as using PyArrow types by default in 3.0. That can't happen unless we postpone the 3.0 release and, as per our deprecation policy, have at least 2 minor releases in 2.x with the appropriate warnings for the breaking changes.
I don't think I've seen that seriously considered/discussed anywhere. There's a path to making it feasible to use pyarrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane.
I was initially shocked by the statement. I thought I had missed some significant discussion. Glad I misread it.
Where has this been discussed?
The "current proposal" option here is inconsistent with the Floating/Integer behavior. But it is consistent with the setitem/constructor behavior. Is everyone OK with that? |
I'm not aware that it has been. I'm just claiming that a path exists. A long, difficult path.
With some major changes to the makeup of the active contributors in the past few years, it may be that a high proportion of the current active contributors favor the ArrowDtype over the numpy-backed pandas nullable arrays. Maybe we need to have this discussion sooner rather than later? I've always assumed that the pandas nullable types would be the future defaults and not challenged that. But if that is not the direction the majority of the voting members now want, moving forward is going to be more difficult.
Sorry Simon, I ended up hijacking the discussion here. I created #61618 to discuss PyArrow default types, so we can keep this PR / PDEP focused on its goal.
Thanks @datapythonista for opening #61618. I just seek clarity and will surely consider the proposal, especially as PDEP-16 has failed to involve the community and other core developers on an "agreement" made 2 years ago.
Yes, I made some assumptions in the proposal about the future direction of pandas, PDEP-15 and PDEP-16 in particular. And I'm not at all hostile to the content of PDEP-16, only frustrated at not moving to a vote to date so that the PDEP is accepted and the item is on our "roadmap". I've had discussions with @WillAyd when reviewing the pandas cookbook related to the future direction of pandas, which is not clear without approved PDEPs and roadmap items. The cookbook promotes the pandas nullable dtypes as pandas' preferred approach, but I'm sure Will would have preferred to use the ArrowDtypes in the examples. Will, be sure to correct me if I'm in any way misinterpreting/misrepresenting you. So with respect to PDEP-16, it is clarity that I seek, to reduce the potential harm to the project of core devs pulling in two different directions. This proposal was intended to align with what I assumed were the goals of PDEP-16, i.e. a transition path to pandas nullable types by default.
Kind of...I ultimately just wanted to teach people a way that they could easily and consistently express the types of their data in pandas and operate on it. Completely ignoring the numpy versus extension versus arrow types unfortunately was rather challenging, because the choice of any of those can have drastic implications in terms of performance, usability, and subsequent type usage. I personally find that unfortunate and would prefer we get out of caring about the physical implementation of a type. PDEP-14 was my aim to address that, but I have since abandoned it.
FWIW we had a lengthy discussion about this in PDEP-16 already with respect to floating point data types, and the idea was that, even though it's less "correct," given our history assigning np.nan should still assign a missing value. See #58988 (comment) - I think that same logic should be held here.

Overall on the proposal I think there's some merit to doing this. My main hesitation is how we guide users into using new types like this. A good example would be users that truly have list-like or dict-like data; we have the potential as a library to better serve those types, especially if we follow through on PDEP-10. Of course, today people are using object type to store those values and this wouldn't change that, but I feel like we are sending them down a winding path if we tell them to use Object going forward, get them used to the semantics there, and then try to port them to a List or Dict type in the future.
My intention was that the nullable object array is experimental, and I agree that this transition path is a bad idea. That was not an intention of the proposed dtype or a suggested path, but I understand the concern that if the dtype was available some users would use it for that purpose. My assumption is that it would be a base class for object-fallback nested datatypes if PyArrow remained an optional dependency. My preference is to avoid object dtype where possible in the first instance, adding the new datatypes in PDEP-10 and then using "object_nullable" for what's left. So I envisioned a different transition path: object to new datatypes, followed by remaining object to "object_nullable".

There seems to be some interest, though, in a PyArrow-backed nullable object array, and I had not considered that as part of the proposal, only from the perspective that, with the missing value indicator being pd.NA, "object_nullable" would be more compatible with the ArrowDtypes and avoid mixed missing value semantics for Arrow users with some object data columns.
@jbrockmendel this is a more interesting example. The other 4 fall squarely in the domain of missing value handling. This one could be considered more conceptual and could drive the motivation for the need (or not) of a nullable object array.

Many discussions revolve around the use of object for types that aren't supported natively in the columnar context, hence presumably the suggestions to provide a much simpler API that just allows mapping of the objects to a supported type using a UDF. However, this example effectively shows a heterogeneous array of supported data types, often a result of selecting a row, or of other operations or transformations using the axis argument. Now I'm not sure how that's handled today with the ArrowDtype system. I think this example potentially supports the current proposal that the nullable object type would need to match the behavior of the legacy object dtype as much as possible, and I think that would also require all the accessors available on the numpy object array in order to selectively process heterogeneous arrays of supported types.
```python
import pandas as pd

df = pd.DataFrame(
    {
        "Int64": pd.Series([1, 2, pd.NA], dtype="Int64"),
        "string": pd.Series(["a", "b", pd.NA], dtype="string"),
    }
)
df.iloc[2]
# Int64     <NA>
# string    <NA>
# Name: 2, dtype: object
```

should pandas be putting pd.NA in an object array here?
This is a good example, and also something where I'd pitch that a logical type system with a "Missing" type would serve us better than trying to stuff this into an object type.
In the past when the idea of a missing dtype has come up it has been in the context of …
The case @simonjayhawkins just showed, where you can have an array of all missing values. I wouldn't expect a user to create that directly, but it can certainly happen throughout a sequence of operations. It's easier to have a dedicated missing type, inspect its metadata as it is passed along, and branch accordingly, rather than having an object-dtype and then having to inspect each element to determine if it is all missing.
Uhh, if the motivating use case is the example simon posted of `df.iloc[2]` above …
Ignoring PDEP-16 for now, I think I was expecting that the current behavior using numpy missing value semantics would in this case be np.nan for the Int64 value, np.nan/None for the string (to match legacy string, which was object), and pd.NaT if the columns had been temporal. This being the assumption based on the return type being a traditional numpy array and so honoring the typical missing values used in object. If we had a pandas nullable object dtype then pd.NA would be fine? But pd.NA in object doesn't work so well.
and IIRC I have in some cases been telling users not to do it when they submit a bug report. So I was surprised that pandas can produce pd.NA in object type. So for example #60049, when the user expects the behavior of a nullable boolean by inserting a pd.NA value, I've tended to tell them that their expectations and usage are incorrect.
the first case in the issue linked in the PDEP, #32931, is:

```python
import pandas as pd

pd.Series([1, pd.NA], dtype=object) >= 1
# 0     True
# 1    False
# dtype: bool
```

so as a temperature check: do others feel that putting pd.NA in an object array is inappropriate usage? Perhaps we shouldn't allow it? Or should we handle it correctly?
I think there is very little value in trying to change this; the semantics are really unclear and overloaded in a variety of contexts. It's just a gap in the existing type system.
#61182 is a case where the issue may have been incorrectly closed.
Not really; the existing type system didn't suffer from any of these issues until pd.NA was introduced. And users expect pd.NA to be a pandas missing value. I'm not prepared to dismiss all the issues regarding pd.NA in object type so easily. There are issues now dating back over 5 years, and hoping they will all go away when we adopt nullable types by default someday does not, to me, seem like a good plan.
I'm seeing numpy object dtype for row selection with Arrow types. This surely can't be right for the ArrowDtype system? What's the plan if we go for Arrow by default?
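A sketch reproducing the observation (assumes pyarrow is installed; the column names are illustrative): selecting a row across ArrowDtype columns currently yields a numpy object-dtype Series.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "a": pd.Series([1, 2], dtype=pd.ArrowDtype(pa.int64())),
        "b": pd.Series(["x", "y"], dtype=pd.ArrowDtype(pa.string())),
    }
)
print(df.iloc[0].dtype)  # object, not an Arrow-backed dtype
```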
as per PDEP-1
but comments are surely welcome in the meantime.