
PDEP-18: Nullable Object Dtype #61599


Open
wants to merge 3 commits into base: main

Conversation

simonjayhawkins
Member

as per PDEP-1

The initial status of a PDEP will be Status: Draft. This will be changed to Status: Under discussion by the author(s), when they are ready to proceed with the decision making process.

but comments are surely welcome in the meantime.

simonjayhawkins added the PDEP (pandas enhancement proposal) label on Jun 7, 2025
@jbrockmendel
Member

“Object” analogous to “Float64”?

@datapythonista
Member

At least to me, the PDEP will be easier to read (and comment on) if you limit the line width to 80 characters or similar.

The idea sounds good. It'd be good if you could provide information on how using a boolean mask compares to having pandas.NA inside the main array.

@simonjayhawkins
Member Author

At least to me, the PDEP will be easier to read (and comment on) if you limit the line width to 80 characters or similar.

sure.

it'd be good if you could provide information on how using a boolean mask compares to having pandas.NA inside the main array.

Using a sentinel as opposed to a mask is an implementation detail that I can expand on. I'm assuming that this would still be a separate dtype from the traditional numpy dtype, i.e. a pandas nullable dtype? We have the string array backed by a masked object array, which I was effectively proposing reusing/refactoring as a base class.

There is also another option, which is maybe what you are proposing: making a breaking change to the existing numpy object dtype to handle pd.NA differently? This is perhaps what is in the rejected ideas section and needs clarification?

@simonjayhawkins
Member Author

“Object” analogous to “Float64”?

That's the obvious choice, but IIRC the capitalization was considered confusing/non-intuitive by some when discussed with respect to the string dtype.

@simonjayhawkins
Member Author

“Object” analogous to “Float64”?

That's the obvious choice, but IIRC the capitalization was considered confusing/non-intuitive by some when discussed with respect to the string dtype.

Naming the nullable object dtype "Object" aligns well with pandas' approach to evolving its dtypes (like "Int64" for integers, "Float64" for floats and "Boolean" for the nullable bool dtype). It creates a clear visual and semantic cue for users that a nullable, extension-based implementation is being used. As long as the design and documentation explicitly address the differences between the legacy object dtype and "Object", this approach could indeed enhance clarity and usability, even though it is less explicit than "object_nullable".

Also bear in mind that, being effectively a tweak to the pd.NA variant of the Python-backed string dtype, the repr could be just "object", just as "string[pyarrow]" is shown as just "string". This effectively indicates the logical type, and instead of using the dtype string alias we could recommend constructing the nullable array for testing/evaluation from the Dtype object, using the patterns that seemed to have some consensus in PDEP-13. There may be an advantage to using "object" as the repr instead of "Object", as this could potentially simplify the transition to nullable types by default in future. So I would think that using the more explicit "object_nullable" for now could be better than introducing the capitalized form, if it were agreed that the repr is just "object", avoiding the subtleties of capitalization.
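
For reference, the repr behavior mentioned here can be seen with the string dtype today (pandas 2.x, assuming pyarrow is installed, with the usual import pandas as pd); a minimal illustration, not part of the proposal:

s = pd.Series(["a", "b"], dtype="string[pyarrow]")
s
# 0    a
# 1    b
# dtype: string  <- the storage is not shown, only the logical type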

@simonjayhawkins
Member Author

Also note that in the PDEP it was written "tentatively named "object_nullable"", based on the passage of PDEP-14, which needed a sub-discussion to address. The words were chosen purposely not to set the dtype string alias in stone, allowing the discussion to potentially avoid this debate in the main discussion thread.

@simonjayhawkins
Member Author

it'd be good if you could provide information on how using a boolean mask compares to having pandas.NA inside the main array.

Using a sentinel as opposed to a mask is an implementation detail that I can expand on. I'm assuming that this would still be a separate dtype from the traditional numpy dtype, i.e. a pandas nullable dtype? We have the string array backed by a masked object array, which I was effectively proposing reusing/refactoring as a base class.

There is also another option, which is maybe what you are proposing: making a breaking change to the existing numpy object dtype to handle pd.NA differently? This is perhaps what is in the rejected ideas section and needs clarification?

No matter which of the two approaches above is considered, I think the arguments for using a mask as opposed to a sentinel are probably the same:

Using a Boolean mask is generally seen as the preferred design in pandas for extension arrays. It provides uniform missing-value handling and clearer data semantics, and is more in line with how extension types like "Int64" and "boolean" have been developed, not to mention the nullable string array, which this proposal intends to reuse for the nullable object implementation.

This separation is one reason why the design of a nullable object dtype would benefit from a dedicated missingness mask rather than trying to “magically” interpret pd.NA embedded in an otherwise generic Python object array. IIUC, pd.NA was designed as the representation of missing values in nullable arrays and was never intended to be an explicit sentinel value. I think many of the issues with pd.NA in object arrays arise from users thinking that the pd.NA object itself is a missing value rather than a representation of a missing value. Of course, a Python object array can hold any object, so we can't (or at least don't) stop users from putting the pd.NA object in the traditional numpy object array.
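
To make the scalar semantics concrete, here is a short illustration of documented pd.NA behavior (usual pd import assumed): it propagates through comparisons and refuses boolean coercion, which is exactly what a plain numpy object array is not designed to accommodate:

pd.NA == 1
# <NA>  <- comparisons propagate instead of returning True/False

bool(pd.NA)
# TypeError: boolean value of NA is ambiguous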

To compare the two approaches:

Embedded pd.NA:

  • Maybe a simpler conceptual model for small or homogeneous arrays where overhead is minimal.
  • Checking each element at runtime might involve extra comparisons.
  • When missing values are stored in the same array as valid data, ensuring that operations treat pd.NA consistently can be tricky.

Separate Boolean Mask:

  • Offers a clear and robust way to denote missing data while preserving clean data arrays; aligns with the design of other extension types in pandas; facilitates efficient and consistent missing-data handling across operations.
  • Introduces additional complexity in data structure design. This is not really an issue, as no proof of concept is needed: the nullable object array shares so much code with the tried-and-tested pd.NA variant of the string array, which has been available for a long time now.
  • Separating the missingness information from the actual data often leads to more readable, more performant and more maintainable code. When operations are performed, pandas can first consult the mask to identify missing elements and then process only the valid ones, or process only the missing values with operations such as fillna. Many vectorized operations can short-circuit based on the mask.
  • The extra Boolean mask does add memory overhead, though future implementations could optimize the mask as a bitmask rather than a full Boolean array. This applies to all pandas nullable types, so that discussion would be outside the scope of this PDEP.
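
For concreteness, this is the layout the existing masked extension arrays use today (usual pd import assumed); _data and _mask are internal attributes, shown only to make the design the proposal would reuse visible:

arr = pd.array([1, None, 3], dtype="Int64")
arr._data
# a plain numpy int64 array; the entry behind a masked slot is an
# arbitrary fill value that is never consulted
arr._mask
# array([False,  True, False])  <- True marks a missing slot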

data and data that is better represented by a nullable
container supporting missing values.

This proposal is driven by frequent community discussions
Member

Are there links for any of these? Off the top of my head I don't recall any users asking for this in particular.

Member Author

I plan to list the issues in the OP before officially opening the discussion period, but yes this does need references. Thanks for highlighting this.

dtype to object_nullable using a constructor or an
explicit method (e.g., `pd.array(old_array,
dtype="object_nullable")`) using the existing api.
- Operations on existing pandas nullable dtypes that
Member

Examples? The only one that comes to mind is concat.

Member Author

the str.split issue came to mind when I wrote this.

legacy object dtype does not provide a robust and
consistent solution, aligning with the design of other
extension arrays.
- Not having a nullable object dtype could potentially
Member

Am I right in thinking this is the underlying motivation?

Member Author

I tried not to "sell" the concept too much until I get a temperature check from the other core devs. I'm happy to remove this.

Polars has a nullable object type, and some contributors seem to see Polars as a threat, or maybe just prefer us to match Polars behavior. That could be a motivation if one is needed.

Ideally I would want to avoid object dtype where possible; PDEP-10 does that for many new datatypes. I opened this as it was looking like we would be rejecting PDEP-10, and if we have an object dtype I think we should have a nullable one.

@jbrockmendel
Member

A few API questions it'd be helpful to see addressed explicitly:

  1. `ser[0] = np.nan`: does this assign NaN, or does it silently replace with pd.NA?
  2. Same question with NaN in a list passed to the constructor.
  3. Same questions but with None or pd.NaT.
  4. `ser = pd.Series([pd.NA, None, pd.NaT, np.nan], dtype="object_nullable")`: assuming we don't silently replace, what do isna/fillna/skipna do?

@datapythonista
Member

I think @jbrockmendel's questions are very good and worth having explicit in the proposal.

Personally, I think pd.NA should be used for setting values in the boolean mask; everything else should go into the values and not have special treatment.

While it may be counterintuitive at first based on past behavior, I think it's the simplest to implement, to explain, and to understand.

@mroeschke
Member

Just posting high level concerns:

  1. While pandas still object a "string dtype", would "object_nullable" be yet-another-string-dtype with alternative NA semantics? I would be more comfortable with this type if pandas just considers object as purely PyObjects.
  2. Maybe a meta comment about this topic, but I do think discussing a "nullable object type" would be better suited to a PDEP that discusses the whole "nullable types" system, to avoid fragmentation/diversion of terminology, null semantics, etc., as inevitably the other types will be discussed as well.

@datapythonista
Member

Sorry if I missed it, but is the plan to implement the nullable object backed by numpy arrays only? I don't think we have an object dtype based on Arrow as Polars does, right?

Back to the discussion about naming, I think pyobject[pyarrow] and I guess pyobject[numpy] would be my preference. I think pyobject is more explicit and much clearer than object.

@simonjayhawkins
Member Author

A few API questions it'd be helpful to see addressed explicitly:

  1. `ser[0] = np.nan`: does this assign NaN, or does it silently replace with pd.NA?
  2. Same question with NaN in a list passed to the constructor.
  3. Same questions but with None or pd.NaT.
  4. `ser = pd.Series([pd.NA, None, pd.NaT, np.nan], dtype="object_nullable")`: assuming we don't silently replace, what do isna/fillna/skipna do?

What I've said in the initial draft is:

The proposed nullable object array will be
unable to hold np.nan, None or pd.NaT as these will be
considered missing in the constructors and other conversions
when following the existing API for the other nullable
types. Users will not be able to round-trip between the
legacy and nullable object dtypes.

So I'm assuming that, to ease the transition, all of these will be treated as missing, updating the mask appropriately and represented as pd.NA. So my assumption is that we do silently replace. Do you disagree with this, or perhaps prefer warnings for the assignment?
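
For comparison, the existing nullable dtypes already fold these values into the mask at construction time (pandas 2.x behavior, usual pd/np imports assumed), which is the precedent the draft follows:

s = pd.Series([1, None, np.nan], dtype="Int64")
s
# 0       1
# 1    <NA>
# 2    <NA>
# dtype: Int64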

@simonjayhawkins
Member Author

Personally, I think pd.NA should be used for setting values in the boolean mask; everything else should go into the values and not have special treatment.

While it may be counterintuitive at first based on past behavior, I think it's the simplest to implement, to explain, and to understand.

Seems reasonable. I assumed we would match the behavior of the other nullable extension arrays for consistency. I'll audit this before opening the official discussion period.

Thanks for highlighting this.

@jbrockmendel
Member

Oh one more:

  1. `ser = pd.Series([2, pd.Timedelta(1)], dtype="nullable_object")`: what happens with `ser / 0`?

@simonjayhawkins
Member Author

2. Maybe a meta comment about this topic, I do think discussing a "nullable object type" would be better suited in a PDEP that discusses all "nullable types" (system) to avoid fragmentation/diversion of terminology, null semantics, etc. as inevitability the other types will be discussed as well.

Yes, I wanted to discuss this in PDEP-16, as the type mapping in the first commit showed that the traditional numpy object dtype would be retained. It is a few days away from a full year since that was opened, and the PDEP is incomplete with no discussion. Sitting on a draft PDEP is harming the project. If we could get PDEP-16 moving then I would probably not have opened this.

I think being part of the bigger discussion is crucial if object_nullable were to become a default in the future. I did not explicitly state that in this initial draft, even though I made the comment that could be interpreted as the motivation for this PDEP. As I said in #61599 (comment), I'm happy to remove that.

The motivation is just to create a dtype consistent with the other pandas nullable dtypes. We have Int, Float, Boolean etc. but do not have one for object.

@simonjayhawkins
Member Author

Oh one more:

  1. `ser = pd.Series([2, pd.Timedelta(1)], dtype="nullable_object")`: what happens with `ser / 0`?

Under the current proposal these would be missing in the mask (represented as the pd.NA scalar). If we instead do what @datapythonista suggests in #61599 (comment), then this would yield np.nan, which would be a sentinel value.
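
A comment-only sketch of the two options, since neither the dtype nor either behavior exists yet (the dtype alias is the draft's tentative name; both outcomes are hypothetical):

# ser = pd.Series([2, pd.Timedelta(1)], dtype="object_nullable")
# res = ser / 0

# Option A (current draft, mask-based): results that are missing-like
# (e.g. from the failed integer division) are recorded in the mask and
# surface as <NA>.

# Option B (only pd.NA is special): the raw results, np.nan included,
# are stored as ordinary objects in the values array, with no mask update.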

@simonjayhawkins
Member Author

  1. While pandas still object a "string dtype", would "object_nullable" be yet-another-string-dtype with alternative NA semantics? I would be more comfortable with this type if pandas just considers object as purely PyObjects.

To be compatible with PyArrow and Polars?

@mroeschke
Member

While pandas still object a "string dtype", would "object_nullable" be yet-another-string-dtype with alternative NA semantics? I would be more comfortable with this type if pandas just considers object as purely PyObjects.

To be compatible with PyArrow and Polars?

Sorry, I had some typos in that statement. I meant to say "While pandas still considers object a "string dtype"".

Well, there's no compatibility with those two, since they don't have a type to store arbitrary PyObjects. I think it's a potential "strength" that pandas can provide a "nullable object dtype" to store PyObjects with null semantics. My concern is pandas still conflating object with "string".

So for example, str APIs still work with object type. Would you expect str APIs to work with object_nullable?

@simonjayhawkins
Member Author

The motivation is just to create a dtype consistent with the other pandas nullable dtypes.

Behind a flag. I'm not necessarily making any commitment here to changes to the existing pandas API; the aim is to allow evaluation of the concept, as we did at first with the other pandas nullable dtypes. The StringDtype was available for a long time before being made a pandas default type. I expect object_nullable to be experimental for a long time, ironing out any consistency issues with the other pd.NA variants of the pandas nullable dtypes, and with the legacy object dtype's handling of values that are considered missing, as @jbrockmendel highlighted.

@simonjayhawkins
Member Author

While pandas still object a "string dtype", would "object_nullable" be yet-another-string-dtype with alternative NA semantics? I would be more comfortable with this type if pandas just considers object as purely PyObjects.

To be compatible with PyArrow and Polars?

Sorry, I had some typos in that statement. I meant to say "While pandas still considers object a "string dtype"".

Well, there's no compatibility with those two, since they don't have a type to store arbitrary PyObjects. I think it's a potential "strength" that pandas can provide a "nullable object dtype" to store PyObjects with null semantics. My concern is pandas still conflating object with "string".

Ah I see; I interpreted that as meaning we could have an Arrow array containing pointers to the Python objects.

The nullable string array is a nullable object array with constraints. The nullable object array would not be a string array. The constraints are its strengths, as we don't have the np.nan/None/pd.NaT issue?

So for example, str APIs still work with object type. Would you expect str APIs to work with object_nullable?

Only if object dtype retains the str accessor. If we remove it from object then it would not be needed on "object_nullable" either.

@mroeschke
Member

The nullable object array would not be a string array.

+1, and hopefully that can give us inspiration to decouple the base object dtype from meaning "string" in the future (not for this PDEP)

The constraints are its strengths, as we don't have the np.nan/None/pd.NaT issue?

That, and a general observation that users continue to store arbitrary objects (numpy arrays, pandas objects, custom classes) in pandas objects, so providing nullability for those is, I guess, a plus.

@simonjayhawkins
Member Author

While pandas still object a "string dtype", would "object_nullable" be yet-another-string-dtype with alternative NA semantics?

Hopefully not.

The intention was an array with exactly the same NA semantics as the other pd.NA variants of the pandas nullable types, i.e. the original nullable string dtype, Int, Float, Boolean etc. There should be no differences, in my opinion. The issues regarding handling of np.nan are perhaps similar to the array of issues for the pandas nullable float type, if we allow it. If we don't, it's not an issue, but the behavior of object and object_nullable would be different. In an ideal world we would perhaps want as much backwards compatibility as possible.
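
For reference, the existing nullable float type already folds NaN into the mask at construction time (pandas 2.x behavior, usual pd/np imports assumed):

arr = pd.array([1.0, np.nan], dtype="Float64")
arr
# <FloatingArray>
# [1.0, <NA>]
# Length: 2, dtype: Float64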

If we don't have pyarrow as a required dependency, then I would expect object-backed variants of the new dtypes. These would presumably be based on a nullable object array with constraints. For these cases the constraints would be their strength. I've not included that as a motivation either at this point. If the discussion here outlasts the PDEP-10/PDEP-15 discussion, which I expect it to following the timescales in PDEP-1, that could be added depending on the outcome. I'm assuming PDEP-10 is rejected. If PyArrow is made a required dependency for 3.0, then maybe we could shift focus to a nullable object array backed by pyarrow pointers to Python objects instead.

@simonjayhawkins
Member Author

So for example, str APIs still work with object type. Would you expect str APIs to work with object_nullable?

only if object dtype retains the str accessor. If we remove it from object then it would not be needed from "object_nullable"

I see that discussing accessors is a glaring omission from my draft.

@simonjayhawkins
Member Author

There can surely be other considerations, but so far +1 on not supporting .str (or other accessors) in the new type.

Interesting idea. The proposed nullable object dtype with a list accessor would be well suited as the return type of str.split(expand=False) on the pd.NA variant of the pandas nullable string dtype.
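
The case in question, as it behaves today as far as I can tell (pandas 2.x, usual pd import assumed): split on the nullable string dtype hands back a legacy object column that mixes Python lists with pd.NA:

s = pd.Series(["a,b", pd.NA], dtype="string")
s.str.split(",")
# 0    [a, b]
# 1      <NA>
# dtype: object  <- falls back to legacy object, with pd.NA embedded
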

@simonjayhawkins
Member Author

In that sense, what makes sense to me is that the PyObject type provides the functionality to "fix" the data, so it can be transformed and converted to another type with the specific functionality. Feels like .map(), .astype() and not much more should be enough, if I'm not missing anything. I don't think .map() to run arbitrary Python code on Python objects is an unreasonable choice, and since they are PyObjects anyway, I don't think it should be significantly slower than the accessor methods.

That's also an interesting idea. Designing a nullable object array from scratch instead of trying to match all the functionality of the current numpy object array!

@datapythonista
Member

That's also an interesting idea. Designing a nullable object array from scratch instead of trying to match all the functionality of the current numpy object array!

For context, I do think we're requiring PyArrow in pandas 3.0, and using PyArrow types by default. My comments here are based on that. Unfortunately we need to wait an extra month to find out. But spending the next two years reinventing the nullable types that PyArrow already gives us for free seems a very poor investment of our time. And even if we do that, I'd do it by continuing to release 2.x versions. We'll continue this discussion in the appropriate threads; I just wanted to clarify that we are probably discussing this new PyObject type from very different points of view, as PDEP-15 is keeping the direction of the project very uncertain.

@simonjayhawkins
Member Author

For context, I do think we're requiring PyArrow in pandas 3.0, and using PyArrow types by default.

Where has this been discussed? My understanding is that the PyArrow types are just experimental and the pandas nullable types will be the default in the future (as per PDEP-16, when it's done).

@datapythonista
Member

This can't be discussed until there is clarity about PDEP-10 and PDEP-15, in my opinion. And it's been something like 3 years of the PyArrow types being experimental anyway, so I think it's worth discussing and considering if we do require PyArrow, no?

@simonjayhawkins
Member Author

I read the above as using PyArrow types by default in 3.0. That can't happen unless we postpone the 3.0 release and, as per our deprecation policy, have at least two minor releases in 2.x with the appropriate warnings for the breaking changes.

@jbrockmendel
Member

I read the above as using PyArrow types by default in 3.0

I don't think I've seen that seriously considered/discussed anywhere. There's a path to making it feasible to use pyarrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane.

@simonjayhawkins
Member Author

Doing it for 3.0 would be, frankly, insane.

I was initially shocked by the statement. I thought I had missed some significant discussion. Glad I misread it.

There's a path to making it feasible to use pyarrow types by default...

Where has this been discussed?

@jbrockmendel
Member

Oh one more:

`ser = pd.Series([2, pd.Timedelta(1)], dtype="nullable_object")`: what happens with `ser / 0`?

Under the current proposal these would be missing in the mask (represented as the pd.NA scalar). If we instead do what @datapythonista suggests in #61599 (comment), then this would yield np.nan, which would be a sentinel value.

The "current proposal" option here is inconsistent with the Floating/Integer behavior. But it is consistent with the setitem/constructor behavior. Is everyone OK with that?

@jbrockmendel
Member

There's a path to making it feasible to use pyarrow types by default [...]

Where has this been discussed?

I'm not aware that it has been. I'm just claiming that a path exists. A long, difficult path.

@simonjayhawkins
Member Author

With some major changes to the makeup of the active contributors in the past few years, it may be that a high proportion of the current active contributors favor the ArrowDtype over the numpy-backed pandas nullable arrays. Maybe we need to have this discussion sooner rather than later? I've always assumed that the pandas nullable types would be the future defaults and have not challenged that. But if that is not the direction the majority of the voting members now want, moving forward is going to be more difficult.

@datapythonista
Member

Sorry Simon, I ended up hijacking the discussion here. I created #61618 to discuss PyArrow default types, so we can keep this PR / PDEP focused on its goal.

@simonjayhawkins
Member Author

Thanks @datapythonista for opening #61618. I just seek clarity and will surely consider the proposal, especially as PDEP-16 has failed to involve the community and other core developers on an "agreement" made 2 years ago.

@simonjayhawkins
Member Author

so we can keep this PR / PDEP focused on its goal.

Yes, I made some assumptions in the proposal about the future direction of pandas, PDEP-15 and PDEP-16 in particular. I'm not at all hostile to the content of PDEP-16, only frustrated at not moving to a vote to date, so that the PDEP is accepted and the item is on our "roadmap".

I've had discussions with @WillAyd when reviewing the pandas cookbook related to the future direction of pandas, which is not clear without approved PDEPs and roadmap items. The cookbook promotes the pandas nullable dtypes as pandas' preferred approach, but I'm sure Will would have preferred to use the ArrowDtypes in the examples. Will, be sure to correct me if I'm in any way misinterpreting/misrepresenting you.

So, with respect to PDEP-16, it is clarity that I seek, to reduce the potential harm to the project of core devs pulling in two different directions.

This proposal was intended to align with what I assumed were the goals of PDEP-16, i.e. a transition path to pandas nullable type by default.

@WillAyd
Member

WillAyd commented Jun 10, 2025

I've had discussions with @WillAyd when reviewing the pandas cookbook related to the future direction of pandas, which is not clear without approved PDEPs and roadmap items. The cookbook promotes the pandas nullable dtypes as pandas' preferred approach, but I'm sure Will would have preferred to use the ArrowDtypes in the examples. Will, be sure to correct me if I'm in any way misinterpreting/misrepresenting you.

Kind of... I ultimately just wanted to teach people a way that they could easily and consistently express the types of their data in pandas and operate on it. Completely ignoring the numpy-versus-extension-versus-arrow question unfortunately was rather challenging, because the choice of any of those can have drastic implications in terms of performance, usability, and subsequent type usage. I personally find that unfortunate and would prefer we get out of caring about the physical implementation of a type. PDEP-14 was my aim to address that, but I have since abandoned it.

A few API question it'd be helpful to see addressed explicitly:

1. `ser[0] = np.nan`: does this assign NaN, or does it silently replace with pd.NA?

FWIW we had a lengthy discussion about this in PDEP-16 already with respect to floating point data types, and the idea was that even though it's less "correct," given our history assigning np.nan should still assign a missing value. See #58988 (comment) - I think that same logic should hold here.
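
For reference, the Float64 precedent (pandas 2.x behavior as I understand it, usual pd/np imports assumed): assigning np.nan records a missing value rather than storing a NaN:

s = pd.Series([1.5, 2.5], dtype="Float64")
s[0] = np.nan  # treated as missing, not stored as a NaN value
s
# 0    <NA>
# 1     2.5
# dtype: Float64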

Overall, on the proposal, I think there's some merit to doing this. My main hesitation is how we guide users into using new types like this. A good example would be users that truly have list-like or dict-like data; we have the potential as a library to better serve those types, especially if we follow through on PDEP-10. Of course, today people are using object type to store those values and this wouldn't change that, but I feel like we are sending them down a winding path if we tell them to use Object going forward, get them used to the semantics there, and then try to port them to a List or Dict type in the future.

@simonjayhawkins
Member Author

but I feel like we are sending them down a winding path if we tell them to use Object going forward, get them used to the semantics there, and then try to port them to a List or Dict type in the future

My intention was that the nullable object array is experimental, and I agree that this transition path is a bad idea. That was not an intention of the proposed dtype or a suggested path, but I understand the concern that if the dtype were available some users would use it for that purpose. My assumption is that it would be a base class for object-fallback nested datatypes if PyArrow remained an optional dependency.

My preference is to avoid object dtype where possible in the first instance, adding the new datatypes in PDEP-10 and then using "object_nullable" for what's left. So I envisioned a different transition path: object to the new datatypes, followed by the remaining object to "object_nullable".

There seems to be some interest, though, in a PyArrow-backed nullable object array, and I had not considered that as part of the proposal, only from the perspective that, with pd.NA as the missing value indicator, "object_nullable" would be more compatible with the ArrowDtypes and avoid mixed missing-value semantics for Arrow users with some object data columns.

@simonjayhawkins
Member Author

Oh one more:

  1. `ser = pd.Series([2, pd.Timedelta(1)], dtype="nullable_object")`: what happens with `ser / 0`?

@jbrockmendel, this is a more interesting example.

The other four fall squarely in the domain of missing-value handling.

This one could be considered more conceptual, and could drive the motivation for the need (or not) of a nullable object array.

Many discussions revolve around the use of object for types that aren't supported natively in the columnar context. Hence, presumably, the suggestions to provide a much simpler API that just allows mapping the objects to a supported type using a UDF.

However, this example effectively shows a heterogeneous array of supported data types, often the result of selecting a row, or of other operations or transformations using the axis argument. Now, I'm not sure how that's handled today with the ArrowDtype system.

I think this example potentially supports the current proposal that the nullable object type would need to match the behavior of the legacy object dtype as much as possible, and I think that would also require all the accessors available on the numpy object array in order to selectively process heterogeneous arrays of supported types.

@simonjayhawkins
Member Author

simonjayhawkins commented Jun 10, 2025

df = pd.DataFrame(
    {
        "Int64": pd.Series([1, 2, pd.NA], dtype="Int64"),
        "string": pd.Series(["a", "b", pd.NA], dtype="string"),
    }
)
df.iloc[2]
# Int64     <NA>
# string    <NA>
# Name: 2, dtype: object

Should pandas be putting <NA> into a traditional numpy object array, or should this be np.nan (or even None or pd.NaT)?

@WillAyd
Member

WillAyd commented Jun 10, 2025

Should pandas be putting <NA> into a traditional numpy object array, or should this be np.nan (or even None or pd.NaT)?

This is a good example, and also something where I'd pitch that a logical type system with a "Missing" type would serve us better than trying to stuff this into an object type.

@jbrockmendel
Member

a "Missing" type

In the past, when the idea of a missing dtype has come up, it has been in the context of pd.Series() with no data passed. The main use case is in pd.concat, where any entries with missing dtype would be ignored when determining the result dtype. IIUC what you are suggesting is distinct from that?

@WillAyd
Member

WillAyd commented Jun 10, 2025

The case @simonjayhawkins just showed, where you can have an array of all missing values. I wouldn't expect a user to create that directly, but it can certainly happen throughout a sequence of operations. It's easier to have a dedicated missing type, inspect its metadata as it is passed along, and branch accordingly, rather than having an object dtype and then having to inspect each element to determine if it is all missing.

@jbrockmendel
Member

Uhh, if the motivating use case is the example Simon posted of df.iloc[2], I definitely don't think the dtype should depend on whether the values are all-missing. We want to avoid value-dependent behavior. Besides, I read Simon's comment as suggesting he wanted to be able to distinguish between a float-missing and a string-missing.

@simonjayhawkins
Member Author

Besides, I read Simon's comment as suggesting he wanted to be able to distinguish between a float-missing and a string-missing.

Ignoring PDEP-16 for now, I think I was expecting that the current behavior, using numpy missing-value semantics, would in this case be np.nan for the Int64 value, np.nan/None for the string (to match legacy string, which was object), and pd.NaT if the columns had been temporal. This assumption is based on the return type being a traditional numpy array, and so honoring the typical missing values used in object. If we had a pandas nullable object dtype then pd.NA would be fine, but pd.NA in object doesn't work so well.

@simonjayhawkins
Member Author

but pd.NA in object doesn't work so well.

and IIRC I have in some cases been telling users not to do it when they submit a bug report. So I was surprised that pandas can produce pd.NA in object type.

So, for example, in #60049, where the user expects the behavior of a nullable boolean by inserting a pd.NA value, I've tended to tell them that their expectations and usage are incorrect.

@simonjayhawkins
Member Author

but pd.NA in object doesn't work so well.

The first case in the issue linked in the PDEP, #32931, is:

pd.Series([1, pd.NA], dtype=object) >= 1
# 0     True
# 1    False
# dtype: bool

So, as a temperature check: do others feel that putting pd.NA in an object array is inappropriate usage? Perhaps we shouldn't allow it? Or should we handle it correctly?

@WillAyd
Copy link
Member

WillAyd commented Jun 10, 2025

So, as a temperature check: do others feel that putting pd.NA in an object array is inappropriate usage? Perhaps we shouldn't allow it? Or should we handle it correctly?

I think there is very little value in trying to change this; the semantics are really unclear and overloaded in a variety of contexts. It's just a gap in the existing type system.

@simonjayhawkins
Member Author

and IIRC I have in some cases been telling users not to do it when they submit a bug report. So I was surprised that pandas can produce pd.NA in object type.

#61182 is a case where the issue may have been incorrectly closed.

@simonjayhawkins
Member Author

It's just a gap in the existing type system.

Not really; the existing type system didn't suffer from any of these issues until pd.NA was introduced. And users expect pd.NA to be a pandas missing value. I'm not prepared to dismiss all the issues regarding pd.NA in object type so easily. There are issues now dating back over five years, and hoping they will all go away when we adopt nullable types by default someday does not, to me, seem like a good plan.

@simonjayhawkins
Member Author

However, this example effectively shows a heterogeneous array of supported data types, often the result of selecting a row, or of other operations or transformations using the axis argument. Now, I'm not sure how that's handled today with the ArrowDtype system.

I'm seeing numpy object dtype for row selection with Arrow types. This surely can't be right for the ArrowDtype system? What's the plan if we go for Arrow by default?
