
ENH: Allow opting in to new dtypes via keyword to I/O routines #29752

Closed
Dr-Irv opened this issue Nov 20, 2019 · 60 comments
Labels: Enhancement · ExtensionArray (Extending pandas with custom dtypes or arrays) · IO Data (IO issues that don't fit into a more specific label) · NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

@Dr-Irv (Contributor) commented Nov 20, 2019

With the new dtypes (IntegerArray, StringArray, etc.), if you want to use them when reading in data, you have to specify the types for all of the columns. It would be nice to have the option to use the new dtypes for all columns as a keyword to read_csv(), read_excel(), etc.

(ref. discussion in pandas dev meeting on 11/20/19)
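
For illustration (added here, not part of the original report): a minimal sketch of today's column-by-column workaround versus the proposed opt-in keyword. The keyword name is hypothetical at this point in the thread.

```python
import pandas as pd

# Today: nullable dtypes must be spelled out per column.
df = pd.read_csv(
    "data.csv",
    dtype={"id": "Int64", "name": "string", "active": "boolean"},
)

# Proposed: a single opt-in flag (hypothetical keyword).
# df = pd.read_csv("data.csv", use_nullable_dtypes=True)
```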

@jorisvandenbossche added the ExtensionArray label Nov 23, 2019
@jorisvandenbossche added this to the 1.0 milestone Nov 23, 2019
@TomAugspurger (Contributor) commented:

@jorisvandenbossche is this likely for 1.0? My initial preference is to add this in the future (as an option in 1.1, say).

@jorisvandenbossche modified the milestones: 1.0, 1.1 Dec 21, 2019
@jorisvandenbossche (Member) commented:

It's certainly not release critical, so let's remove from the milestone.

@jorisvandenbossche (Member) commented Dec 21, 2019

So do we have an idea of how we would like to tackle this? I think the constructors (e.g. DataFrame(..)) could also use a similar option.

An option like use_new_dtypes=True/False used consistently across functions that create dataframes/series?
Or a better name, as "new" is not very descriptive. use_nullable_dtypes might not fully cover e.g. strings, as those were already nullable before.

@Dr-Irv (Contributor, Author) commented Dec 23, 2019

I'd like to argue that if you want to get people to use the new dtypes, and especially pd.NA, then this becomes pretty important, because IMHO, most missing values are introduced when people read in data. So if you don't change the I/O routines, pd.NA is unlikely to get used.

As for names, maybe use_extension_dtypes ??

@jorisvandenbossche (Member) commented:

What I dislike about use_extension_dtypes is that it sounds like an extension to pandas, which is not the case here. The hope is that at some point those become the default dtypes.
(I know, the whole thing is called Extension..., but still.)

@Dr-Irv (Contributor, Author) commented Dec 23, 2019

How about use_distinct_dtypes ?

@Dr-Irv (Contributor, Author) commented Jan 2, 2020

@jorisvandenbossche I started to look at this, and if we did it for each reader, it might end up being a lot of work because of all the different reader implementations. Here's another proposal: what if we created a method DataFrame.as_nullable_types() that would take a DataFrame and convert any column it could to a nullable type? Then, if you used any reader that didn't use the new types, you could convert the entire DataFrame in one line, e.g. df = pd.read_csv('filename.csv').as_nullable_types() or df = pd.read_excel('filename.xlsx').as_nullable_types(). The rules could look something like this:

  1. If dtype is object, convert to string. If it fails, leave it alone.
  2. If dtype is float, try conversion to boolean. If that fails, try conversion to Int64. If that fails, leave it alone.

My goal here is to make it easy for people to use the new StringDtype, Int64Dtype, and BooleanDtype. If we don't do something like this, I don't think those types will get exercised, because missing values are typically encountered when reading data, and it is painful to have to specify the dtype for each column when reading data with lots of columns.
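
A rough sketch of the proposal above (added for illustration; as_nullable_types and the conversion rules are hypothetical, not an existing pandas API — pandas 1.0 ultimately shipped this idea as DataFrame.convert_dtypes(), see #30929 later in the thread):

```python
import pandas as pd


def as_nullable_types(df: pd.DataFrame) -> pd.DataFrame:
    """Convert each column to a pd.NA-backed dtype where possible."""
    out = {}
    for col in df.columns:
        s = df[col]
        if s.dtype == object:
            # Rule 1: object -> nullable string; leave it alone on failure.
            try:
                out[col] = s.astype("string")
            except (TypeError, ValueError):
                out[col] = s
        elif pd.api.types.is_float_dtype(s.dtype):
            # Rule 2: float -> boolean if the non-NA values are all 0/1,
            # else Int64 if they are whole numbers, else leave it alone.
            non_na = s.dropna()
            if non_na.isin([0.0, 1.0]).all():
                out[col] = s.astype("boolean")
            elif (non_na % 1 == 0).all():
                out[col] = s.astype("Int64")
            else:
                out[col] = s
        else:
            out[col] = s
    return pd.DataFrame(out)


df = pd.read_csv("filename.csv").pipe(as_nullable_types)
```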

@jorisvandenbossche (Member) commented:

if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations

If we want to have efficient methods, we will probably need to end up with reader-specific implementations anyway, I think. But that doesn't mean, of course, that all of the readers need to support it natively; we can start with the important ones and have the others convert after reading. For example, I was planning to work on a parquet reader that directly gives you nullable integers (which avoids a copy and an unneeded roundtrip to float). All that to say: I think it is still useful to have it as a reader option as well in some cases.

But that said, a helper method/function to convert an existing DataFrame into a new DataFrame using nullable types sounds like a good idea; that will be useful anyway.
Conceptually, it is somewhat similar to DataFrame.infer_objects ("Attempt to infer better dtypes for object columns."), except here we want to infer better dtypes for all columns.

@Dr-Irv (Contributor, Author) commented Jan 7, 2020

@jorisvandenbossche I'll work on the helper method, as I think it should be in 1.0, and then later we can figure out how to change the various readers (and in which order) for a later version.

@jorisvandenbossche (Member) commented:

I put up a PR implementing a use_nullable_dtypes option for read_parquet specifically: #31242
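
For reference, a minimal usage sketch of what that PR proposes (the keyword did eventually ship for read_parquet; the exact release and behavior are not covered in this thread):

```python
import pandas as pd

# Opt in to pd.NA-backed dtypes when reading Parquet, e.g. integer columns
# come back as Int64 instead of being upcast to float64 when values are missing.
df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)
print(df.dtypes)
```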

@jorisvandenbossche (Member) commented:

In the PR adding an option to read_parquet, we are having a discussion that is getting more general, about such an option in IO and about what to expect from such an option (not only the name), so I am moving it here.

@jreback and @WillAyd mentioned that they prefer use_extension_dtypes over use_nullable_dtypes, to which I replied:

For me, the main reason to not use use_extension_dtypes is: 1) this option does not cause extension dtypes to be returned in general. For example, it does not trigger returning categorical or datetimetz (as those are already returned by default by pyarrow), and it does not trigger returning period or interval (those can be returned based on metadata saved in the parquet file / pyarrow extension types); in both cases, extension types will be returned even with use_extension_dtypes=False. In contrast, I find use_nullable_dtypes clearer in communicating the intent*.
In addition, and more semantically, "extension" types can give the idea of being about "external" extension types (but this is a problem with the term in general, so not that relevant here).

*I think we are going to need some terminology to denote "the dtypes that use pd.NA as missing value indicator". Also for our communication (and discussions) about it, in the docs, etc., it would be good to have a term that we can use consistently. I think "nullable dtypes" is an option for this (we have already used "nullable integer dtype" for a while in the docs), although it is certainly not ideal, since strictly speaking other dtypes are also "nullable" (float, object, datetime), just in a different way.
Maybe having this more general discussion can help us find matching keyword names afterwards.

@jorisvandenbossche (Member) commented:

Reply from @WillAyd:

Sure as some quick counter arguments:

  • The semantics are unclear to an end user; I would think most consider np.float to be nullable which this wouldn't affect
  • Some of the arguments for its clarity are specific to parquet, but I think become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)
  • If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

The third point is probably the one I think is the biggest issue.

@Dr-Irv (Contributor, Author) commented Jan 23, 2020

Reply from @WillAyd:

  • If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

The third point is probably the one I think is the biggest issue.

I agree this is a compelling argument to not use as_nullable_dtypes. Here are some ideas:

  • as_NA_dtypes (ones supporting pd.NA)
  • as_modern_dtypes (since anything we'd want to support would be the most modern ones)
  • as_endorsed_dtypes (then we can determine which are endorsed/recommended ones)

Hoping that stimulates the discussion!

@jorisvandenbossche (Member) commented:

Thanks @WillAyd for those arguments. From that, it seems we still need to discuss / clarify what exact purpose we want to achieve with such an option.

The semantics are unclear to an end user; I would think most consider np.float to be nullable which this wouldn't affect

Yes, that's what I mentioned about "nullable" not being ideal. But that's a general issue when speaking about those dtypes, and, as mentioned above, I think we need to find some term for that.
If we clearly define what we mean by "nullable dtype" in the docs and use it consistently throughout the docs for that purpose, I think a term like that can work (IMO, it's at least better than no consistent term).
Also, at some point we might want to have a float dtype that uses pd.NA as missing value, so then too we need a term to distinguish it from the "classic" float dtype ("nullable float dtype"?).

Some of the arguments for its clarity are specific to parquet, but I think become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)

Parquet is one of the formats with the most type information, so the distinction between extension types in general and nullable types specifically is indeed most relevant there. When reading e.g. CSV you indeed can't get categoricals. But it's not fully limited to parquet: read_feather and read_orc are also based on pyarrow, so they have the same type support. You can get categoricals from read_stata and read_spss, and you can get datetimetz from read_sql and (maybe?) read_excel.

If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

Yes, if those new extension types don't use pd.NA as missing value indicator, they would purposefully not fall under this keyword. This issue here is really specifically about those dtypes using pd.NA, as those have a different behavior for operations involving missing values.
Now, I would personally argue that we shouldn't add new extension dtypes that don't use pd.NA, but that's another discussion. It's also difficult to discuss such a hypothetical case; one concrete example that has come up is something struct/JSON-like: those probably can't be stored in typical file formats like CSV anyway (and also, we could probably use pd.NA as the missing value there).

@jorisvandenbossche (Member) commented Jan 23, 2020

If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

The third point is probably the one I think is the biggest issue.

I agree this is a compelling argument to not use as_nullable_dtypes.

To highlight a single point of my long post above that was answering this aspect: IMO, if such a new dtype is not using pd.NA, it should not fall under this keyword. So if we do that (debatable of course), then whatever name we come up with (like use_NA_dtypes) will have the exact same problem.

@WillAyd (Member) commented Jan 23, 2020

Slightly different direction, but what about something like na_float_cast=True as a default? I think it's clearer on intention and also doesn't force the use of an extension dtype unless NA values are actually detected, which could help save memory.

@jorisvandenbossche (Member) commented:

Slightly different direction, but what about something like na_float_cast=True as a default? I think it's clearer on intention and also doesn't force the use of an extension dtype unless NA values are actually detected, which could help save memory.

That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)

It's true that not doing it for all columns can save memory, but personally I would prefer doing it for all columns: 1) you get a consistent result that depends on the "logical" type of your column, not on the presence of missing values (e.g. if you only read in a part of the file, this can already differ); 2) missing values can also be introduced after reading (e.g. by reindexing, merge, ...), and then having a nullable integer dtype ensures it doesn't get cast to float, even if the original data didn't have NaNs.
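
A small illustration (added here, not from the thread) of point 2: with the NumPy-backed int64 dtype, missing values introduced after reading silently turn the column into float, while the nullable Int64 dtype keeps it integer:

```python
import pandas as pd

s_np = pd.Series([1, 2, 3], dtype="int64")
s_na = pd.Series([1, 2, 3], dtype="Int64")

# Reindexing introduces a missing value for the new label 3.
print(s_np.reindex([0, 1, 2, 3]).dtype)  # float64 -> values become floats, hole is NaN
print(s_na.reindex([0, 1, 2, 3]).dtype)  # Int64   -> values stay integers, hole is <NA>
```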

@WillAyd (Member) commented Jan 23, 2020

That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)

Yea I agree - a clarification on that intent definitely drives this.

I don't think it's worth adding the mask unless needed; it can certainly have non-trivial memory impacts. If my simple math is right, for a 10 million row by 10 column block of integer values, adding the mask would require at least 100 MB more in memory.
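
The arithmetic behind that estimate (spelled out here for clarity): a byte-per-element boolean mask adds one byte per cell on top of the 8 bytes per int64 value.

```python
rows, cols = 10_000_000, 10
data_bytes = rows * cols * 8   # int64 values: 800 MB
mask_bytes = rows * cols * 1   # boolean mask:  100 MB extra (~12.5% overhead)
print(data_bytes / 1e6, mask_bytes / 1e6)
```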

@jorisvandenbossche (Member) commented:

I don't think it's worth adding the mask unless needed

But rather than limiting the columns that would get converted to masked / nullable dtypes with the option under discussion here, I would rather try to solve this memory concern by improving the implementation. The concrete ideas we have for this: 1) make the mask optional, so it can be None if there are no missing data (this should not be too hard to do, I think); 2) investigate using a bitmask instead of a boolean array mask (this is probably harder, as there is no standard implementation of this in Python, so it will need some custom code).

Note that the nullable dtypes are still experimental anyway (there are quite a few operations that don't work yet, some things are slower, ...), so I think this option will, in an initial phase, mainly be for allowing people to easily experiment and try it out. And for such a use case, I think it is more useful to convert all possible columns than to address the memory concern by not converting all columns.

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

From @WillAyd

  • If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated

I had another thought on this. Because a method like Series.shift() creates entries with NA in them, I think any new extension type always needs to do something about NA, and IMHO we would want them to use pd.NA and not np.nan to represent a "missing value".

Which may mean that a keyword such as use_missing_value_dtype might make sense, although it's a lot to type.

And if we really want to stress that this is all about missing values, using pd.MV instead of pd.NA might help get that point across, but that's probably opening up a whole other can of worms.

@jorisvandenbossche (Member) commented:

@WillAyd regarding the memory issues of masked arrays: there is #30435 about making the mask optional and #31293 about exploring bitarrays for the mask.

a keyword such as use_missing_value_dtype might make sense

Since np.nan in float dtype is also used as "missing value", I am not sure this is less ambiguous than use_nullable_dtype (given the argument against use_nullable_dtype that there are other dtypes that are also "nullable" without using pd.NA).

as_NA_dtypes (ones supporting pd.NA)

This is quite explicit!
For me, a drawback of this one is that it sounds less natural as general terminology for talking about these dtypes (like "the NA dtypes" in prose text).


Another option is convert_dtypes=True/False. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in #30929

@Dr-Irv (Contributor, Author) commented Jan 24, 2020

a keyword such as use_missing_value_dtype might make sense

Since np.nan in float dtype is also used as "missing value", I am not sure this is less ambiguous than use_nullable_dtype (given the argument against use_nullable_dtype that there are other dtypes that are also "nullable" without using pd.NA).

as_NA_dtypes (ones supporting pd.NA)

This is quite explicit!
For me, a drawback of this one is that it sounds less natural as general terminology for talking about these dtypes (like "the NA dtypes" in prose text).

We could combine the two ideas, i.e., as_missing_value_NA_dtypes, and you say "the missing value NA dtypes" in prose texts to mean the dtypes that represent missing values using pd.NA

Another option is convert_dtypes=True/False. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in #30929

As the author of the above, I would vote against that in the I/O context, because convert_dtypes also has the infer_objects behavior, so we would then have different meanings of convert_dtypes in two different contexts.
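
For context (a hedged aside, not part of the comment): the DataFrame.convert_dtypes() method that came out of #30929 bundles object inference together with conversion to the pd.NA-backed dtypes, e.g.:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan],          # float64 because of the NaN
    "b": ["x", "y", None],        # object
    "c": [True, False, np.nan],   # object because of the NaN
})

converted = df.convert_dtypes()
print(converted.dtypes)           # a: Int64, b: string, c: boolean
```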

@jorisvandenbossche (Member) commented:

and you say "the missing value NA dtypes" in prose texts to mean the dtypes that represent missing values using pd.NA

For me that is fine if that's a compromise that most people can live with. In that case we should update the doc sections on "Nullable integer data type" to something like "Integer data type with NA missing value" or .. (that's a bit long for a title though)

But personally, I would just propose: let's define "nullable" as "dtype that uses NA" in the context of the pandas docs / dtypes in pandas. It's a term we haven't used for anything else up to now (we otherwise don't use it when talking about missing values; all occurrences of this word in the docs are about the new dtypes).

@jorisvandenbossche (Member) commented:

Another friendly ping ..

@WillAyd How strong is your preference? Maybe a bit difficult to answer exactly, but what I mean is: I also still have a preference for use_nullable_dtypes, and since a majority of the participating voices are OK with that, I would still like to go with use_nullable_dtypes if your preference is not too strong.
(and since I am the one pushing for it, it's a bit hard for me to make the final decision ...)

For the keyword itself, I am relatively OK with use_NA_dtypes as well, actually. But for running text in the documentation etc., I would prefer speaking about "nullable dtypes" (as we already do right now, actually). And if we do that in the docs, I think we should be consistent with the keyword as well.

@WillAyd (Member) commented May 29, 2020

I continue to think that use_NA_dtypes is better, particularly once we add the float NA types. I won't belabor the point, though.

@jreback was also dissenting on this, so we should see where he stands and go from there.

@jorisvandenbossche (Member) commented:

@WillAyd in that case, could you then answer my question about what you would do with the docs?

@WillAyd (Member) commented May 29, 2020

Sure, I think referring to them as "NA dtypes" is clearer than "nullable dtypes".

@jreback (Contributor) commented May 29, 2020

I have come around here; OK with use_nullable_dtypes, as this matches our current doc descriptions.

@TomAugspurger (Contributor) commented:

Do we consider this a blocker for 1.1? If so, anyone want to work on it?

@jreback (Contributor) commented Jun 17, 2020

I think we merge the current proposal.

@TomAugspurger (Contributor) commented:

Anyone (@Dr-Irv, @jorisvandenbossche) able to work on this? This seems worth doing for 1.1 if it only takes a few days.

@Dr-Irv (Contributor, Author) commented Jul 6, 2020

@TomAugspurger For me, it will take more than a few days, because I think the changes should be made at a pretty low level in the readers, and I'd have to figure out how that code works. The easy solution is to use convert_dtypes inside the various readers after the current read operations, but that would be inefficient from a memory standpoint.

@chrish42 (Contributor) commented Aug 5, 2020

I was sent here from issue #35576. (Although #29752 (comment) is maybe suggesting that this does need to be a separate discussion?)

Personally, the thing I care most about "turning off" in this upcoming edition of "Modern Pandas" is the fallback to the object dtype. (Because it makes things orders of magnitude slower, and it's pretty easy for it to happen "for you" behind the scenes). Yes, some of that is caused by needing pd.NA for a numpy dtype that doesn't support it, but not all cases are from that.

As a very concrete use case, I would want a way for read_csv() to always use StringDtype instead of the object dtype. (That would simplify my life a good bit.) But it's not clear to me that the "nullable dtypes" designation applies to that. In my mind, the object dtype is certainly nullable, no? So I wouldn't think of use_nullable_dtypes=True as affecting that, personally. Thoughts?

@Dr-Irv (Contributor, Author) commented Aug 5, 2020

@chrish42 Look at this comment: #29752 (comment)

The definition of a "nullable dtype" is "a pandas dtype supporting pd.NA". That includes StringDtype but not object, so once this gets implemented, you'd be able to get StringDtype as the result of pd.read_csv().
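
Until a reader keyword exists, one workaround sketch (added here; it relies on the convert_dtypes() rules and converts everything, not just strings — pass convert_integer=False / convert_boolean=False to narrow it down):

```python
import pandas as pd

# Object columns of strings come back as StringDtype; NaN-holed int columns as Int64, etc.
df = pd.read_csv("data.csv").convert_dtypes()
print(df.dtypes)
```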

@chrish42 (Contributor) commented Aug 6, 2020

@Dr-Irv Cool, thank you. That wasn't immediately clear to me. At least, unlike other dtypes, there's no reason why object couldn't support pd.NA, right? And for me, the main downside of the use_nullable_dtypes name is that even folks who don't need nullable types would benefit from setting it to True. (Well, pretty much everyone would benefit from setting it to True.) But I guess there's no perfect name here, and good documentation will have to do the rest of the job and convey to users what the name isn't conveying.

Anyways, really looking forward to the day when automatic conversions to object (because strings) and to float (because NA) are a thing of the past. So thank you all!

@TomAugspurger (Contributor) commented:

there's no reason why object couldn't support pd.NA, right?

I'd recommend against it. Algorithms need to be written to explicitly handle NA since it's so unusual.

@simonjayhawkins (Member) commented:

there's no reason why object couldn't support pd.NA, right?

I'd recommend against it. Algorithms need to be written to explicitly handle NA since it's so unusual.

See #32931 for the dedicated issue.

@jorisvandenbossche (Member) commented:

@chrish42 if one is only interested in getting the new string dtype (to avoid object dtype), it's certainly true that use_nullable_dtypes=True is not really obvious (I think that is also one of the reasons for the long discussion above).

But we certainly want a keyword to opt in to all nullable dtypes, so e.g. also nullable int and nullable bool to avoid casting to float, etc. Then the name makes more sense. Adding yet another keyword for just getting the string dtype is probably too much.

@phofl (Member) commented Sep 30, 2023

I think this is done now
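
For readers arriving later: what eventually shipped is not spelled out in this thread, so the following is a hedged summary. read_parquet gained use_nullable_dtypes first, and pandas 2.0 generalized the opt-in across readers via the dtype_backend keyword:

```python
import pandas as pd

# pandas 2.x: opt in to pd.NA-backed dtypes for the readers that support it.
df = pd.read_csv("data.csv", dtype_backend="numpy_nullable")
print(df.dtypes)  # Int64 / Float64 / string / boolean instead of int64 / float64 / object
```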
