ENH: Allow opting in to new dtypes via keyword to I/O routines #29752
Comments
@jorisvandenbossche is this likely for 1.0? My initial preference is to add this in the future (as an option in 1.1 say) |
It's certainly not release critical, so let's remove from the milestone. |
So do we have an idea of how we would like to tackle this? I think this is also relevant for the constructors (eg …). An option like … |
I'd like to argue that if you want to get people to use the new dtypes, and especially …, … As for names, maybe … |
What I dislike about |
How about |
@jorisvandenbossche I started to look at this, and if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations. Here's another proposal. What if we created a method that converts an existing DataFrame to use the new dtypes?
My goal here is to make it easy for people to use the new dtypes. |
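A minimal sketch of what such a conversion helper looks like in use (assuming the helper is the convert_dtypes() method this proposal turned into; the sample data and printed dtypes are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],             # plain NumPy int64
    "b": ["x", "y", None],      # object (strings with a missing value)
    "c": [True, False, True],   # plain NumPy bool
})

converted = df.convert_dtypes()
print(converted.dtypes)
# a      Int64
# b     string
# c    boolean
```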
If we want to have efficient methods, we will probably need to end up with reader-specific implementation anyway, I think. But, that doesn't mean of course that all of the readers need to support it natively, we can start with the important ones and have others convert it after reading. For example, I was planning to work on a parquet reader that directly gives you nullable integers (which avoids a copy and an unneeded roundtrip to float). All to say: I think it is still useful to have it as a reader option as well in some cases. But that said, a helper method/function to convert an existing dataframe into a new dataframe using nullable types sounds like a good idea, that will be useful anyway. |
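As an illustration of reading nullable integers straight from Parquet without the float round trip, here is a hedged sketch using pyarrow's types_mapper hook (the file name and the single-type mapping are assumptions for the example):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")  # placeholder path

# Map Arrow int64 columns directly to pandas' nullable Int64, so columns
# containing nulls never pass through float64 on the way in.
mapping = {pa.int64(): pd.Int64Dtype()}
df = table.to_pandas(types_mapper=mapping.get)
```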
@jorisvandenbossche I'll work on the helper method, as I think it should be in 1.0, and then later we can figure out how to change the various readers (and in which order) for a later version. |
I put up a PR implementing a helper method that does this conversion. |
In the PR adding an option to …, @jreback and @WillAyd mentioned they would rather prefer … For me, the main reason to not use … is … *I think we are going to need some terminology to denote "the dtypes that use pd.NA". |
Reply from @WillAyd: Sure, as some quick counter arguments: …
The third point would probably be the one I think is most of an issue. |
I agree this is a compelling argument to not use
Hoping that stimulates the discussion! |
Thanks @WillAyd for those arguments. From that, it seems we still need to discuss / clarify the exact purpose we want to achieve with such an option.
Yes, that's what I mentioned about "nullable" not being ideal. But that's a general issue for speaking about those dtypes. And as mentioned above, I think we need to find some term for that.
Parquet is one of the formats that has the most type information, so the distinction between extension types in general and nullable types specifically is indeed most relevant there. When reading eg csv you indeed can't get categoricals. But it's not fully limited to parquet.
Yes, if those new extension types don't use pd.NA as missing value indicator, they would purposefully not fall under this keyword. This issue here is really specifically about those dtypes using pd.NA, as those have a different behavior for operations involving missing values. |
To highlight a single point of my long post above that was addressing this aspect: IMO, if such a new dtype is not using pd.NA, it should not fall under this keyword. So if we do that (debatable of course), then whatever name we come up with (like … |
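To make the pd.NA distinction above concrete, here is a short illustration (my example, not from the thread) of which dtypes use pd.NA as their missing-value indicator and which do not:

```python
import pandas as pd

# These use pd.NA and would fall under the proposed keyword:
pd.array([1, None], dtype="Int64")        # IntegerArray: [1, <NA>]
pd.array([True, None], dtype="boolean")   # BooleanArray: [True, <NA>]
pd.array(["a", None], dtype="string")     # StringArray: ['a', <NA>]

# These extension dtypes do not use pd.NA and would not be covered:
pd.Series(pd.Categorical(["a", None]))            # missing value is NaN
pd.Series(pd.to_datetime(["2020-01-01", None]))   # missing value is NaT
```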
Slightly different direction but what about something like |
That depends on the behaviour of the option. Right now, I think the intent was to use eg the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, boolean columns get cast to object right now if there are missing values.) It's true that not doing it for all columns can save memory, but personally I would prefer doing it for all columns: 1) you get a consistent result depending on the "logical" type of your column, not on the presence of missing values (eg if only reading in a part of the file, this can already differ); 2) missing values can also be introduced after reading (for eg reindexing, merge, ..), and then having a nullable integer dtype ensures it doesn't get cast to float, even if the original data didn't have nans. |
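A small example of the second point, missing values introduced after reading (my illustration): with the default dtype a later reindex or merge silently casts integers to float, while the nullable dtype keeps the column integer:

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")
print(s.reindex([0, 1, 2, 3]).dtype)            # float64 -- the new row forces a cast

s_nullable = pd.Series([1, 2, 3], dtype="Int64")
print(s_nullable.reindex([0, 1, 2, 3]).dtype)   # Int64 -- missing value is <NA>
```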
Yea I agree - a clarification on that intent definitely drives this. I don't think it's worth adding the mask unless needed - it can certainly have non-trivial memory impacts. If my simple math is right, for a 10 million row by 10 column … |
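Filling in that arithmetic with my own numbers (not quoted from the thread): for a 10 million row by 10 column frame of int64 values, a one-byte-per-element mask adds roughly an eighth of the data size, which is what the optional-mask and bitmask ideas below address.

```python
rows, cols = 10_000_000, 10

data_mb = rows * cols * 8 / 1e6      # int64 payload:            800.0 MB
mask_mb = rows * cols * 1 / 1e6      # boolean (byte) mask:      100.0 MB
bitmask_mb = rows * cols / 8 / 1e6   # hypothetical bitmask:      12.5 MB

print(data_mb, mask_mb, bitmask_mb)
```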
But rather than limiting the columns that would get converted to masked / nullable dtypes with the option under discussion here, I would rather try to solve this memory concern by improving the implementation. The concrete ideas we have for this: 1) make the mask optional, so it can be None if there are no missing data (this should not be too hard to do, I think); 2) investigate using a bitmask instead of a boolean array mask (this is probably harder, as there is no standard implementation of this in Python, so that will need some custom code). Note that the nullable dtypes are still experimental anyway (there are quite some operations that don't work yet, there are things that are slower, ..), so I think this option will in an initial phase mainly be for allowing people to easily experiment and try it out. And for such a use case, I think it is more useful to convert all possible columns instead of addressing the memory concern by not converting all columns. |
From @WillAyd: …
I had another thought on this. Because of a method like …, … Which may mean that a keyword such as … And if we really want to stress that this is all about missing values, using … |
@WillAyd regarding the memory issues of masked arrays: there is #30435 about making the mask optional and #31293 about exploring bitarrays for the mask.
Since np.nan in float dtype is also used as "missing value", I am not sure this is less ambiguous than
This is quite explicit! Another option is |
We could combine the two ideas, i.e.,
As the author of the above, I would vote against that in the I/O context because |
For me that is fine if that's a compromise that most people can live with. In that case we should update the doc sections on "Nullable integer data type" to something like "Integer data type with NA missing value" or .. (that's a bit long for a title, though). But personally, I would just propose: let's define "nullable" as "dtype that uses NA" in the context of the pandas docs / dtypes in pandas. It's a term we didn't use for anything else up to now (we otherwise don't use it when talking about missing values; all occurrences of this word in the docs are about the new dtypes). |
Another friendly ping .. @WillAyd How strong is your preference? Maybe a bit difficult to answer exactly, but meaning: I also still have a preference for …. For the keyword itself, I am relatively OK with … |
I continue to think that … @jreback was also dissenting on this, so we should see where he stands and go from there |
@WillAyd in that case, could you then answer to my question about what you would do with the docs? |
Sure I think referring to them as NA dtypes is clearer than Nullable dtypes |
have come around here, ok with |
Do we consider this a blocker for 1.1? If so, anyone want to work on it? |
i think we merge the current proposal |
Anyone (@Dr-Irv, @jorisvandenbossche) able to work on this? This seems worth doing for 1.1 if it only takes a few days. |
@TomAugspurger For me, it won't take just a few days, because I think the changes should be made at a pretty low level in the readers, and I'd have to figure out how that code works. The easy solution is to use the new conversion method after reading the data in. |
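A sketch of that interim approach (file name is a placeholder): read with the default NumPy-backed dtypes, then convert in one step. Note that integer columns containing missing values still round-trip through float64 this way, which is exactly what native reader support would avoid.

```python
import pandas as pd

df = pd.read_csv("data.csv").convert_dtypes()
print(df.dtypes)  # Int64 / string / boolean where the values allow it
```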
I was sent here from issue #35576. (Although #29752 (comment) is maybe suggesting that this does need to be a separate discussion?) Personally, the thing I care most about "turning off" in this upcoming edition of "Modern Pandas" is the fallback to the object dtype. (Because it makes things orders of magnitude slower, and it's pretty easy for it to happen "for you" behind the scenes.) Yes, some of that is caused by needing … As a very concrete use case, I would want a way for … |
@chrish42 Look at this comment: #29752 (comment) The definition of a "nullable dtype" is "a pandas dtype supporting pd.NA". That includes the string dtype. |
@Dr-Irv Cool, thank you. That wasn't immediately clear to me. At least, unlike other times, there's no reason why object couldn't support pd.NA. Anyways, really looking forward to the day when automatic conversions to object (because strings) and to float (because NA) are a thing of the past. So thank you all! |
I'd recommend against it. Algorithms need to be written to explicitly handle NA since it's so unusual. |
see #32931 for dedicated issue |
@chrish42 if one is only interested in getting the new string dtype, … But, we certainly want a keyword to opt in to all nullable dtypes, so eg also nullable int and nullable bool to avoid casting to float etc. And then the name makes more sense. Adding yet another keyword for just getting … |
I think this is done now |
Original issue description:
With the new dtypes (IntegerArray, StringArray, etc.), if you want to use them when reading in data, you have to specify the types for all of the columns. It would be nice to have the option to use the new dtypes for all columns as a keyword to read_csv(), read_excel(), etc. (ref. discussion in pandas dev meeting on 11/20/19)
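For illustration, the difference between the current per-column opt-in and the requested whole-read opt-in would look roughly like this (the keyword name use_nullable_dtypes is shown purely as an example of the kind of option being requested; its final spelling is what the discussion above debates):

```python
import pandas as pd

# Today: opt in column by column
df = pd.read_csv("data.csv", dtype={"a": "Int64", "b": "string", "c": "boolean"})

# Requested: a single keyword that applies the new dtypes to every column
df = pd.read_csv("data.csv", use_nullable_dtypes=True)
```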