API: distinguish NA vs NaN in floating dtypes #32265
How do other tools / languages deal with this? Julia has both as separate concepts:

julia> arr = [1.0, missing, NaN]
3-element Array{Union{Missing, Float64},1}:
1.0
missing
NaN
julia> ismissing.(arr)
3-element BitArray{1}:
false
true
false
julia> isnan.(arr)
3-element Array{Union{Missing, Bool},1}:
false
missing
true

R also has both, but treats NaN as missing in is.na():

> v <- c(1.0, NA, NaN)
> v
[1] 1 NA NaN
> is.na(v)
[1] FALSE TRUE TRUE
> is.nan(v)
[1] FALSE FALSE TRUE

Here, the "skipna" equivalent (na.rm=TRUE) skips both NA and NaN:

> sum(v)
[1] NA
> sum(v, na.rm=TRUE)
[1] 1

Apache Arrow also has both (NaN can be a float value, while missing values are tracked in a separate mask). It doesn't yet have many computational tools. I think SQL also has both, but I didn't yet check in more detail how it handles NaN in missing-like operations.
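For comparison, current pandas (with the default numpy float64 dtype) treats NaN the same way R treats NA in reductions: it is skipped unless skipna=False. An illustrative sketch, not taken from the thread:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1.0, np.nan])
>>> s.sum()            # NaN is skipped by default (skipna=True)
1.0
>>> s.sum(skipna=False)
nan
>>> s.isna()
0    False
1     True
dtype: bool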
I still don't know the semantics of pd.NA in detail. So ideally pd.NA and np.nan should be the same to users. If, as I understand, this is not possible given how pd.NA was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.
Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it were a missing value. In this and many other cases, missing observations are the result of float operations.
Agreed.
My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).
agree with Tom here
I think R even goes too far, as this introduces enormous mental complexity; now I have 2 missing values? Sure, for the advanced user this might be ok, but most don't care, and this adds to the development burden.
That said, if we could support both np.nan and pd.NA with limited complexity -
as propagating values (and both fillable), IOW they are basically the same except that we do preserve the fact that an np.nan can arise from a mathematical operation -
then I would be on board.
… On Feb 26, 2020, at 6:31 AM, Tom Augspurger ***@***.***> wrote:
Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me.
Agreed.
do we want both pd.NA and np.nan in a float dtype and have them signify different things?
My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).
On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation I did) produced a NaN. I think we should support both.
We currently also have this inconsistent (IMHO) behavior which relates to (2) above:

>>> s=pd.Series([1,2,pd.NA], dtype="Int64")
>>> s
0 1
1 2
2 <NA>
dtype: Int64
>>> s.to_numpy()
array([1, 2, <NA>], dtype=object)
>>> s
0 1
1 2
2 <NA>
dtype: Int64
>>> s.astype(float).to_numpy()
array([ 1., 2., nan])
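For reference, an explicit conversion that avoids the object array already exists; a sketch using the documented na_value argument of Series.to_numpy (assuming numpy and pandas imported as np and pd):

>>> s = pd.Series([1, 2, pd.NA], dtype="Int64")
>>> s.to_numpy(dtype="float64", na_value=np.nan)
array([ 1.,  2., nan])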
Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating point number. Notice I'm not claiming the two concepts are the same, just that there is no clear-cut distinction, let alone a natural one for users.
Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna keyword).
When I said "such a calculation could indicate something wrong in the data that you need to identify and fix", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that was not supposed to happen. There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN.
Yes, but ...
My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.
This is true. But... are there new usability insights compared to those we had back in 2017?
That's why I think having ...
+1 for consistency with other computational tools. On the subject of automatic conversion into NumPy arrays, returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.
I think using NA even for missing floats makes a lot of sense. In my opinion the same argument that NaN is semantically misleading for missing strings applies equally well to numeric data types. It also seems that trying to support both NaN and NA might be too complex and could be a significant source of confusion (I would think warnings / errors are the way to deal with bad computations rather than a special value indicating "you shouldn't have done this"). And if we're being pedantic, NaN doesn't tell you whether you're dealing with 0 / 0 or log(-1), so it's technically still NA. :)
I propose that from now on we use a branch of ...
Thanks all for the discussion!
I think there is, or at least, we now have one: for the new pd.NA, we decided that it propagates in comparisons, while np.nan gives False in comparisons (based on numpy behaviour, based on floating spec). Whether this is "natural" I don't know, but I think it is somewhat logical to do.
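A minimal scalar illustration of that difference (a sketch, assuming pandas >= 1.0 with numpy and pandas imported as np and pd):

>>> np.nan == 1.0      # NaN compares False (IEEE 754 / numpy behaviour)
False
>>> pd.NA == 1.0       # NA propagates in comparisons
<NA>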
Note that it's not necessarily "2 missing values", but rather a "missing value" and a "not a number". Of course, current users are used to seeing NaN as a missing value. For them, there will of course be some initial confusion in no longer seeing NaN as a missing value. And this is certainly an aspect not to underestimate.
There is already one for infinity (which is actually very similar to NaN, see more below):
Yes, I agree it would be nice to follow numpy for those cases that numpy handles (which is things that result in NaN, like 0/0). Having different behaviour for pd.NA is fine I think (like the different propagation in comparison ops), since numpy doesn't have that concept (so we can't really "deviate" from numpy). From talking with @TomAugspurger and looking at examples, I somewhat convinced myself that making the distinction makes sense (not sure if it convinced @TomAugspurger also, though ;), and there are still a lot of practical concerns).

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")
>>> s
0 NaN
1 inf
2 NaN
dtype: float64
>>> s.isna()
0 True
1 False
2 True
dtype: bool

The above is the current behaviour (where the original NA from the Int64 dtype also gives NaN in float, but with a potential new float dtype, the third value would be NA instead of NaN). Based on that, I think the following (hypothetical) behaviour actually makes sense:

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")
>>> s
0 NaN
1 inf
2 <NA>
dtype: float64
>>> s.isna()
0 False
1 False
2 True
dtype: bool

As long as we ensure that, when creating a new "nullable float" series, missing values (NA) are used and not NaN (unless the user explicitly asks for that), I think most users won't often run into having a NaN, or not much more often than Inf (which already has the "non-missing" behaviour).
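For what it's worth, the nullable Float64 dtype that was later added does convert NaN to NA at construction time. A sketch of that behaviour (later pandas versions, not part of the discussion at the time; assumes numpy/pandas imported as np/pd):

>>> pd.array([0.1, np.nan, None], dtype="Float64")
<FloatingArray>
[0.1, <NA>, <NA>]
Length: 3, dtype: Float64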
@shoyer I agree the object dtype is a poor user experience. I think we opted (for now) for object dtype since this is the most "conservative" option: it at least "preserves the information", albeit in a mostly useless way, so it's up to the user to decide how to convert it properly. Another hard topic, in case we no longer see np.nan as missing in a new nullable float dtype, will be how to treat NaNs in numpy arrays.
My opinion is that the new pd.NA behaves in this respect in a more "natural" way than the floating point spec - at least in a context in which users work with several different dtypes. Hence I respect the decision to deviate; I just would limit the deviation as much as possible. To be honest (but that's maybe another discussion, and I didn't think much about the consequences), I would be tempted to completely eliminate np.nan from floats (replacing it with pd.NA), to solve this discrepancy (even at the cost of deviating from numpy).
Actually, your example reinforces my opinion on not making the distinction (where possible).
1 / 0 gives inf: this clearly suggests a limit (approaching 0 from the right); -1 / 0 gives -inf: same story; 0 / 0 gives NaN. Why? Clearly because depending on how you converge to 0 in the numerator, you could have 0, inf, or any finite number. So this NaN really talks about missing information, not about some "magic" or "unrepresentable" floating point number. The same holds for similar operations. What I mean is: in the floating point spec, NaN already denotes two different cases: missing information, and an "unfeasible [within real numbers]" operation (together with any combination of those - in particular when you propagate NaNs). I know we all have in mind the distinction "I find missing values in my data" vs. "I produce missing values while handling the data". But this is a dangerous distinction to shape an API around, because the data which is an input for someone was an output of someone else. Are we really saying that if I do 0/0 it is "not a number", while if my data provider does exactly the same thing before passing me the data it is "missing data" that should behave differently?! What if at some step of my data pipeline... I am my data provider? Should we make np.NaN persist as pd.NA any time we save data to disk?!
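The IEEE / numpy behaviour being referred to, as a quick sketch (not from the thread; errstate only silences the runtime warnings):

>>> import numpy as np
>>> with np.errstate(divide="ignore", invalid="ignore"):
...     np.array([1.0, -1.0, 0.0]) / 0.0
...
array([ inf, -inf,  nan])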
Sorry @toobaz, I don't understand your reasoning here (or I just disagree, that's also possible). 0 and 0 are clearly both non-missing values in your data, so for me this is "clearly" not a case of missing information, but rather an unrepresentable floating point number. 0 and 0 can both be perfectly valid values in both series; it's only their combination and the specific operation that makes them invalid. Also, you then say that ...
That's indeed a problem if this is roundtripping through numpy, in which case we can't make the distinction (eg if you receive the data from someone else as a numpy array).
Why do 1/0 and 0/0 - both of which, strictly speaking, have no answer (even outside the reals) - lead to different results? The only explanation I can see is that you can imagine 1/0 as a limit tending to infinity, while in the case of 0/0 you really have no clue. That "no clue" for me means "missing information", not "error". If the problem were an "arithmetic error", you'd have "1/0 = error" as well. Now, I'm not saying I can read the mind of whoever wrote the standard, or that I particularly agree with this choice, but this really reminds me (together with my example above about monthly averages, which is maybe more practical) that the difference between "missing" and "invalid" is very subtle, so much so that our intuition about what is missing or not already differs from the one embodied in the IEEE standard.
... I'm taking this as something we would have to consider if we distinguish the two concepts. And since it's anything but obvious (to me at least), I consider this an argument for not distinguishing the two concepts.
I was actually not making a point of "we are constrained by the implementation", but really of "what should we conceptually do?". Do we want np.NaN as different from pd.NA because it helps us identify code errors we might want to solve? OK, then once we give up fixing the error in code (for instance because the 0/0 legitimately comes from an average over no observations) we should replace it with pd.NA. Creating np.NaN might be perfectly fine, but distributing it (in pd.NA-aware formats) would be akin to a programming mistake. We are really talking about the result of elementary operations which would (have to) become very context-dependent. Anyway, if my arguments so far are not seen as convincing, I propose another approach: let us try to define which pandas operations currently producing np.NaN should start to produce pd.NA if we wanted to distinguish the two. For instance: if ... What should ... But these are really the same mathematical operation. OK, so maybe we would solve the inconsistency if ...
I think what @toobaz is saying is that 0 / 0 truly is indeterminate (if we think of it as the solution to 0x = 0, then it's essentially any number, which isn't too different from the meaning of NA). The log(-1) case is maybe less obvious, but I think you could still defend the choice to represent this as NA (assuming you're not raising an error or using complex numbers) by saying that you're returning "no answer" to these types of queries (and that way keep the meaning as missing data).

I guess I'm still unsure what would be the actual utility of having another value to represent "bad data" when you already have NA for null values? If you're expecting to see a number and don't (because you've taken 0 / 0 for example), how much more helpful is it to see NaN instead of NA? To me this doesn't seem worth the potential confusion of always having to code around two null values (it's not even obvious if we should treat NaN as missing under this new interpretation; if the answer is no, then do we now have to check for two things in places where otherwise we would just ask if something is NA?), and having to remember that they each behave differently. Using only NA would also seemingly make it easier to translate from numpy to pandas (np.nan is always pd.NA, rather than sometimes pd.NA and other times np.nan depending on context).

(A bit of a tangent from this thread, but reading about infinity above made me wonder if this could also be a useful value to have in other non-float dtypes, for instance infinite Int64 or Datetime values?)
I am coming around to the idea that distinguishing between NaN and NA may not be worth the trouble. I think it would be pretty reasonable to both:
I totally agree with @shoyer's proposal. It would be nice to leave a way for users to force keeping np.NaNs as such (in order to keep the old comparison semantics, and maybe even to avoid the conversion performance hit?), but it might be far from trivial, and hence not worth the effort.
I'm probably fine with transparently converting NA to NaN in asarray for float dtypes. I'm less sure for integer, since that goes against our general rule of not being lossy.
I agree. Without pd.NA, pandas users sooner or later were going to get accustomed to ints with missing values magically becoming floats, but that won't be true any more. (Ideally, we would want a numpy masked array, but I guess ...
I agree it is not obvious what is fundamentally "best". But, if we don't have good arguments either way, that could also be a reason to just follow the standard and what numpy does.
In theory, I think there can be a clear cut: we could produce NaN whenever an operation with numpy produces a NaN, and produce NA whenever it is a pandas concept (such as alignment) that introduces the missing value.
OK, that I understand!
Apart from the (possible) utility for users to be able to represent both (which is of course a trade-off with the added complexity for users of having both), there are also other clear advantages of having both NaN and NA, I think:
This last item of course gets us into the implementation question (which I actually wanted to avoid initially). But assuming we go with:
would people still use NaN as a sentinel for NA, or use a mask and ensure all NaN values in the values are also marked in the mask?
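To make that implementation question concrete, a small sketch in plain numpy (not pandas internals; the variable names are illustrative only) of the "NaN stays an ordinary value, NA lives only in the mask" layout:

import numpy as np

values = np.array([1.0, np.nan, 3.0])    # np.nan is a real, unmasked value
mask = np.array([False, False, True])    # only the third element is NA

is_na = mask.copy()                      # NA is determined by the mask alone
is_nan = np.isnan(values) & ~mask        # NaN is a value-level property, excluding NA

print(is_na)   # [False False  True]
print(is_nan)  # [False  True False]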
I agree that for the conversion to numpy (or at least ...
FWIW, @kkraus14 noted that cuDF supports both NaN and NA in their float columns, and their users are generally happy with the ability to have both.
I had a discussion with @jorisvandenbossche today and I would really like to have more discussion on Option 3 above, but with a small twist that I'll call Option 4.

Option 4: distinguish NaN and NA, don't treat NaN as missing (by default) except upon conversion, and add helper methods

We distinguish NaN and NA everywhere (also in methods that deal with missing values, where ONLY NA is considered missing). In the above examples:
We add helper functions:
When converting from a float array that has ... The ONLY way that ... When someone has a ... Users who are using ...
I don't have strong feelings about how pandas should handle NaN, but I would note that NaN is a floating point number thing, not a NumPy thing. So if pandas has NaN-specific APIs, they shouldn't refer to something like "npnan".
Hey @Dr-Irv - just to check, is there any difference between what you're suggesting and the status quo? (other than the extra helper methods)
Interesting. I just did some tests, and almost everything is there. Except this:
E.g.:

>>> s=pd.Series([0,1,pd.NA], dtype="Int64")
>>> t=s/s
>>> t
0 NaN
1 1.0
2 <NA>
dtype: Float64
>>> t.to_numpy()
array([nan, 1.0, <NA>], dtype=object)
Also, pyarrow dtypes (float/double) handle nan differently than the numpy dtypes:
so NaN is NOT considered a na/null. But when I do the same with numpy floats it IS considered to be na/null.
I guess this might also be a bug?
Hello, I am still interested in this issue and see a maintainer @jbrockmendel added the label to it and several similar ones for "Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint". But that plan isn't mentioned (by this excellent name or any other) elsewhere in the repo - could someone briefly state what was agreed to?
#56836 was just closed with a reference to this issue. Pandas is inconsistently mixing up NaN and NA entries in a column of a data type that was specifically designed to add nullability over the still existing non-nullable float data types. That's an obvious bug from the user perspective, and nothing is being done about it because of a four-year discussion that's going nowhere.

People that don't want to make a distinction between "invalid operation" (NaN) and "missing data" (NA) can use non-nullable floats; people that do can use nullable floats. Wasn't that the whole point of nullability?

Legacy concerns put aside, does anybody disagree with this? (@TomAugspurger @toobaz @MarcoGorelli I think you made comments before that expressed disagreement) If not, could we reopen #56836 and consider it the first step of the de-legacy-ation plan here?

(In fact, I have the feeling this ticket here was originally about whether NaN and NA should be allowed to coexist, but now everybody seems to have come around to that, so the ticket has turned more into a discussion about steps towards allowing it and making the rest of the system sane, like changing ...)

As an outsider, this discussion looks like pandas has been stuck for four years just because of an accidental prefix match between two unrelated identifiers.
@soerenwolfers in the proposals listed at #32265 (comment) and #32265 (comment), we are proposing to convert ...
It's more that when pandas was created 15+ years ago, it was based on numpy.
@Dr-Irv Great, thanks for the clarification. FWIW, my vote is on Option 2. IMO Options 3 and 4 might be useful transition steps, but would be a serious reason not to use pandas if kept.
While this discussion is still stuck (or ongoing if you like), there are some things that could be done regardless. For example, #55787 can be re-opened and addressed: whether or not ...
@avm19 depending on which way the issue here is resolved eventually, and depending on which way you propose the functions in the issue you linked should be equalized, the changes that you propose might have to be rolled back later. I guess that's something that the developers want to avoid. The only thing that's worse than internal inconsistency is unnecessary inconsistency between versions.
Another unexpected behavior is that the ... This is very inconsistent.
After reviewing this discussion here and in PDEP-16, I gather the general consensus is that there is value in distinguishing these, but there is a lot of concern around the implementation (and rightfully so, given the history of pandas).

With that being the case, maybe we can just concretely start by adding the ... Right now the pd.FloatXXDtype() data types become practically unusable the moment a ...

I think starting with a smaller scope of just those few methods is helpful; trying to solve constructors is going to open a can of worms that will deter any progress, and I think it more generally should be solved through PDEP-16 anyway (or maybe the follow-up to it that focuses on helping to better distinguish these values).
I upvote starting with something that can be improved short-term vs needing to first reach consensus on a new holistic design.
Considering the fact |
For sure, but our history does make things complicated. Unfortunately, for over a decade pandas users have been commonly doing:

ser.iloc[0] = np.nan

to assign what they think is a "missing value". So we can't just immediately change that to literally mean NaN. There is a larger discussion on that point in PDEP-0016 that you might want to chime in on and follow.
Context: in the original pd.NA proposal (#28095) the topic of pd.NA vs np.nan was raised several times. And also in the recent pandas-dev mailing list discussion on pandas 2.0 it came up (both in the context of np.nan for float and pd.NaT for datetime-like).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

1. Keep np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
2. Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.

For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but we convert this to pd.NA towards the user, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing" or should that be optional? What to do on conversion from/to numpy? (And the answer to some of those questions will also determine which of the two possible implementations is preferable.)

Actual discussion items: assume we are going to add floating dtypes that use pd.NA as the missing value indicator. Then the following question comes up: do we want both np.nan and pd.NA in a float dtype, and have them signify different things? It is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as a "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in #28095 (one view vs. the other): I think those two describe nicely the two options we have on the question of whether we want both pd.NA and np.nan in a float dtype with different meanings -> 1) Yes, we can have both, versus 2) No, towards the user, we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in #28095 (comment): there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0)?

A dummy example showing how both can occur:

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna / notna / dropna / fillna? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (those that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, or at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for both NaN in the values and NA in the mask (which can also have performance implications).

Some other various considerations:

- Having both pd.NA and NaN (np.nan) might actually be more confusing for users.
- If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array).
- How do we handle compatibility with numpy? The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have a to_numpy(.., na_value=np.nan) explicit conversion. But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying. For conversion to numpy, see also some relevant discussion in "API: how to handle NA in conversion to numpy arrays" #30038.
- What about conversion / inference on input? E.g. creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))): do we convert NaNs to NA automatically by default?

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton
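The "dummy example" referenced above did not survive extraction. A minimal sketch of the kind of example being described, reconstructed from the snippet posted earlier in this thread (behaviour of pandas at the time of the discussion): the NaN comes from the 0/0 computation, while the NA propagates from the input data.

>>> import pandas as pd
>>> pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")
0    NaN
1    inf
2    NaN
dtype: float64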