ENH: Ensure that rolling_var of identical values is exactly zero #8271
Conversation
looks fine cc @seth-p ?
@jaimefrio you are checking if the most recent non-nan is repeated. is this TOO restrictive?
you are making the code more complex, without adding any practical usage. var is by definition a floating point number. if your code depends on floating point numbers being exactly equal, then there is something wrong somewhere else. should there also be a check for when var is exactly 1.0? or 2.0?
Not sure I understand the question: do you mean the test or the code? The only thing we are checking for in the code is whether all non-NaN observations in the window are identical. The test is designed to exercise several of the code paths that can lead to explicitly setting the variance to exactly zero.
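A minimal Python sketch of that check, assuming nothing about the actual Cython implementation (names and structure here are illustrative only):

```python
import numpy as np

def rolling_var_sketch(values, window, min_periods=2):
    """Naive rolling variance that short-circuits to exactly 0.0 when all
    non-NaN observations in the current window are identical."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for i in range(len(values)):
        win = values[max(0, i - window + 1):i + 1]
        obs = win[~np.isnan(win)]
        if len(obs) < max(min_periods, 2):
            continue
        if np.all(obs == obs[0]):
            out[i] = 0.0              # identical observations: variance is exactly zero
        else:
            out[i] = obs.var(ddof=1)  # otherwise, the usual sample variance
    return out
```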
if this patch goes in, i will submit mine for the case that var is exactly 1.0.
@jaimefrio makes sense. @behzadnouri not sure of your point. This has to do with repeated observations, NOT a special-case check on the actual values.
@behzadnouri You may want to see the discussion in #7900 for some context. Not all floating point numbers are created equal, but even if they were, I'd argue that an exact zero is a special case worth detecting. The practical use of this is related to the rolling variance appearing as a factor in the denominator of several other expressions, namely the rolling correlation. Exactly detecting zero values in the denominator (and numerator) of those expressions is the only way of figuring out the right value for them, instead of returning a meaningless result by dividing by a value that is arbitrarily close to 0.
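To illustrate the denominator point with made-up numbers (a rolling correlation is essentially cov / (std_x * std_y), so everything hinges on whether a flat window produces a denominator of exactly 0.0 or merely something on the order of 1e-16):

```python
import numpy as np

cov = 3e-17        # numerator that is "really" zero, up to rounding noise
exact_den = 0.0    # denominator from a variance detected as exactly zero
noisy_den = 4e-17  # the same quantity, contaminated by rounding error

# An exact zero lets us recognise the degenerate window and answer NaN:
corr_exact = np.nan if exact_den == 0.0 else cov / exact_den

# A near-zero denominator silently yields a meaningless "correlation":
corr_noisy = cov / noisy_den   # 0.75, which is pure rounding noise

print(corr_exact, corr_noisy)
```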
@jaimefrio see this SO question. What is the variance of this array: @jreback master branch already returns correct results for the added test, as far as floating point arithmetic is understood.
Overall I think the code looks good, but I have a few comments/questions:
In order to pass this test, I think we need to implement
@behzadnouri, I think you're letting the perfect be the enemy of the good. No, this PR does not solve all floating-point inaccuracies. But there are certain basic identities that one wants to hold (a couple are sketched below). I'm in favor of including this PR, though as I mentioned above, I think it would be even better to re-implement it.
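For concreteness, the kind of identities in question look roughly like the following; this sketch uses the current .rolling() API rather than the pd.rolling_* functions discussed in this thread, and the series is made up:

```python
import pandas as pd

s = pd.Series([1234.5] * 10)        # a constant series
v = s.rolling(window=5).var()

assert (v.dropna() == 0.0).all()    # variance of identical values is exactly 0
assert (v.dropna() >= 0.0).all()    # and never negative, so a rolling std never takes sqrt of a negative
```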
is there a case where this DOES fail on current master? iow 'proves' (for the near-0 case) that this blows up because of numerical inaccuracies?
Though I haven't checked, I presume the problem described in #7900 (comment) still remains in master until this PR is included.
So yes, it makes plenty of sense to implement a proper version. I can also confirm that the commented test is failing, although the check may be a little too strict right now; here's a debugger session after the failure:
Re 2. That's fine. Just wanted to confirm, as it wasn't 100% obvious at first glance. Re 3. This is what I did for
Re 3 continued. I don't think the test is too strict. I believe that
I may take a look at implementing
A question that may be relevant to the resulting performance: I think the relevant case to consider is something like the sketch below:
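Presumably a case along these lines, where a run of identical values is preceded by different ones, so that an incremental add/remove update has to cancel back to exactly zero (the series here is made up):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.concatenate([np.random.randn(5), np.repeat(3.14159, 10)]))

# Once the window holds only the repeated value, the result should be exactly 0.0,
# even though different values passed through the same running accumulators earlier.
print(s.rolling(window=5).var().tail())
```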
If we remove the guarantee of exactly zero, I am not 100% sure that in a case like this rounding errors may not lead to a non-zero result.
Yes, I think so.
Add a check to rolling_var for repeated observations, in order to produce an exactly zero value of the variance when all entries are identical. Related to the discussion in pandas-dev#7900
I have made the small style modification that @seth-p suggested and made another mostly style-related modification to the test. I think this is now ready to go.
Looks fine to me.
I have managed to put together a working Cython implementation, and as a small teaser I have also done some timings for large-ish arrays.
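A rough pure-Python sketch of the kind of single-pass, add/remove (Welford-style) update such a routine could be built around; this is an assumption about the approach rather than the actual Cython code, and it skips NaN handling and min_periods:

```python
def rolling_var_online(values, window):
    """Rolling sample variance via an online add/remove update.

    Sketch only: assumes window >= 2, no NaNs, no min_periods handling."""
    out = [float("nan")] * len(values)
    nobs = 0
    mean = 0.0
    ssqdm = 0.0                      # sum of squared deviations from the mean
    for i, x in enumerate(values):
        if i >= window:              # drop the observation leaving the window
            old = values[i - window]
            nobs -= 1
            delta = old - mean
            mean -= delta / nobs
            ssqdm -= delta * (old - mean)
        nobs += 1                    # fold in the incoming observation
        delta = x - mean
        mean += delta / nobs
        ssqdm += delta * (x - mean)
        if nobs >= 2:
            out[i] = max(ssqdm, 0.0) / (nobs - 1)   # clip tiny negative rounding residue
    return out
```

The appeal of a formulation like this is that everything happens in one pass over the data, so extra bookkeeping (such as tracking repeated observations or resetting the accumulators) stays cheap.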
@seth-p @jaimefrio instead of comforting each other, please do some reading. You are making the code more complex and less efficient without any practical use because of your lack of understanding of floating point arithmetic.
@behzadnouri, let's try to be constructive and not let the conversation degenerate into ad-hominem attacks. What you seem not to acknowledge is that the difference between a variance of 0 and a variance of 10^-16 is materially/qualitatively different from the difference between a variance of 1 and a variance of 1 + 10^-16. If you do not agree, I'm happy to explain why I believe this is so. If you do agree with my previous statement, perhaps you still think that properly handling the variance = 0 case does not merit extra tests making the code slightly more complex. If this is the case, then I think we simply have an honest disagreement. If you disagree with my belief that there's something special about a variance of 0 (whether or not it merits extra code), perhaps I can take you up on your offer to submit a PR "for the case that var is exactly 1.0." :-)
Pls keep this discussion civil; both @seth-p and @jaimefrio are smart guys. You are welcome to refute and criticize arguments and have honest disagreements, but no one wants personal attacks. That said, I would encourage you to provide a counter-argument to @seth-p's points. Python and pandas strive to be practical, efficient, simple, to have a nice API, and to give answers that are correct to the nth decimal place. Not all of these are possible at the same time. We take different approaches to numerical stability, e.g. #6817 (for efficiency), and here #8002 to provide numerically stable values. I think it is most practical to provide the user an exactly-0 value when it is so close to 0 that the difference is immaterial. For 99.9% of use cases that will suffice. I am open to including a precision keyword to alleviate the cases where it doesn't.
@jreback you are adding corner cases to the code. this will make it less maintainable, less efficient and more complex, and I still don't see any practical use. for the record: given that I had provided this example above:
but since we are still talking about a variance of 0 versus 10^-16, i will not comment on this pr further.
@behzadnouri The problem we are trying to solve is the following, which I have just run with current master:
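A sketch of the kind of comparison presumably meant, written against the current .rolling() API rather than the pd.rolling_var of the time (the exact series and output here are not the original ones):

```python
import pandas as pd

base = pd.Series([1.0, 2.0, 3.0, 5.0, 5.0, 5.0])
shifted = base + 1e9               # the same data, shifted by a large constant

v1 = base.rolling(window=3).var()
v2 = shifted.rolling(window=3).var()

# Mathematically both last values are exactly 0.  With a naive sum-of-squares
# recurrence, the shifted series suffers catastrophic cancellation and its
# "zero" can come out as a tiny positive or negative number, which then turns
# into NaN as soon as a rolling std takes the square root.
print(v1.iloc[-1], v2.iloc[-1])
```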
I hope we agree that the last value should be "the same" in both cases. With floating point, there may be more than one valid definition of what "the same" exactly means, but I am not aware of any in which zero is the same as NaN! This huge instability in the computed values happens because the "true" result we are after is exactly zero.
@jaimefrio this is fine. can you add a release note in v0.15.0.txt (api section), referencing this PR number (as there is no issue I believe). Also, do we need a note / warning in the docs and/or doc-string (maybe a note here?)
I have mixed feelings about this... Although he had trouble presenting it, I kind of resonate with @behzadnouri's complaints about this approach. But regardless of philosophical principles, my vote is for closing this and making #8326 work instead. It is going to require some effort, and I would really like to hear @jreback's and @seth-p's thoughts on this whole mess.
closing this in favor of fixing up with #8326
I don't think