BUG: Inconsistencies in calculating second moments of a single value #7900
Comments
A couple questions:
see the `rolling_var` stuff, which 'fixes' the numerical issues part by using Welford's algorithm
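For readers unfamiliar with it, here is a minimal sketch of Welford's online update for the variance. It is not the pandas implementation; the function name `welford_var` and the `ddof` parameter are chosen here for illustration only. The point is that the running sum of squared deviations is updated relative to the current mean, so a large common offset in the data does not swamp the result.

```python
import math


def welford_var(values, ddof=1):
    """Variance via Welford's online update (illustrative, not pandas code)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        # Uses the deviation from the old mean and from the new mean.
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        # Too few observations for the requested degrees of freedom.
        return math.nan
    return m2 / (count - ddof)


if __name__ == "__main__":
    print(welford_var([5.0, 5.0, 5.0]))
    print(welford_var([1e9 + 1, 1e9 + 2, 1e9 + 3]))
```

Note that this is the expanding (append-only) form; a rolling version also needs an update for values leaving the window.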
Re: 1). Should the answer depend on whether
Re: 2), I added an example to the end of the initial comment. (Sorry, I guess I should have added it as a separate comment.)
Re: rounding near 0,
It doesn't deal with tiny positive errors.
Here's an example showing that
The final value should be identically 0.0. This instability plays out in inconsistent results for the correlation of a constant series with itself:
In both cases the final value should be the correlation of [5., 5., 5.] with itself, yet one produces
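The kind of instability described in this comment can be sketched in plain Python. This is an illustration under assumptions, not the pandas code: it assumes a rolling variance maintained through incrementally updated running sums (`rolling_var_running_sums`, a hypothetical name) and compares it with a direct per-window recomputation (`rolling_var_direct`, also hypothetical). Because the running sums carry rounding residue from values that have already left the window, the result for a constant window can depend on what came before it.

```python
def rolling_var_running_sums(values, window):
    """Rolling unbiased variance from incrementally updated running sums."""
    out = []
    sum_x = 0.0
    sum_xx = 0.0
    for i, x in enumerate(values):
        sum_x += x
        sum_xx += x * x
        if i >= window:
            # Remove the value that just left the window.
            old = values[i - window]
            sum_x -= old
            sum_xx -= old * old
        n = min(i + 1, window)
        if n < 2:
            out.append(float("nan"))
        else:
            out.append((sum_xx - sum_x * sum_x / n) / (n - 1))
    return out


def rolling_var_direct(values, window):
    """Rolling unbiased variance recomputed from scratch for each window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        n = len(chunk)
        if n < 2:
            out.append(float("nan"))
            continue
        mean = sum(chunk) / n
        out.append(sum((v - mean) ** 2 for v in chunk) / (n - 1))
    return out


if __name__ == "__main__":
    # Same final window [5., 5., 5.], different prior values.
    for prior in ([0.1, 0.2], [1e6 + 0.3, 1e6 + 0.7]):
        series = prior + [5.0, 5.0, 5.0]
        print(rolling_var_running_sums(series, 3)[-1],
              rolling_var_direct(series, 3)[-1])
```

The direct recomputation always yields exactly 0.0 for the final window; the running-sums value is only guaranteed to be close to 0, and may come out as a tiny positive or negative number.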
CC'ing @snth, @jaimefrio, and @kdiether, who appear to have worked on this/related code.
The check for
I don't think that rounding down to 0 very small values of
The other thing I can commit to doing is implementing a variant of Welford's method for
Sounds good. I think that if you add an explicit check for constant series, then there's no need to round small results to 0.
Yeah, I have no strong opinion about whether the correlation of a constant series with itself should be
Regarding the general inconsistencies between
I will go through the list of functions above and see how the behavior accords with this principle.
Added logic to `rolling_var` to detect windows where all non-NaN values are identical. Need to assess both correctness and performance impact.
After further review (and once #7912 is addressed), I think that all biased (
@jaimefrio, I'm going to include in #7926 a simple fix for the
The How will users actually get to specify this parameter? Right now it is hardcoded to unbiased estimation, I have the fix for I have lost track of all the changes you are trying to pull together, but just so you know, I silently approve of your efforts from the distance... ;) |
Yeah, sorry, I've been encountering lots of little inconsistencies and edge cases that I've been trying to clean up. Some of the stuff I did is already in master (#7603, #7738, #7766, #7896, #7898 (which is obsolete in view of #7977), and #7934); #8059 is ready to be merged; and #7926 should be ready soon -- it has a lot of new consistency checks for the
As for
OK, I think I'm pretty much done w/ #7926, though I still need to rebase it after #8059 is merged into master. Note the
You'll see that several tests are commented out with comments of the form
@jaimefrio, note in particular that I commented out the test for the correlation of a series with itself being identically
@jaimefrio, note also that the tests call
These tests are rather slow, and can probably be trimmed a bit (i.e. fewer distinct
OK, #8059 has been merged into master, and I've updated #7926, which I think is now ready to be merged. @jaimefrio, you may find #7926's
Add a check to rolling_var for repeated observations, in order to produce an exactly zero value of the variance when all entries are identical. Related to the discussion in pandas-dev#7900
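A rough sketch of what such a check could look like, in plain Python rather than the actual Cython code: the function name `rolling_var_with_constant_check` and its bookkeeping variables are assumptions made for illustration. It tracks the length of the current run of identical non-NaN observations; when that run covers every non-NaN observation in the window, it returns exactly 0.0 instead of trusting the running sums.

```python
import math


def rolling_var_with_constant_check(values, window, ddof=1):
    """Rolling variance that returns exactly 0.0 for constant windows."""
    out = []
    sum_x = 0.0
    sum_xx = 0.0
    nobs = 0             # non-NaN observations currently in the window
    run_value = math.nan
    run_length = 0       # trailing run of identical non-NaN observations
    for i, x in enumerate(values):
        if i >= window:
            old = values[i - window]
            if not math.isnan(old):
                sum_x -= old
                sum_xx -= old * old
                nobs -= 1
        if not math.isnan(x):
            sum_x += x
            sum_xx += x * x
            nobs += 1
            if x == run_value:
                run_length += 1
            else:
                run_value = x
                run_length = 1
        if nobs - ddof <= 0:
            out.append(math.nan)
        elif run_length >= nobs:
            # Every non-NaN value in the window is identical.
            out.append(0.0)
        else:
            out.append((sum_xx - sum_x * sum_x / nobs) / (nobs - ddof))
    return out


if __name__ == "__main__":
    print(rolling_var_with_constant_check([0.1, 0.2, 5.0, 5.0, 5.0], 3))
```

The extra cost is one comparison and a counter update per observation, which is the performance impact that would need to be measured.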
I noticed (in #7884) that `ewmvar`, `ewmstd`, `ewmvol`, `ewmcov`, `rolling_var`, and `rolling_std` return `0.0` for a single value (assuming `min_periods=0`); whereas `Series.std`, `Series.var`, `ewmcorr`, `expanding_cov`, `expanding_corr`, `rolling_cov`, and `rolling_corr` all return `NaN` for a single value. `expanding_std` and `expanding_var` produce `Value Error: min_periods (2) must be <= window (1)`.

I think all of these should all return `NaN` for a single value. At any rate, I would expect greater consistency one way or the other.

Mildly related, when calculating the correlation of a constant series with itself, `Series.corr()`, `expanding_corr`, and `rolling_corr` return `NaN`, while `ewmcorr` sometimes returns `NaN`, sometimes `1` and sometimes `-1`, due to numerical accuracy issues. Actually, as shown in a separate comment below, `rolling_corr` also produces inconsistent results for a constant subseries following different prior values.

Inconsistencies in calculating second moments of a single point:

Instability in `ewmcorr` of a constant series with itself:
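The original examples for the two headings above are not reproduced here. As a stand-in, the following plain-Python sketch illustrates both effects under stated assumptions; `variance` and `corr_from_moments` are hypothetical helpers, not pandas functions. First, the variance of a single value is 0 under biased estimation (`ddof=0`) but undefined (0/0) under unbiased estimation (`ddof=1`). Second, if a correlation is formed as covariance over the square root of the product of variances, then rounding noise in second moments that should be exactly zero is one plausible way to end up with `NaN`, `1`, or `-1`.

```python
import math


def variance(values, ddof):
    """Plain two-pass variance; NaN when n - ddof is not positive."""
    n = len(values)
    if n - ddof <= 0:
        return math.nan
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def corr_from_moments(cov, var_x, var_y):
    """Correlation from precomputed second moments (illustrative only)."""
    product = var_x * var_y
    if not product > 0.0:
        # Zero, negative, or NaN product: correlation is undefined.
        return math.nan
    return cov / math.sqrt(product)


if __name__ == "__main__":
    # Single point: biased vs. unbiased estimation.
    print(variance([5.0], ddof=0), variance([5.0], ddof=1))

    # A constant series against itself: all moments should be exactly 0,
    # but tiny rounding residue of either sign changes the answer.
    for noise in (0.0, 1e-17, -1e-17):
        print(noise, corr_from_moments(noise, noise, noise))
```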