# bottleneck: Wrong mean for float32 array #1346
I can't reproduce this:

```python
In [6]: ds = xr.open_dataset('./Downloads/ERAIN-t2m-1983-2012.seasmean.nc')

In [7]: ds.var167.mean()
Out[7]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)

In [8]: ds.var167.data.mean()
Out[8]: 278.62466
```

Which versions of xarray, dask, and Python are you using?

---
OK, I am on macOS:

---
Also on macOS, and I can reproduce. Using Python 2.7.11, xarray 0.9.1, dask 0.14.1 installed through Anaconda. I get the same results with xarray 0.9.1-38-gc0178b7 from GitHub.

```python
In [3]: ds = xarray.open_dataset('ERAIN-t2m-1983-2012.seasmean.nc')

In [4]: ds.var167.mean()
Out[4]:
<xarray.DataArray 'var167' ()>
array(261.6441345214844, dtype=float32)
```

Curiously, I get the right result with `skipna=False`...

```python
In [10]: ds.var167.mean(skipna=False)
Out[10]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)
```

... or by specifying the coordinates to average over:

```python
In [5]: ds.var167.mean(('time', 'lat', 'lon'))
Out[5]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)
```

---
Does it make a difference if you load the data first?

---
I think this might be a problem with bottleneck? My interpretation of `_create_nan_agg_method` in `xarray/core/ops.py` is that it may use bottleneck to get the mean unless you pass `skipna=False` or specify multiple axes. And:

```python
In [2]: import bottleneck

In [3]: bottleneck.__version__
Out[3]: '1.2.0'

In [6]: bottleneck.nanmean(ds.var167.data)
Out[6]: 261.6441345214844
```

Forgive me if I'm wrong, I'm still a bit new.

---
Yes, this is probably related to the use of bottleneck. The fact that the dtype is float32 is a sign that this is probably a numerical precision issue. Try casting to float64 first. If you really cared about performance with float32, the other thing to do to improve the conditioning is to subtract, and afterwards re-add, a number close to the mean.

---
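Both remedies can be sketched like this (the array contents and the `offset` value are illustrative, not taken from the dataset in this issue; note that NumPy's own float32 mean is already accurate here thanks to pairwise summation, so these tricks matter mainly when a naive reducer is in the loop):

```python
import numpy as np

# A largeish float32 array whose true mean is ~278.6 (a made-up
# stand-in for the temperature field in this issue).
x = np.full(2**24, 278.6, dtype=np.float32)

# Remedy 1: cast to float64 before reducing.
m64 = x.astype(np.float64).mean()

# Remedy 2: stay in float32, but improve the conditioning by
# subtracting a value near the mean, averaging the small residuals,
# and adding the offset back at the end.
offset = np.float32(278.0)
m32 = offset + (x - offset).mean()
```

Both results land within float32 rounding of 278.6, at very different memory costs: remedy 1 materializes a float64 copy of the array, remedy 2 stays in float32.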
Thanks all for the replies.

---
@matteodefelice you didn't decide on float32, but your data is stored that way. It's really hard to make choices about numerical precision for computations automatically: if we converted automatically to float64, somebody else would be complaining about unexpected memory usage :). Looking at our options, we could:

1. Stop using bottleneck on float32 arrays.

---
Sorry to unearth this issue again, but I just got bitten by this quite badly. I'm looking at absolute temperature perturbations, and bottleneck's implementation together with my data being loaded as float32 gives me significantly wrong means.

Would it be worth adding a warning (until the right solution is found) if someone computes a float32 mean through bottleneck? Based on a little experimentation (https://gist.github.com/leifdenby/8e874d3440a1ac96f96465a418f158ab), bottleneck's mean function builds up significant errors even with moderately sized arrays if they are float32.

---
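The failure mode is easy to reproduce with plain NumPy scalars (a sketch assuming, as this thread establishes, that bottleneck 1.2 accumulates a float32 sum left to right):

```python
import numpy as np

n = 2**25
x = np.ones(n, dtype=np.float32)

# NumPy's own reduction is pairwise, so this float32 mean is exact:
exact_mean = float(x.mean())

# A naive running float32 sum stalls once it reaches 2**24, because
# adding 1.0 to 16777216.0 in float32 rounds straight back down:
s = np.float32(2**24)
assert s + np.float32(1.0) == s

# A left-to-right mean of 2**25 ones therefore comes out as
# 2**24 / 2**25 = 0.5, the same wrong value bn.nanmean reports
# for this input elsewhere in this thread.
naive_mean = float(s) / n
```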
I would rather pick option (1) above, that is, "Stop using bottleneck on float32 arrays".

---
Is it worth changing bottleneck to use double for single-precision reductions? AFAICT this is a matter of changing the accumulator type in the bottleneck source.

---
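In NumPy terms, that proposed change amounts to the following (a sketch: the `dtype=np.float64` argument stands in for switching bottleneck's C accumulator to double; this is not bottleneck's actual code):

```python
import numpy as np

x = np.ones(2**25, dtype=np.float32)

# Upcast each element and accumulate in double, rounding to float32
# only once at the end. Even naive left-to-right accumulation would be
# exact here with a double accumulator, since 2**25 is far below 2**53.
acc = x.sum(dtype=np.float64)
mean32 = np.float32(acc / x.size)
```

The trade-off is only a modest slowdown from the per-element upcast, with no extra memory needed for a float64 copy of the array.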
I think (!) xarray is no longer affected, but pandas is. Bisecting the git history leads to commit 0b9ab2d, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.
A couple of minimal examples:

```python
>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn
>>> bn.nanmean(np.ones(2**25, dtype=np.float32))
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean()  # not affected for this version
<xarray.DataArray ()>
array(1., dtype=float32)
```

Done with the following versions:

```
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...
```

---
Ah, OK, I suppose bottleneck is indeed now avoided for float32 in xarray. Yeah, that issue is for a different function, but the source of the problem and the proposed solution in the thread are the same: use higher-precision intermediates for float32 (double arithmetic), a small speed vs. accuracy trade-off.

---
Oh hm, I think I didn't really understand what happens in NumPy then. Isn't this what bottleneck is doing: summing up a bunch of float32 values and then dividing by the length?

---
The difference is that Bottleneck does the sum in the naive way, whereas NumPy uses the more numerically stable pairwise summation. |
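The two strategies can be sketched as follows (illustrative only, not either library's actual implementation; the input values are made up):

```python
import numpy as np

def naive_sum(a):
    # Left-to-right accumulation in float32: worst-case error grows O(n).
    s = np.float32(0.0)
    for v in a:
        s = np.float32(s + v)
    return s

def pairwise_sum(a):
    # Recursive pairwise summation: error grows only O(log n). NumPy's
    # reductions use a variant of this with a larger base case (~128).
    if len(a) <= 8:
        return naive_sum(a)
    mid = len(a) // 2
    return np.float32(pairwise_sum(a[:mid]) + pairwise_sum(a[mid:]))

# 2**16 copies of float32(0.1); the exact sum, computed in float64,
# is about 6553.6.
x = np.full(2**16, 0.1, dtype=np.float32)
exact = float(x[0]) * x.size

err_naive = abs(float(naive_sum(x)) - exact)
err_pairwise = abs(float(pairwise_sum(x)) - exact)
```

With this input the naive float32 sum drifts measurably, while the pairwise result stays much closer to the exact value, which is the whole argument for pairwise summation at essentially no extra cost.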
Oh yes, of course! I had underestimated how low the precision of float32 is above `2**24`. Thanks for the hint.

---
Yes, that sounds right. Thanks!

---
On second thought, we should add this to a FAQ page.

---
I think it is better to have this discussion here instead of on the dask page (dask/dask#2095). This is the replicable "bug":

The dataset is ~65 MB; here is the file: https://www.dropbox.com/s/xtj3fm7ihtbwd5r/ERAIN-t2m-1983-2012.seasmean.nc?dl=0

It is a quite normal NetCDF file (no NaNs), just processed with CDO, as you can see in the dask issue.