float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray #2304
To clarify: why is it a problem for you to get floating point values like 21.939998626708984 instead of 21.940000000000001? Is it a loss of precision in some downstream calculation? Both numbers are accurate well within the precision indicated by the netCDF file (0.01). Note that it's very easy to later convert from float32 to float64, e.g., by writing an explicit cast.
Thank you for your quick answer.
You're right when you say
You'll have a float64 in the end, but you won't get your precision back, and it might be a problem in some cases. I understand the benefits of using float32 on the memory side, but it is kind of a problem for us each time we have variables using scale factors. I'm surprised this issue (if considered as one) does not bother more people.
As mentioned in the original issue, the modification is straightforward.
Some people might prefer float32, so it is not as straightforward as it seems. It might be possible to add an option for this, but I didn't look into the details.
Note that this is a false sense of precision: in the example above the compression used is lossy, i.e. precision was lost at compression time, and the actual precision is now 0.01.

A float32 value has 24 bits of precision in the significand, which is more than enough to store the 16 bits in the original data; the exponent (8 bits) will more or less take care of the scale factor:

```python
>>> import numpy as np
>>> np.float32(2194 * 0.01)
21.94
```

What you're seeing is an artifact of printing out the values. I have no idea why something is printing out a float (only 7 decimal digits) out to 17 digits. Even float64 only has 16 digits (which is overkill for this application). The difference in subtracting the 32- and 64-bit values above is in the 8th decimal place, which is beyond the actual precision of the data; what you've just demonstrated is the difference in precision between 32-bit and 64-bit values, but it has nothing whatsoever to do with the data. If you're really worried about precision round-off for things like standard deviation, you should probably calculate it using the raw integer values and scale afterwards. (I don't actually think this is necessary, though.)
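The printing artifact can be made concrete with a short numpy sketch (values mirror the example above):

```python
import numpy as np

# float32 carries ~7 significant decimal digits; printing 17 digits
# just exposes binary representation noise beyond that.
x32 = np.float32(2194 * 0.01)
print(f"{x32:.17f}")

# The float32/float64 discrepancy is far below the 0.01 precision
# implied by the scale factor.
diff = abs(np.float64(x32) - 2194 * 0.01)
print(diff < 0.005)  # True: well within the data's real precision
```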
Right. The actual raw data is being stored as an integer. I would be happy to add options for whether to default to float32 or float64 precision. There are clearly tradeoffs here:
I don't think we can make a statement about which is better in general. The best we can do is make an educated guess about which will be more useful / less surprising for most and/or new users, and pick that as the default.
@shoyer But since it's a downstream calculation issue, and does not impact the actual precision of what's being read from the file, what's wrong with saying "Use …"?
It's almost but not quite identical. The difference is that the data gets scaled twice. This adds twice the overhead for scaling the values (which, to be fair, is usually negligible compared to IO). Also, to get exactly equivalent numerics for further computation you would need to round again.
I'm not following why the data are scaled twice. Your point about the rounding being different is well-taken, though. |
We automatically scale the data from int16->float32 upon reading it in xarray (if decode_cf=True). There's no way to turn that off and still get automatic scaling, so the best you can do is layer on int16->float32->float64, when you might prefer to only do int16->float64. |
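The two decoding paths being compared can be sketched with numpy alone (the array values are illustrative):

```python
import numpy as np

# Hypothetical raw packed values as stored in the file (int16).
raw = np.array([2194, 2704], dtype=np.int16)
scale = 0.01

# The layered path: int16 -> float32 (xarray's decoding) -> float64.
via_f32 = (raw.astype(np.float32) * np.float32(scale)).astype(np.float64)

# The direct path: int16 -> float64.
direct = raw.astype(np.float64) * np.float64(scale)

# The two differ at the ~1e-6 level, but rounding to the precision
# implied by the scale factor (two decimals here) reconciles them.
print(np.max(np.abs(via_f32 - direct)))
print(np.allclose(via_f32.round(2), direct, atol=1e-9))  # True
```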
Ah, ok, not scaling per se (i.e. * 0.01), but a second round of value conversion.
Both multiplying by 0.01 and float32 -> float64 are approximately
equivalently expensive. The cost is dominated by the memory copy.
To explain the full context and why it became some kind of a problem for us: we're experimenting with the parquet format (via pyarrow), and we first did something like:

netcdf file -> netcdf4 -> pandas -> pyarrow -> pandas (when read later on)

We're now looking at xarray and the huge ease of access it offers to netcdf-like data, and we tried something similar:

netcdf file -> xarray -> pandas -> pyarrow -> pandas (when read later on)

Our problem appears when we're reading and comparing the data stored with these 2 approaches. The difference between the 2 was - sometimes - larger than what we expected/considered acceptable (10e-6 for float32, if I'm not mistaken). We're not constraining any types, letting the system and modules decide how to encode what, and in the end we have significantly different values. There might be something wrong in our process, but it originates here with this float32/float64 choice, so we thought it might be a problem. Thanks for taking the time to look into this.

Please let us know if converting to float64 explicitly and rounding again does not solve this issue for you.
So, a more complete example showing this problem:

```python
from netCDF4 import Dataset
import xarray as xr
import numpy as np
import pandas as pd

d = Dataset("test.nc")
v = d.variables['var']
print(v)
#<class 'netCDF4._netCDF4.Variable'>
#int16 var(idx)
#    _FillValue: 32767
#    scale_factor: 0.01
#unlimited dimensions:
#current shape = (2,)
#filling on

df_nc = pd.DataFrame(data={'var': v[:]})
print(df_nc)
#     var
#0  21.94
#1  27.04

ds = xr.open_dataset("test.nc")
df_xr = ds['var'].to_dataframe()

# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
print(mask)
#[False  True]

print(df_xr)
#           var
#idx
#0    21.939999
#1    27.039999

# Changing the type and rounding the xarray dataframe
df_xr2 = df_xr.astype(np.float64).round(int(np.ceil(-np.log10(ds['var'].encoding['scale_factor']))))
mask = np.isclose(df_nc['var'], df_xr2['var'], rtol=0, atol=1e-6)
print(mask)
#[ True  True]

print(df_xr2)
#       var
#idx
#0    21.94
#1    27.04
```

As you can see, the problem appears early in the process (not related to the way data are stored in parquet later on), and yes, rounding the values does solve it.
Any updates about this?
I think we are still talking about different things. In the example by @Thomas-Z above there is still a problem at the line:

```python
# Comparing both dataframes with float32 precision (1e-6)
mask = np.isclose(df_nc['var'], df_xr['var'], rtol=0, atol=1e-6)
```

As discussed several times above, this test is misleading: it should assert with a tolerance matching the actual precision of the data (0.01 here), not float32 precision. @shoyer said:

> I would be happy to add options for whether to default to float32 or float64 precision.

so we would welcome a PR in this direction! I don't think we need to change the default behavior though, as there is a slight possibility that some people are relying on the data being float32.
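Redoing the earlier comparison with a tolerance matching the data's real precision (the scale factor, 0.01) rather than float32 epsilon illustrates the point (values mirror the example above):

```python
import numpy as np

nc_vals = np.array([21.94, 27.04])  # float64, as returned by netCDF4
xr_vals = (np.array([2194, 2704], dtype=np.int16).astype(np.float32)
           * np.float32(0.01))      # float32, as decoded by xarray

print(np.isclose(nc_vals, xr_vals, rtol=0, atol=1e-6))  # may contain False
print(np.isclose(nc_vals, xr_vals, rtol=0, atol=1e-2))  # all True
```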
Hi. In this case we would add a parameter to the open_dataset function called "force_promote", a boolean which is False by default and thus not mandatory. The check would then read:

```python
if dtype.itemsize <= 2 and not force_promote:
```

The downside of that is that we somehow pollute the code with a parameter that is used in one specific case. The second approach would check the value of an environment variable called "XARRAY_FORCE_PROMOTE"; if it exists and is set to true, it would force promoting the type to float64. Please tell us which approach best suits your vision of xarray. Regards.
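A standalone sketch of the first proposal (illustrative names, not actual xarray code; `force_promote` is the hypothetical parameter described above):

```python
import numpy as np

def choose_float_dtype(dtype, has_offset=False, force_promote=False):
    # Hypothetical variant of xarray's dtype selection with the
    # proposed force_promote flag: always decode to float64 when set.
    if force_promote:
        return np.float64
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer) and not has_offset:
        return np.float32
    return np.float64

print(choose_float_dtype(np.dtype(np.int16)))                      # <class 'numpy.float32'>
print(choose_float_dtype(np.dtype(np.int16), force_promote=True))  # <class 'numpy.float64'>
```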
Hi everyone,
@magau thanks for pointing this out -- I think we simply missed this part of the CF conventions document! Looking at the dtype of the scale_factor and add_offset attributes seems like the right way to go. We will still need some fall-back choice for files that don't follow the conventions.
Hey everyone, I stumbled on this while searching for approximately the same problem. Thought I'd share since the issue is still open. On my part, there are two situations that seem buggy. I haven't been using xarray for that long yet, so maybe there is something I'm missing here... My first problem relates to the data types of dimensions with float notation. To give another answer to @shoyer's question:

it is a problem in my case because I would like to perform slicing operations on a dataset using longitude values from another dataset. This operation raises a "KeyError: not all values found in index 'longitude'", since either one dataset's longitude is float32 and the other's is float64, or because the two datasets' float32 approximations are not exactly the same value. I can work around this by assigning new float64 coords after reading, and it works, though it is kind of a hassle considering I have to do this thousands of times. This situation also creates a problem when concatenating multiple netCDF files (along the time dim in my case): the discrepancies between the float32 approximations, or the float32 vs float64 mismatch, will add new dimension values where they shouldn't. The second part of my problem comes with writing/reading netCDF files (maybe more related to @daoudjahdou's problem). I tried to change the data type to float64 for all my files, save them, and then perform what I need to do, but for some reason, even though the dtype is float64 for all my dimensions when writing the files (using default args), it will sometimes be float32 and sometimes float64 when reading back the files (with default arg values) previously saved with float64 dtype. If using the default args, shouldn't decoding make the dimension dtypes the same for all the files I read?
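The coordinate-matching failure can be reproduced without any files (a numpy-only sketch; the 0.1-degree longitude grid is illustrative):

```python
import numpy as np

# The "same" longitudes built in float32 vs float64 are not equal when
# compared exactly, which is what exact label-based selection does.
lon32 = (np.arange(5, dtype=np.float32) * np.float32(0.1)).astype(np.float64)
lon64 = np.arange(5) * 0.1
print(lon32 == lon64)             # not all True
print(np.allclose(lon32, lon64))  # True: they only differ at ~1e-8

# Workaround in the spirit of the comment above: cast to float64 and
# round, so both coordinate sets agree exactly before selection.
lon_fixed = lon32.round(6)
print(np.array_equal(lon_fixed, lon64.round(6)))  # True
```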
Dear all, and thank you for your work on xarray. Linked to @magau's comment, I have a netcdf file with multiple variables in different formats (float, short, byte). Below is an example of the 3 formats (from ncdump -h):

And how they appear after opening as an xarray using open_mfdataset:

Is there any recommendation?
I've run into this issue too, and the xarray decision to use float32 makes a real difference here. The data value is 1395:

```python
import numpy as np

val = int(1395)
scale = 0.0001
print(val * scale)                               # 0.1395
print(val * np.array(scale).astype(float))       # 0.1395
print(val * np.array(scale).astype(np.float16))  # 0.1395213...
print(val * np.array(scale).astype(np.float32))  # 0.13949999...
print(val * np.array(scale).astype(np.float64))  # 0.1395
```

Because we are using float32, the unpacked values are needlessly imprecise.
We'd happily take a PR implementing the suggestion above following CF-conventions.
IIUC the change should be made here, in xarray/coding/variables.py (lines 266 to 283 at commit 392a614).
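A sketch of the CF-conventions rule being suggested (an illustrative function, not xarray's actual implementation): the unpacked dtype follows the dtype of the scale_factor / add_offset attributes rather than the packed integer type.

```python
import numpy as np

def cf_unpacked_dtype(packed_dtype, scale_factor=None, add_offset=None):
    # CF packed-data rule (sketch): if scale_factor or add_offset is
    # present, the unpacked data takes the (float) dtype of those
    # attributes; otherwise the packed dtype is kept.
    attrs = [a for a in (scale_factor, add_offset) if a is not None]
    if not attrs:
        return np.dtype(packed_dtype)
    return np.result_type(*[np.asarray(a).dtype for a in attrs])

print(cf_unpacked_dtype(np.int16, scale_factor=np.float32(0.01)))  # float32
print(cf_unpacked_dtype(np.int16, scale_factor=0.01))              # float64 (a Python float is double)
```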
This issue, based on its title and initial post, is fixed by PR #6851. The code to select the dtype was already correct, but the outer function that called it had a bug in the call. Per the CF spec:

I find this ambiguous: is "float" here single or double precision?

The broader discussion here is about CF compliance. I find the spec ambiguous and xarray non-compliant. So many tests rely on the existing behavior that I am unsure how best to proceed to improve compliance. I worry it may be a major refactor, and possibly break things relying on the existing behavior. I'd like to discuss architecture. Should this be in a new issue, if this closes with PR #6851? Should there be a new keyword for this behavior?
Yes, I'm pretty sure "float" means single precision (np.float32), given that "double" certainly means double precision (np.float64).
Yes, I believe so.
I think we can treat this as a bug fix and just go forward with it. Yes, some people are going to be surprised, but I don't think it's disruptive enough that we need to go to a major effort to preserve backwards compatibility. It should already be straightforward to work around by setting …
Current algorithm:

```python
def _choose_float_dtype(dtype, has_offset):
    """Return a float dtype that can losslessly represent `dtype` values."""
    # Keep float32 as-is. Upcast half-precision to single-precision,
    # because float16 is "intended for storage but not computation"
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    # float32 can exactly represent all integers up to 24 bits
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        # A scale factor is entirely safe (vanishing into the mantissa),
        # but a large integer offset could lead to loss of precision.
        # Sensitivity analysis can be tricky, so we just use a float64
        # if there's any offset at all - better unoptimised than wrong!
        if not has_offset:
            return np.float32
    # For all other types and circumstances, we just use float64.
    # (safe because eg. complex numbers are not supported in NetCDF)
    return np.float64
```

Due to the calling bug, the function effectively behaves as:

```python
def _choose_float_dtype(dtype):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        return np.float32
    return np.float64
```

Here I call the function twice, once with `has_offset=False` and once with `has_offset=True`:

```python
import numpy as np

def _choose_float_dtype(dtype, has_offset):
    if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
        return np.float32
    if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
        if not has_offset:
            return np.float32
    return np.float64

# generic types
for dtype in [np.byte, np.ubyte, np.short, np.ushort, np.intc, np.uintc,
              np.int_, np.uint, np.longlong, np.ulonglong,
              np.half, np.float16, np.single, np.double, np.longdouble,
              np.csingle, np.cdouble, np.clongdouble,
              np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64,
              np.float16, np.float32, np.float64]:
    print("|", dtype, "|", _choose_float_dtype(np.dtype(dtype), False),
          "|", _choose_float_dtype(np.dtype(dtype), True), "|")
```
I think this means double is advised? If so, this should be stated. It should be rephrased to advise what to do (if there is one or only a few choices) rather than what not to do, or at least include that alongside the current wording.
Packing Qs
Unpacking Qs
Hello, I recently upgraded to the latest version of xarray. I noticed a change in the dtype, and it is likely related to this issue and the associated PR. Briefly, I have a raster with the following encoding:

```
"dtype": "<u2",
"units": "digital_counts",
"fill_value": 0,
"_FillValue": 0,
"scale_factor": 0.0001,
"add_offset": -0.1,
"valid_min": 1,
"valid_max": 65535,
```

It is encoded as a 2-byte unsigned integer (uint16), and before was opened as a float32 with the default decoding. The dtype is now chosen by:

```python
def _choose_float_dtype(
    dtype: np.dtype, mapping: MutableMapping
) -> type[np.floating[Any]]:
```

(`_choose_float_dtype` in the related PR #8713.)

I also realize that I relied on the previous behaviour of getting float32. The issue is that, since I specify both scale_factor and add_offset, the code does:

```python
if scale_factor is not None:
    scale_type = np.dtype(type(scale_factor))
if add_offset is not None:
    offset_type = np.dtype(type(add_offset))
```

This will return the platform-dependent float (float64 for me), without checking the original size of the dtype. Maybe this is the correct behaviour, but in my case I would like to be able to choose to use float32. For now, I will probably have to open my rasters with CF decoding disabled.

Versions
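A numpy-only sketch of a manual-decoding fallback (an assumption about the workaround, using a stand-in array rather than a real raster file):

```python
import numpy as np

# Stand-in for raw uint16 values read with scaling disabled
# (e.g. xr.open_dataset(path, mask_and_scale=False)); the attributes
# mirror the encoding shown above.
packed = np.array([1395, 2790], dtype=np.uint16)
attrs = {"scale_factor": 0.0001, "add_offset": -0.1, "_FillValue": 0}

# Decode at a chosen precision (float32 here, the pre-upgrade behaviour).
scale = np.float32(attrs["scale_factor"])
offset = np.float32(attrs["add_offset"])
decoded = packed.astype(np.float32) * scale + offset
print(decoded.dtype)  # float32
```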
Code Sample:

Considering a netcdf file with the following variable:

Code:

Problem description:

Different behaviour of xarray compared to netCDF4's Dataset: when reading the dataset with xarray, we found that the decoded type was numpy.float32 instead of numpy.float64. This netcdf variable has an int16 dtype; when the variable is read with the netCDF4 library directly, it is automatically converted to numpy.float64. In our case we lose precision when using xarray. We found two solutions for this:

First solution:
This solution prevents auto_maskandscale (modification in xarray/backends/netCDF4_.py, line 241).

Second solution:
This solution uses numpy.float64 whatever integer type is provided (modification in xarray/core/dtypes.py, line 85).

Solution number 2 would be great for us. At this point we don't know if this modification would introduce some side effects. Is there another way to avoid this problem?

Expected Output:

In our case we expect the variable to be numpy.float64, as it is with netCDF4.
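The first solution above can be sketched without a real file (the raw values are stand-ins; with netCDF4 one would call `var.set_auto_maskandscale(False)` and read `var[:]` to get the packed integers):

```python
import numpy as np

# Sketch of the "prevent auto_maskandscale" idea: read the raw packed
# int16 values and do the scaling in float64 ourselves.
raw = np.array([2194, 2704], dtype=np.int16)  # stand-in for var[:]
scale_factor = 0.01                           # from the variable's attributes

decoded = raw.astype(np.float64) * np.float64(scale_factor)
print(decoded.dtype)  # float64
```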
Output of
xr.show_versions()
xarray: 0.10.8
pandas: 0.23.3
numpy: 1.15.0rc2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: 3.6.3
IPython: 6.4.0
sphinx: None