PR #2751: Ability to force float64 instead of float32 (issue #2304)
Base branch: main
Commits (17): fef5fe7, 093a338, 47d4207, 3804d56, 0142546, 5b421c8, c1a2a03, b05193d, b42c744, a94f2af, c394509, 99599ed, 1de0884, 638332c, 41da205, 29d191a, c535df2
```diff
@@ -69,3 +69,4 @@ xarray/version.py
 Icon*

 .ipynb_checkpoints
+.nfs*
```
```diff
@@ -189,8 +189,49 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype):
     return data


-def _choose_float_dtype(dtype, has_offset):
+def _choose_decoding_float_dtype(dtype, scale_factor, add_offset):
+    """Return a float dtype according to cf-convention"""
+    # Implementing the CF convention
+    # (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html):
+    # If the scale_factor and add_offset attributes are of the same
+    # data type as the associated variable, the unpacked data is assumed
+    # to be of the same data type as the packed data. However, if the
+    # scale_factor and add_offset attributes are of a different data
+    # type from the variable (containing the packed data), then the
+    # unpacked data should match the type of these attributes, which
+    # must both be of type float or both be of type double. An
+    # additional restriction in this case is that the variable
+    # containing the packed data must be of type byte, short or int.
+    # It is not advised to unpack an int into a float, as there is a
+    # potential precision loss.
+
+    if scale_factor or add_offset:
```
Review comment on `if scale_factor or add_offset:`: It's safer to use …
```diff
+        types = (np.dtype(type(scale_factor)),
+                 np.dtype(type(add_offset)),
+                 np.dtype(dtype))
+
+        if add_offset is None:
+            types = (np.dtype(type(scale_factor)),
+                     np.dtype(dtype))
```
Review comment: It would be nice if these five lines creating `types` could be simplified to something like:

```python
types = [np.dtype(dtype)]
if scale_factor is not None:
    types.append(np.dtype(type(scale_factor)))
if add_offset is not None:
    types.append(np.dtype(type(add_offset)))
```
```diff
+
+        # scaled_dtype should be the largest type we find
+        scaled_dtype = dtypes.result_type(*types)
+
+        # We return it only if it's a float32 or a float64
+        if (scaled_dtype.itemsize >= 4
+                and np.issubdtype(scaled_dtype, np.floating)):
+            return scaled_dtype
+
+    return _choose_encoding_float_dtype(dtype, add_offset is not None)
```
Review comment: Potentially we could use the …
```diff
+
+
+def _choose_encoding_float_dtype(dtype, has_offset):
     """Return a float dtype that can losslessly represent `dtype` values."""

     # Keep float32 as-is. Upcast half-precision to single-precision,
     # because float16 is "intended for storage but not computation"
     if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
```
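To make the promotion rule concrete, here is a rough standalone sketch of the dtype selection above, using plain `np.result_type` in place of xarray's internal `dtypes.result_type` and a simplified fallback, so this is an illustration rather than the PR's exact code:

```python
import numpy as np

def choose_decoding_dtype(packed_dtype, scale_factor=None, add_offset=None):
    # Gather the dtype of the packed data plus those of any CF attributes.
    types = [np.dtype(packed_dtype)]
    if scale_factor is not None:
        types.append(np.dtype(type(scale_factor)))
    if add_offset is not None:
        types.append(np.dtype(type(add_offset)))
    # Promote to the largest common dtype.
    promoted = np.result_type(*types)
    # Accept the promotion only if it is float32 or float64.
    if promoted.itemsize >= 4 and np.issubdtype(promoted, np.floating):
        return promoted
    # Simplified stand-in for the PR's fallback: a float wide enough
    # for lossless decoding of the packed integers.
    if np.dtype(packed_dtype).itemsize > 2:
        return np.dtype(np.float64)
    return np.dtype(np.float32)

# int16 packed data with a float64 scale_factor unpacks to float64:
print(choose_decoding_dtype(np.int16, scale_factor=np.float64(0.01)))  # float64
# with a float32 scale_factor it stays float32:
print(choose_decoding_dtype(np.int16, scale_factor=np.float32(0.1)))   # float32
```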
```diff
@@ -217,9 +258,9 @@ class CFScaleOffsetCoder(VariableCoder):

     def encode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_encoding(variable)

         if 'scale_factor' in encoding or 'add_offset' in encoding:
```
Review comment: Alternative logic for this clause that I think might work:

```python
if 'scale_factor' in encoding or 'add_offset' in encoding:
    scale_factor = encoding.pop('scale_factor', 1)
    add_offset = encoding.pop('add_offset', 0)
    if 'dtype' in encoding:
        dtype = pop_to(encoding, attrs, 'dtype', name=name)
    else:
        dtype = _choose_encoding_float_dtype(data.dtype, add_offset != 0)
    # save scale_factor and add_offset with dtype matching the decoded
    # data, per CF conventions
    safe_setitem(attrs, 'scale_factor', data.dtype.type(scale_factor), name)
    safe_setitem(attrs, 'add_offset', data.dtype.type(add_offset), name)
    # apply scale/offset encoding
    data = data.astype(dtype=dtype, copy=True)
    if add_offset:
        data -= add_offset
    if scale_factor != 1:
        # multiplication is faster than division
        data *= 1 / scale_factor
```

The idea here is to always write the …
```diff
-            dtype = _choose_float_dtype(data.dtype, 'add_offset' in encoding)
+            dtype = _choose_encoding_float_dtype(data.dtype,
+                                                 'add_offset' in encoding)
             data = data.astype(dtype=dtype, copy=True)
             if 'add_offset' in encoding:
```
Review comment: This line (and the line below for …
```diff
                 data -= pop_to(encoding, attrs, 'add_offset', name=name)
```
```diff
@@ -232,9 +273,12 @@ def decode(self, variable, name=None):
         dims, data, attrs, encoding = unpack_for_decoding(variable)

         if 'scale_factor' in attrs or 'add_offset' in attrs:
+
             scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
             add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
-            dtype = _choose_float_dtype(data.dtype, 'add_offset' in attrs)
+            dtype = _choose_decoding_float_dtype(data.dtype,
+                                                 scale_factor, add_offset)
+
             transform = partial(_scale_offset_decoding,
                                 scale_factor=scale_factor,
                                 add_offset=add_offset,
```
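For reference, the decode transform wired up here is simply `decoded = packed * scale_factor + add_offset`, applied after casting to the chosen float dtype. A minimal standalone sketch (not xarray's actual `_scale_offset_decoding`, whose signature appears in the hunk above):

```python
import numpy as np

def scale_offset_decode(packed, scale_factor=None, add_offset=None,
                        dtype=np.float64):
    # Cast to the target float dtype first so the arithmetic cannot
    # truncate the packed integers.
    data = np.asarray(packed).astype(dtype)
    if scale_factor is not None:
        data = data * scale_factor
    if add_offset is not None:
        data = data + add_offset
    return data

packed = np.array([-1123, 123, 1123, 2123], dtype=np.int16)
print(scale_offset_decode(packed, scale_factor=0.01, add_offset=10))
# [-1.23 11.23 21.23 31.23]
```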
```diff
@@ -685,7 +685,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
         with self.roundtrip(decoded) as actual:
             for k in decoded.variables:
                 assert (decoded.variables[k].dtype
-                        == actual.variables[k].dtype)
+                        == actual.variables[k].dtype
+                        or (decoded.variables[k].dtype == np.float32 and
+                            actual.variables[k].dtype == np.float64))
             assert_allclose(decoded, actual, decode_bytes=False)

         with self.roundtrip(decoded,
```
```diff
@@ -711,7 +713,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
         with self.roundtrip(encoded) as actual:
             for k in decoded.variables:
                 assert (decoded.variables[k].dtype ==
-                        actual.variables[k].dtype)
+                        actual.variables[k].dtype or
+                        (decoded.variables[k].dtype == np.float32 and
+                         actual.variables[k].dtype == np.float64))
             assert_allclose(decoded, actual, decode_bytes=False)

     def test_coordinates_encoding(self):
```
```diff
@@ -1156,15 +1160,16 @@ def test_mask_and_scale(self):
             nc.createVariable('x', 'int16', ('t',), fill_value=-1)
             v = nc.variables['x']
             v.set_auto_maskandscale(False)
-            v.add_offset = 10
-            v.scale_factor = 0.1
+            v.add_offset = np.float32(10)
+            v.scale_factor = np.float32(0.1)
             v[:] = np.array([-1, -1, 0, 1, 2])

         # first make sure netCDF4 reads the masked and scaled data
         # correctly
         with nc4.Dataset(tmp_file, mode='r') as nc:
             expected = np.ma.array([-1, -1, 10, 10.1, 10.2],
-                                   mask=[True, True, False, False, False])
+                                   mask=[True, True, False, False, False],
+                                   dtype=np.float32)
             actual = nc.variables['x'][:]
             assert_array_equal(expected, actual)
```
```diff
@@ -1173,6 +1178,25 @@ def test_mask_and_scale(self):
             expected = create_masked_and_scaled_data()
             assert_identical(expected, ds)

+    def test_mask_and_scale_with_float64_scale_factor(self):
```
Review comment: It would be nice to test all of the edge cases just using … See the tests in …
```diff
+        with create_tmp_file() as tmp_file:
+            with nc4.Dataset(tmp_file, mode='w') as nc:
+                nc.createDimension('t', 5)
+                nc.createVariable('x', 'int16', ('t',), fill_value=-1)
+                v = nc.variables['x']
+                v.scale_factor = 0.01
+                v.add_offset = 10
+                v[:] = np.array([-1123, -1123, 123, 1123, 2123])
+            # We read the newly created netcdf file
+            with nc4.Dataset(tmp_file, mode='r') as nc:
+                # we open the dataset
+                with open_dataset(tmp_file) as ds:
+                    # Both datasets' values should be equal,
+                    # and both should be float64 arrays
+                    dsv = ds['x'].values
+                    ncv = nc.variables['x'][:]
+                    np.testing.assert_array_almost_equal(dsv, ncv, 15)

     def test_0dimensional_variable(self):
         # This fix verifies our work-around to this netCDF4-python bug:
         # https://github.com/Unidata/netcdf4-python/pull/220
```
Review comment (shoyer): What should we do if `add_offset`/`scale_factor` are not floats? Right now in your implementation, if `scale_factor` is mistakenly `np.int64(10)` then the data would get decoded as int64, which seems like a bad idea (e.g., if the data is some float type). I think we should probably ignore the type of `scale_factor`/`add_offset` in these cases instead.
Reply: Hi @shoyer! Sorry for my delayed intervention... First of all, I'd like to thank and congratulate you for your efficiency, and say that I'm very glad to see that you've taken my considerations on this topic into account.

About the comment above: since the CFScaleOffsetCoder will always multiply the raw data by the scale_factor, if the scale_factor is an integer the decoded values will always be integers too (unless the raw data dtype is float, as you've mentioned). Even in these cases I'd return the decoded values with the same dtype as the scale_factor/add_offset, which is the expected behavior (data providers must be aware...). Users can always use open_dataset with decode_cf=False and call the decode_cf_variable function later, after fixing the variable.attrs['scale_factor'] / variable.attrs['add_offset'] dtypes.

A possible solution is to add scale_factor and add_offset as arguments of the open_dataset function, to overwrite the original ones; they would take labeled values like scale_factor={'var1': whatever, 'var2': whatever}, add_offset={'var1': whatever, 'var2': whatever}. This would also resolve issue #2304 (with a better approach, in my opinion).

By the way, I've also noticed that CFScaleOffsetCoder.encode changes the values' dtype before applying the inverse operation (i.e. dividing by the scale_factor), which will produce truncated values. I've faced this situation with my own netCDF encoding processes, and would advise changing the dtype only after applying the transformation, and using the numpy.round function (for integer encoding) to avoid truncation artifacts.

Once again, sorry for my extensive comments and for not opening an issue, as I should have; my intention is for you to evaluate my considerations, make your own decisions with your colleagues, and keep up the nice work you've all been doing. 👍

Cheers!
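A small numeric illustration of the truncation artifact described in the reply above (hypothetical values, not code from the PR):

```python
import numpy as np

decoded = np.array([10.07])
scale_factor, add_offset = 0.01, 10.0

# Casting to the packed integer dtype *before* the inverse transform
# truncates the fractional part and corrupts the packed value:
bad = ((decoded - add_offset).astype(np.int16) / scale_factor).astype(np.int16)

# Transforming first, then rounding, then casting preserves it:
good = np.round((decoded - add_offset) / scale_factor).astype(np.int16)

print(bad, good)  # [0] [7]
```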