Ability to force float64 instead of float32 issue #2304 #2751
base: main
Conversation
When dealing with int variables with a scale_factor, xarray transforms them to float32 by default. This enhancement aims to give the possibility to force that to float64 when needed. The behaviour is unchanged when not specifying the parameter force_promote_float64.
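A minimal sketch of the behaviour being described, assuming xarray's current defaults (variable names and values are illustrative):

import numpy as np
import xarray as xr

# int16 data packed with a float64 scale_factor
packed = xr.Variable(('x',), np.array([1, 2, 3], dtype='int16'),
                     attrs={'scale_factor': np.float64(0.01)})
decoded = xr.decode_cf(xr.Dataset({'v': packed}))
print(decoded['v'].dtype)  # float32 today, even though the scale_factor is float64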
@shoyer did you have a look at this?
@daoudjahdou sorry I missed your comment over in #2304. Threading the new parameter down through other functions is definitely the preferred approach here -- mutating global variables makes it much harder to follow control flow. Though I would probably stop short of modifying
@shoyer the logic is now propagated down to _choose_float_dtype inside CFScaleOffsetCoder, please let me know what you think.
@daoudjahdou what do you think of @magau's proposed solution over in #2304 (comment)? Checking the dtypes of
@shoyer yes sure, I'll update the pull request with the mentioned modifications.
@shoyer now scale_factor and add_offset are taken into account when encoding and decoding data.
dtype = _choose_float_dtype(data.dtype, 'add_offset' in encoding)
scale_factor = pop_to(attrs, encoding, 'scale_factor', name=name)
add_offset = pop_to(attrs, encoding, 'add_offset', name=name)
dtype = _choose_float_dtype(data.dtype, scale_factor, add_offset)
data = data.astype(dtype=dtype, copy=True)
if 'add_offset' in encoding:
This line (and the line below for scale_factor) needs to be updated to use the add_offset variable if it isn't None, rather than checking for it in encoding again. (pop_to removes an item from the dict in which it's found.)
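A self-contained illustration of that point; the pop_to below is a simplified stand-in for the helper in xarray.coding.variables, not the real implementation:

attrs = {'scale_factor': 0.01, 'add_offset': 5.0}
encoding = {}

def pop_to(source, dest, key, name=None):
    # simplified: move key from source to dest and return its value
    value = source.pop(key, None)
    if value is not None:
        dest[key] = value
    return value

add_offset = pop_to(attrs, encoding, 'add_offset')
assert 'add_offset' not in attrs   # the key has been moved out of attrs
assert add_offset is not None      # so test the returned value instead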
# from the variable (containing the packed data)
# then the unpacked data should match
# the type of these attributes, which must both be of type float or both
# be of type double. An additional restriction in this case is that
What should we do if add_offset/scale_factor are not floats? Right now in your implementation, if scale_factor is mistakenly np.int64(10) then the data would get decoded as int64, which seems like a bad idea (e.g., if the data is some float type). I think we should probably ignore the type of scale_factor/add_offset in these cases instead.
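One possible shape of that rule, as a hypothetical helper (not the PR's actual code), where only float-typed attributes are allowed to influence the decoded dtype:

import numpy as np

def candidate_dtypes(data_dtype, scale_factor=None, add_offset=None):
    # hypothetical helper: collect dtypes, skipping non-float attributes
    types = [np.dtype(data_dtype)]
    for attr in (scale_factor, add_offset):
        if attr is not None and np.issubdtype(np.dtype(type(attr)), np.floating):
            types.append(np.dtype(type(attr)))
    return types

# np.int64(10) is ignored, so a mistakenly integer scale_factor
# cannot turn float data into int64
print(np.result_type(*candidate_dtypes(np.dtype('float32'),
                                        np.int64(10), np.float64(0.5))))  # float64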
Hi @shoyer! Sorry for my delayed intervention... First of all I'd like to thank and congratulate you for your efficiency and say that I'm very glad to see that you've taken into account my considerations about this topic.
About the comment above I'd say that, since the CFScaleOffsetCoder will always multiply the raw data by the scale_factor, in case the scale_factor is an integer the decoded values will always result in integers too (unless the raw data dtype is float, as you've mentioned).
Even in these cases I'd return the decoded values with the same dtype as the scale_factor / add_offset, which is the expected behavior (data providers must be aware...). Users can always use open_dataset with decode_cf=False and call the decode_cf_variable function later, after fixing the variable.attrs['scale_factor / add_offset'] dtypes.
A possible solution is to add the scale_factor and add_offset as arguments of the open_dataset function, to overwrite the original ones, which would receive labeled values like: scale_factor={'var1': whatever, 'var2': whatever}, add_offset={'var1': whatever, 'var2': whatever}. This would also resolve the #2304 issue (with a better approach, in my opinion).
By the way, I've also noted that CFScaleOffsetCoder.encode changes the values.dtype before applying the inverse operation (i.e. dividing by the scale_factor), which will result in truncated values. I've already faced this situation with my own netCDF encoding processes, and would advise you to only change the dtype after applying the transformation and to use the numpy.round function (for integer encoding) to avoid truncation artifacts (a small illustration follows below).
Once again, sorry for my extensive comments and for not opening an issue, as I should, but my intention is that you evaluate my considerations and make your own decisions with your colleagues, and keep up the nice work you all have been doing. 👍
Cheers!
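A toy illustration of the rounding point, with made-up numbers:

import numpy as np

values = np.array([0.94, 1.06])
scale_factor = 0.1

# casting to the packed integer dtype without rounding truncates toward zero
truncated = (values / scale_factor).astype(np.int16)        # [9, 10]
# rounding first keeps the nearest packed value
rounded = np.round(values / scale_factor).astype(np.int16)  # [9, 11]
print(truncated, rounded)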
xarray/coding/variables.py
Outdated
# We first return scale_factor type
# as multiplying takes priority over
# adding values
if dtype is not type(scale_factor) and \
Do we actually need to check whether the type of scale_factor matches dtype? I think the logic works the same either way.
xarray/coding/variables.py
Outdated
# as multiplying takes priority over
# adding values
if dtype is not type(scale_factor) and \
isinstance(scale_factor, np.generic): |
nit: per PEP 8, please prefer using parentheses () to split across multiple lines instead of \
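For instance, with the identifiers from the hunk above and made-up values:

import numpy as np

scale_factor = np.float64(0.5)
dtype = np.float32

# implicit continuation inside parentheses instead of a trailing backslash
if (dtype is not type(scale_factor)
        and isinstance(scale_factor, np.generic)):
    print('scale_factor type differs from dtype')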
@@ -1156,6 +1157,25 @@ def test_mask_and_scale(self):
    expected = create_masked_and_scaled_data()
    assert_identical(expected, ds)

def test_mask_and_scale_with_float64_scale_factor(self):
It would be nice to test all of the edge cases just using CFScaleOffsetCoder, e.g.:
- what happens if add_offset and scale_factor have different types?
- what if they're integers instead of floats?
See the tests in tests/test_coding.py for examples; a sketch of such a test follows below.
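A sketch of the kind of coder-level test being suggested; the expected dtype reflects the promotion rules discussed in this thread, not merged behaviour:

import numpy as np
import xarray as xr
from xarray.coding import variables

def test_mixed_scale_and_offset_dtypes():
    original = xr.Variable(('x',), np.array([0, 1, 2], dtype='int16'),
                           attrs={'scale_factor': np.float64(0.1),
                                  'add_offset': np.float32(10)})
    decoded = variables.CFScaleOffsetCoder().decode(original, name='x')
    # the widest attribute dtype should win under the proposed rules
    assert decoded.dtype == np.float64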
xarray/coding/variables.py
Outdated
# We first return scale_factor type
# as multiplying takes priority over
# adding values
I'm a little nervous about making this assumption. The NumPy casting rules for a * x + b would give the result the largest dtype of any of a, x and b. Maybe we should similarly pick whichever of scale_factor, data and add_offset has the largest float dtype?
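The promotion described here can be checked directly with explicit dtypes (illustrative values):

import numpy as np

# dtypes of scale_factor (a), the packed data (x) and add_offset (b)
print(np.result_type(np.dtype('float64'),
                     np.dtype('float32'),
                     np.dtype('float32')))  # float64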
@shoyer I changed the implementation and took your comments into consideration.
@shoyer did you have a look at this?
xarray/coding/variables.py
Outdated
np.dtype(type(add_offset)),
np.dtype(dtype)]
# scaled_type should be the largest type we find
scaled_dtype = max(types)
Use xarray.core.dtypes.result_type() here instead of max(). I don't know how NumPy defines comparison between dtypes and I can't find it documented anywhere, so I'm a little nervous about relying on it.
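For reference, a minimal use of that helper with illustrative dtypes:

import numpy as np
from xarray.core import dtypes

types = [np.dtype('float32'), np.dtype('float64')]
print(dtypes.result_type(*types))  # float64, independent of list order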
@shoyer I've tested it with all numpy dtype pair combinations and it turns out that it always gives the largest; I can for sure modify it.
xarray/coding/variables.py
Outdated
# It is not advised to unpack an int into a float as there
# is a potential precision loss.

if mode is 'decoding':
Rather than a mode argument, could you write two separate functions, _choose_encoding_float_dtype and _choose_decoding_float_dtype? It would be OK for one to call the other, but mode arguments are a mild code smell.
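A rough sketch of what that split could look like; the signatures and rules are guesses based on this thread, not the PR's final code:

import numpy as np

def _choose_encoding_float_dtype(dtype, has_offset):
    # same spirit as the existing rule: small floats and small ints
    # without an offset fit into float32, everything else gets float64
    if np.issubdtype(dtype, np.floating) and dtype.itemsize <= 4:
        return np.float32
    if np.issubdtype(dtype, np.integer) and dtype.itemsize <= 2 and not has_offset:
        return np.float32
    return np.float64

def _choose_decoding_float_dtype(dtype, scale_factor, add_offset):
    # reuse the encoding rule, then widen it with the attribute dtypes
    types = [np.dtype(_choose_encoding_float_dtype(dtype, add_offset is not None))]
    for attr in (scale_factor, add_offset):
        if attr is not None:
            types.append(np.dtype(type(attr)))
    return np.result_type(*types)

print(_choose_decoding_float_dtype(np.dtype('int16'), np.float64(0.1), None))  # float64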
xarray/tests/test_backends.py
Outdated
@@ -685,7 +685,9 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
    with self.roundtrip(decoded) as actual:
        for k in decoded.variables:
            assert (decoded.variables[k].dtype
                    == actual.variables[k].dtype)
                    == actual.variables[k].dtype or
                    (decoded.variables[k].dtype == np.float32 and
It's not clear to me why this test needs to change. Can you give an example of what behavior changed?
@shoyer
Here is my understanding of what xarray does:
- In all cases encoding returns either np.float32 or np.float64.
- Encoding only cares about the existence of add_offset when the variable is an integer.
- If the variable is a float, then the encoded dtype is np.float32.
- If the variable is an integer with add_offset, the encoded dtype is np.float64.
- If the variable is an integer with no add_offset, the encoded dtype is np.float32.
- In all other cases the encoded dtype is np.float64.
Here is my understanding of what the CF conventions imply:
- Decoding is the equivalent of the unpacking mentioned in the CF conventions.
- In all cases decoding takes the encoded dtype, which is either np.float32 or np.float64.
- The decoded type is then the largest between scale_factor, add_offset and the encoded type, not the original variable.
That being said, in that test there are cases where the original type, scale_factor and add_offset are in the following configuration: np.float32, np.int64, np.float32. Then the encoded type is np.float32, as the variable is np.float32. When decoding, the largest (dtypes.result_type()) between the three (encoded, sf, ao) is np.float64, and that's why "decoded" (original) in the test is np.float32 while actual (roundtripped) is np.float64. A worked instance follows below.
In my next commit I took your remark into consideration and wrote out every possible case and what to expect as an encoded or decoded dtype.
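A worked instance of that configuration, using only the dtypes involved (illustrative):

import numpy as np

data_dtype = np.dtype('float32')             # original variable
sf_dtype = np.dtype(type(np.int64(10)))      # scale_factor
ao_dtype = np.dtype(type(np.float32(0.5)))   # add_offset

encoded = np.dtype('float32')                # float variables encode to float32 (rule above)
decoded = np.result_type(encoded, sf_dtype, ao_dtype)
print(decoded)  # float64, hence the float32 vs float64 mismatch in the test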
xarray/tests/test_coding.py
Outdated
# We make sure that roundtripped is larger than
# the original
assert roundtripped.dtype.itemsize >= original.dtype.itemsize
assert (roundtripped.dtype is np.dtype(np.float64)
nit: it's safer to use == instead of is for comparing dtype objects
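For example:

import numpy as np

x = np.zeros(3)
print(x.dtype == np.dtype(np.float64))  # True, the reliable comparison
print(x.dtype == 'float64')             # True, == also accepts dtype-like values
print(x.dtype is np.dtype(np.float64))  # often True in practice, but identity of
                                        # dtype objects is not something to rely on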
xarray/tests/test_coding.py
Outdated
np.int64(10), np.uint8(10),
np.uint16(10), np.uint32(10),
np.uint64(10), np.uint64(10)])
def test_scaling_according_to_cf_convention(dtype, scale_factor, add_offset):
This is a great set of invariant checks! But it would still be nice to see a more specific list of unit tests verifying that the decoded dtype is exactly as expected for a set of known inputs, e.g.,
@pytest.mark.parametrize(
    ('input_dtype', 'scale_factor', 'add_offset', 'expected_dtype'),
    [
        (np.float32, 1, 0, np.float32),
        (np.float32, 1, 10, np.float64),
        ...
    ]
)
def test_cfscaleoffset_decoding_dtype(input_dtype, scale_factor, add_offset, expected_dtype):
    original = xr.Variable(...)
    decoded = variables.CFScaleOffsetCoder().decode(original)
    assert decoded.dtype == expected_dtype
This would give us confidence that it's handling all the edge cases properly, e.g., only giving a bigger dtype when truly warranted.
@shoyer tests are failing but it doesn't seem to be coming from this PR; I saw the same error on other PRs as well, and my tests were working fine until I made a git pull.
@daoudjahdou thanks for sticking with this. I'm concerned about how we lose the "roundtrip" invariant in some edge cases, where the user did not specifically indicate the desired dtype but rather used a Python type. Right now xarray seems to use inconsistent rules for casting attribute types, based on the netCDF backend:
At least for
@shoyer do you mean that we consider that by default when we deal with Python's
Currently we cast to
That default behavior doesn't really make sense if the user provided builtin Python numbers for these attributes. For example, consider the data for 'x' if I write something like the following:
ds = xarray.Dataset({'x': np.float32(0)})
encoding = {'x': {'scale_factor': 1, 'add_offset': 1, 'dtype': 'int8'}}
ds.to_netcdf('test.nc', encoding=encoding)
With the current implementation in this PR, I think the data will come back as float64. To get something that will decode as float64 without the original data being float64, the user should need to intentionally choose dtypes for scale_factor and add_offset:
ds = xarray.Dataset({'x': np.float32(0)})
encoding = {'x': {'scale_factor': np.float64(1), 'add_offset': np.float64(1), 'dtype': 'int8'}}
ds.to_netcdf('test.nc', encoding=encoding)
Does this make sense to you? From my perspective, two goals for this change should be:
@shoyer sorry for the delayed response.
Right, we definitely don't want to inspect the data to verify if it fits.
@@ -217,9 +258,9 @@ class CFScaleOffsetCoder(VariableCoder):

def encode(self, variable, name=None):
    dims, data, attrs, encoding = unpack_for_encoding(variable)

    if 'scale_factor' in encoding or 'add_offset' in encoding:
Alternative logic for this clause that I think might work:
if 'scale_factor' in encoding or 'add_offset' in encoding:
    scale_factor = encoding.pop('scale_factor', 1)
    add_offset = encoding.pop('add_offset', 0)
    if 'dtype' in encoding:
        dtype = pop_to(encoding, attrs, 'dtype', name=name)
    else:
        dtype = _choose_encoding_float_dtype(data.dtype, add_offset != 0)
    # save scale_factor and add_offset with dtype matching the decoded
    # data, per CF conventions
    safe_setitem(attrs, 'scale_factor', data.dtype.type(scale_factor), name)
    safe_setitem(attrs, 'add_offset', data.dtype.type(add_offset), name)
    # apply scale/offset encoding
    data = data.astype(dtype=dtype, copy=True)
    if add_offset:
        data -= add_offset
    if scale_factor != 1:
        # multiplication is faster than division
        data *= 1 / scale_factor
The idea here is to always write the scale_factor and add_offset attributes according to CF conventions, which ensures that data always gets read out the right way. This does remove some user flexibility, but xarray is OK with being opinionated about how it writes metadata -- it is not a general purpose tool for writing completely flexible netCDF files.
if add_offset is None:
    types = (np.dtype(type(scale_factor)),
             np.dtype(dtype))
It would be nice if these five lines creating types handled add_offset and scale_factor in a symmetric way, e.g.,
types = [np.dtype(dtype)]
if scale_factor is not None:
    types.append(np.dtype(type(scale_factor)))
if add_offset is not None:
    types.append(np.dtype(type(add_offset)))
# It is not advised to unpack an int into a float as there
# is a potential precision loss.

if scale_factor or add_offset:
It's safer to use is not None for checking default values:
if scale_factor or add_offset:
if scale_factor is not None or add_offset is not None:
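A one-line illustration of why the truthiness check is risky here (0 is a legitimate offset value):

add_offset = 0.0
if add_offset:              # False, so the offset branch would be skipped
    print('offset applied')
if add_offset is not None:  # True, which is the intended behaviour
    print('offset applied')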
and np.issubdtype(scaled_dtype, np.floating)):
    return scaled_dtype

return _choose_encoding_float_dtype(dtype, add_offset is not None)
Potentially we could use the add_offset value here -- 0 and None are equivalent as far as picking an automatic float dtype is concerned, indicating "no offset":
return _choose_encoding_float_dtype(dtype, add_offset is not None)
return _choose_encoding_float_dtype(dtype, add_offset)
@shoyer I've tested the solution provided; it works like a charm with my tests, however many tests are broken in test_backends.py in cases where we lose precision. I'll give you more detail.
I've noted that the very recent ERA5 data is also cast to float32 at decoding, although the scale and offset params are float64. I think it would be a very valuable addition to xarray to do this properly: @daoudjahdou are you still willing to work on this PR? Otherwise I'll try to pick this up when time permits.
Ability to promote to float64 instead of float32 when dealing with int variables with scale_factor.
Added parameter force_promote_float64 (False by default) to open_dataset and open_dataarray that enables this behaviour when True.
whats-new.rst for all changes and api.rst for new API