-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore ability to specify _FillValue as Python native integers #9258
Conversation
Any chance someone could look at this before the next release as the current state of things would break my (and I assume many) users workflows regarding |
try: | ||
# user provided the on-disk signed fill | ||
new_fill = signed_dtype.type(attrs["_FillValue"]) | ||
except OverflowError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit ill-defined it seems, so let's raise SerializationWarning
to encourage users to provide the signed on-disk value.
Do we have the except
clause only for potentially-wrong backwards compatibile behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it now, the except
clause is probably the common case as far as xarray is concerned. That is, data loaded with xarray will use the decode logic which will use the unsigned integer version of the fill value. However, for anyone constructing their own DataArray/Dataset objects, there is a good chance of providing the signed version of the fill value (the try
clause).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I should have said, my original thinking in the related issue was that _FillValue
should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type. I too would like there to be only one way of doing it, but I'm not sure how likely that is. I suppose the current except
clause should be the preferred way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decode logic will treat -1
and 255
the same for on-disk int8
type (as intended).
my original thinking in the related issue was that _FillValue should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type.
CF says _FillValue must be the on-disk type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood @kmuehlbauer, the in-memory type is needed for masking decoding to work, but also said that maybe the unsigned and masking should be combined. I could also see an argument for it being wrong to have a _FillValue
that doesn't work for and match the data it is attached to (in-memory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct for the encoding pipeline. I can add a serialization warning for the case where the provided _FillValue
is not already the on-disk signed type.
For the decoding, you say that it already overwrites the _FillValue
after casting to unsigned. Which part overwrites _FillValue
? The CFMaskCoder
uses the unsigned _FillValue
and retains it in .encoding["_FillValue"]
from my tests with a real NetCDF file. I'm also just realizing that xarray .open_dataset
always converts a pure integer variable to a floating point variable and replaces fills with NaN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.
Won't this warn for python integers though?
The CFMaskCoder uses the unsigned _FillValue and retains it in .encoding["_FillValue"] from my tests with a real NetCDF file.
Ah yes this is correct, we don't reset the type after masking and that would be why we want to combine UnsignedIntegerCoder and CFMaskCoder.
I think we can combine these in a future PR
I'm also just realizing that xarray .open_dataset always converts a pure integer variable to a floating point variable and replaces fills with NaN.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.
Won't this warn for python integers though?
To be clear this would be in the encode pipeline and would only happen if the fill provided doesn't fit into the on-disk dtype.
In [9]: np.dtype(np.int8).type(255)
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[9], line 1
----> 1 np.dtype(np.int8).type(255)
OverflowError: Python integer 255 out of bounds for int8
In [10]: np.dtype(np.int8).type(-1)
Out[10]: np.int8(-1)
In [11]: np.__version__
Out[11]: '2.0.0'
So 255 -> int8 is error, -1 -> int8 is fine.
BUT this means anyone doing a xr.open_dataset
followed by a .to_netcdf
will get the serialization warning with the logic as is (main
and this PR) because the _FillValue
was made unsigned in the decode pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUT this means anyone doing a xr.open_dataset followed by a .to_netcdf will get the serialization warning with the logic as is (main and this PR) because the _FillValue was made unsigned in the decode pipeline.
OK good argument.
It seems to me like this is good to go then? And we can follow up with a PR that merges UnsignedCoder and MaskCoder, and handles the FillValue properly. Does that sound right to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the new serialization warning, yes, I think this is as I expect. Right now this PR should at least restore behavior. If anything, especially based on this discussion, this code is too flexible...but that's kind of how anyone claiming they make CF-compliant NetCDF files assumes things will work.
I'm going to merge to unblock the release. Right now roundtripping with |
* main: (54 commits) Adding `open_datatree` backend-specific keyword arguments (#9199) [pre-commit.ci] pre-commit autoupdate (#9202) Restore ability to specify _FillValue as Python native integers (#9258) add backend intro and how-to diagram (#9175) Fix copybutton for multi line examples in double digit ipython cells (#9264) Update signature for _arrayfunction.__array__ (#9237) Add encode_cf_datetime benchmark (#9262) groupby, resample: Deprecate some positional args (#9236) Delete ``base`` and ``loffset`` parameters to resample (#9233) Update dropna docstring (#9257) Grouper, Resampler as public api (#8840) Fix mypy on main (#9252) fix fallback isdtype method (#9250) Enable pandas type checking (#9213) Per-variable specification of boolean parameters in open_dataset (#9218) test push Added a space to the documentation (#9247) Fix typing for test_plot.py (#9234) Allow mypy to run in vscode (#9239) Revert "Test main push" ...
…monotonic-variable * main: (995 commits) Adding `open_datatree` backend-specific keyword arguments (pydata#9199) [pre-commit.ci] pre-commit autoupdate (pydata#9202) Restore ability to specify _FillValue as Python native integers (pydata#9258) add backend intro and how-to diagram (pydata#9175) Fix copybutton for multi line examples in double digit ipython cells (pydata#9264) Update signature for _arrayfunction.__array__ (pydata#9237) Add encode_cf_datetime benchmark (pydata#9262) groupby, resample: Deprecate some positional args (pydata#9236) Delete ``base`` and ``loffset`` parameters to resample (pydata#9233) Update dropna docstring (pydata#9257) Grouper, Resampler as public api (pydata#8840) Fix mypy on main (pydata#9252) fix fallback isdtype method (pydata#9250) Enable pandas type checking (pydata#9213) Per-variable specification of boolean parameters in open_dataset (pydata#9218) test push Added a space to the documentation (pydata#9247) Fix typing for test_plot.py (pydata#9234) Allow mypy to run in vscode (pydata#9239) Revert "Test main push" ...
See #9136 for the longer discussion of this. In #9136 numpy 2 compatibility was fixed by working around integer overflow errors when a
_FillValue
was specified in an unsigned dtype (ex.np.uint8(255)
) but needed to be cast to the signed dtype to be stored to a NetCDF file. This is all related to the use of the special_Unsigned
attribute. However, the referenced PR broke the ability to specify_FillValue
with native python integers (ex.-1
).This PR restores this behavior by handling both primary use cases:
_FillValue
corresponds to the unsigned in-memory data_FillValue
corresponds to the signed on-disk dataI'm sure there are at least 10 ways of interpreting what the best solution should be. This one I think causes the least amount of disruption and restores previous behavior as far as I can tell.
CC @dcherian @kmuehlbauer
whats-new.rst
api.rst