-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix critical literal np.timedelta64
encoding bugs
#10469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Spencer!
# Overwrite the on-disk dtype encoding, which is numeric, with | ||
# the dtype attribute stored on disk, which corresponds to | ||
# a timedelta64 dtype. | ||
encoding["dtype"] = attrs.pop("dtype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to set datetime64
here as the encoded dtype, because this information is preserved in the dtype of the resulting Variable
. We should leave encoding['dtype']
as the raw numeric dtype on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on this a bit in the initial PR. Would we be open to renaming the attribute that encodes the timedelta64
dtype "xarray_dtype"
? This way we would not need to worry about overwriting the numeric dtype in the encoding
and can still retain information about the timedelta64
encoding method used.
I think it is potentially useful, at least while we still have two different methods of decoding timedelta64
values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is potentially useful, at least while we still have two different methods of decoding
timedelta64
values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.
I think users probably do want new dtype
attributes saved if they are re-writing data with Xarray. This makes their datasets forward compatible with future versions of Xarray, and silences the deprecation warning about decoding changing in the future, in a way that feels cleaner to me than adding decode_timedelta=True
to every open_dataset()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok great, that should simplify things.
timedeltas = np.array( | ||
[0, np.iinfo(np.int32).min, np.iinfo(np.int64).min] | ||
).astype("timedelta64[s]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest doing this test with NaT
as well (or instead)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this test covers that since np.iinfo(np.int64).min
is NaT
. I'll write the constructor like this instead so that becomes clearer:
timedeltas = np.array([0, np.iinfo(np.int32).min, "NaT"], dtype="timedelta64[s]")
raise ValueError( | ||
f"Could not safely coerce default int64 _FillValue " | ||
f"({default_int64_fill_value}) to the analogous int32 " | ||
f"value ({default_int32_fill_value}), since it " | ||
f"already exists as non-missing within variable " | ||
f"{name!r}. Try explicitly setting " | ||
f"encoding['_FillValue'] to another int32 value." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this error. We already have a check in coerce_nc3_dtype
that the encoded int32 values match the original int64 values exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is needed to check something different, i.e. that the new fill value does not conflict with an existing value in the array. I believe the downstream check in coerce_nc3_dtype
would pass in that circumstance since we replace the int64 fill value with the new int32 fill value upstream of it.
One other strategic question -- do we need this separate logic for decoding into an integer dtype that exactly matches the precision of the datetime64 units? Lines 1456 to 1487 in 4a581f4
I am wondering because it seems like this logic is already covered in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick feedback @shoyer!
I am wondering because it seems like this logic is already covered in
encode_cf_timedelta()
, and adding this explicit check makes means thattimedelta64[ns]
data cannot be saved directly to netCDF3 files unless the units are converted first.timedelta64[ns]
worked viaencode_cf_timedelta
as long as nanosecond precision isn't really needed, similar to howdatetime64[ns]
is OK. This was convenient for users of "human" time units.
I agree this is annoying, and it seems like we should be able to relax it some. I'll think about it some more and see what I can do.
raise ValueError( | ||
f"Could not safely coerce default int64 _FillValue " | ||
f"({default_int64_fill_value}) to the analogous int32 " | ||
f"value ({default_int32_fill_value}), since it " | ||
f"already exists as non-missing within variable " | ||
f"{name!r}. Try explicitly setting " | ||
f"encoding['_FillValue'] to another int32 value." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is needed to check something different, i.e. that the new fill value does not conflict with an existing value in the array. I believe the downstream check in coerce_nc3_dtype
would pass in that circumstance since we replace the int64 fill value with the new int32 fill value upstream of it.
# Overwrite the on-disk dtype encoding, which is numeric, with | ||
# the dtype attribute stored on disk, which corresponds to | ||
# a timedelta64 dtype. | ||
encoding["dtype"] = attrs.pop("dtype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on this a bit in the initial PR. Would we be open to renaming the attribute that encodes the timedelta64
dtype "xarray_dtype"
? This way we would not need to worry about overwriting the numeric dtype in the encoding
and can still retain information about the timedelta64
encoding method used.
I think it is potentially useful, at least while we still have two different methods of decoding timedelta64
values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.
timedeltas = np.array( | ||
[0, np.iinfo(np.int32).min, np.iinfo(np.int64).min] | ||
).astype("timedelta64[s]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this test covers that since np.iinfo(np.int64).min
is NaT
. I'll write the constructor like this instead so that becomes clearer:
timedeltas = np.array([0, np.iinfo(np.int32).min, "NaT"], dtype="timedelta64[s]")
This PR fixes the critical
np.timedelta64
encoding bugs introduced in #10101. I added a check to ensure that modifying the default_FillValue
does not inadvertently mark non-missing values as missing as well.whats-new.rst
cc: @shoyer @kmuehlbauer; sorry again for the trouble.