Skip to content

Fix critical literal np.timedelta64 encoding bugs #10469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

spencerkclark
Copy link
Member

This PR fixes the critical np.timedelta64 encoding bugs introduced in #10101. I added a check to ensure that modifying the default _FillValue does not inadvertently mark non-missing values as missing as well.

cc: @shoyer @kmuehlbauer; sorry again for the trouble.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Spencer!

# Overwrite the on-disk dtype encoding, which is numeric, with
# the dtype attribute stored on disk, which corresponds to
# a timedelta64 dtype.
encoding["dtype"] = attrs.pop("dtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to set datetime64 here as the encoded dtype, because this information is preserved in the dtype of the resulting Variable. We should leave encoding['dtype'] as the raw numeric dtype on disk.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this a bit in the initial PR. Would we be open to renaming the attribute that encodes the timedelta64 dtype "xarray_dtype"? This way we would not need to worry about overwriting the numeric dtype in the encoding and can still retain information about the timedelta64 encoding method used.

I think it is potentially useful, at least while we still have two different methods of decoding timedelta64 values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is potentially useful, at least while we still have two different methods of decoding timedelta64 values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.

I think users probably do want new dtype attributes saved if they are re-writing data with Xarray. This makes their datasets forward compatible with future versions of Xarray, and silences the deprecation warning about decoding changing in the future, in a way that feels cleaner to me than adding decode_timedelta=True to every open_dataset() call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great, that should simplify things.

Comment on lines +356 to +358
timedeltas = np.array(
[0, np.iinfo(np.int32).min, np.iinfo(np.int64).min]
).astype("timedelta64[s]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest doing this test with NaT as well (or instead)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this test covers that since np.iinfo(np.int64).min is NaT. I'll write the constructor like this instead so that becomes clearer:

timedeltas = np.array([0, np.iinfo(np.int32).min, "NaT"], dtype="timedelta64[s]")

Comment on lines +121 to +128
raise ValueError(
f"Could not safely coerce default int64 _FillValue "
f"({default_int64_fill_value}) to the analogous int32 "
f"value ({default_int32_fill_value}), since it "
f"already exists as non-missing within variable "
f"{name!r}. Try explicitly setting "
f"encoding['_FillValue'] to another int32 value."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this error. We already have a check in coerce_nc3_dtype that the encoded int32 values match the original int64 values exactly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is needed to check something different, i.e. that the new fill value does not conflict with an existing value in the array. I believe the downstream check in coerce_nc3_dtype would pass in that circumstance since we replace the int64 fill value with the new int32 fill value upstream of it.

@shoyer
Copy link
Member

shoyer commented Jun 30, 2025

One other strategic question -- do we need this separate logic for decoding into an integer dtype that exactly matches the precision of the datetime64 units?

xarray/xarray/coding/times.py

Lines 1456 to 1487 in 4a581f4

else:
resolution, _ = np.datetime_data(variable.dtype)
dtype = np.int64
attrs_dtype = f"timedelta64[{resolution}]"
units = _numpy_dtype_to_netcdf_timeunit(variable.dtype)
safe_setitem(attrs, "dtype", attrs_dtype, name=name)
# Remove dtype encoding if it exists to prevent it from
# interfering downstream in NonStringCoder.
encoding.pop("dtype", None)
if any(
k in encoding for k in _INVALID_LITERAL_TIMEDELTA64_ENCODING_KEYS
):
raise ValueError(
f"Specifying 'add_offset' or 'scale_factor' is not "
f"supported when encoding the timedelta64 values of "
f"variable {name!r} with xarray's new default "
f"timedelta64 encoding approach. To encode {name!r} "
f"with xarray's previous timedelta64 encoding "
f"approach, which supports the 'add_offset' and "
f"'scale_factor' parameters, additionally set "
f"encoding['units'] to a unit of time, e.g. "
f"'seconds'. To proceed with encoding of {name!r} "
f"via xarray's new approach, remove any encoding "
f"entries for 'add_offset' or 'scale_factor'."
)
if "_FillValue" not in encoding and "missing_value" not in encoding:
encoding["_FillValue"] = np.iinfo(np.int64).min
data, units = encode_cf_timedelta(data, units, dtype)
safe_setitem(attrs, "units", units, name=name)
return Variable(dims, data, attrs, encoding, fastpath=True)

I am wondering because it seems like this logic is already covered in encode_cf_timedelta(), and adding this explicit check makes means that timedelta64[ns] data cannot be saved directly to netCDF3 files unless the units are converted first. timedelta64[ns] worked via encode_cf_timedelta as long as nanosecond precision isn't really needed, similar to how datetime64[ns] is OK. This was convenient for users of "human" time units.

Copy link
Member Author

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick feedback @shoyer!

I am wondering because it seems like this logic is already covered in encode_cf_timedelta(), and adding this explicit check makes means that timedelta64[ns] data cannot be saved directly to netCDF3 files unless the units are converted first. timedelta64[ns] worked via encode_cf_timedelta as long as nanosecond precision isn't really needed, similar to how datetime64[ns] is OK. This was convenient for users of "human" time units.

I agree this is annoying, and it seems like we should be able to relax it some. I'll think about it some more and see what I can do.

Comment on lines +121 to +128
raise ValueError(
f"Could not safely coerce default int64 _FillValue "
f"({default_int64_fill_value}) to the analogous int32 "
f"value ({default_int32_fill_value}), since it "
f"already exists as non-missing within variable "
f"{name!r}. Try explicitly setting "
f"encoding['_FillValue'] to another int32 value."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is needed to check something different, i.e. that the new fill value does not conflict with an existing value in the array. I believe the downstream check in coerce_nc3_dtype would pass in that circumstance since we replace the int64 fill value with the new int32 fill value upstream of it.

# Overwrite the on-disk dtype encoding, which is numeric, with
# the dtype attribute stored on disk, which corresponds to
# a timedelta64 dtype.
encoding["dtype"] = attrs.pop("dtype")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this a bit in the initial PR. Would we be open to renaming the attribute that encodes the timedelta64 dtype "xarray_dtype"? This way we would not need to worry about overwriting the numeric dtype in the encoding and can still retain information about the timedelta64 encoding method used.

I think it is potentially useful, at least while we still have two different methods of decoding timedelta64 values, to be able to distinguish between them for round-tripping purposes (i.e. so that we do not add new dtype attributes to files that did not have them previously). Maybe that should not be a priority though.

Comment on lines +356 to +358
timedeltas = np.array(
[0, np.iinfo(np.int32).min, np.iinfo(np.int64).min]
).astype("timedelta64[s]")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this test covers that since np.iinfo(np.int64).min is NaT. I'll write the constructor like this instead so that becomes clearer:

timedeltas = np.array([0, np.iinfo(np.int32).min, "NaT"], dtype="timedelta64[s]")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot load timedelta64 encoded data from disk Writing timedelta64 to netCDF3 always raises an error
2 participants