-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Group together decoding options into a single argument #4490
Comments
Totally in favour of this. Option 2 does seem like a good intermediate step. This proposal would make error handling easier (#3020) I agree that we should add CF to the names. Can we use |
works for me! |
I agree,
For both these proposals, we would lose the autocompletion with the tab but, on the other hand, the user would be relieved of managing a class. |
I see this as complementary to @alexamici's proposal in #4309 (comment), which I also like. I guess the main difference is moving Auto-completion is one reason to prefer a class, but enforced error checking and consistency between backends in the data model are also good reasons. In particular, it is important that users get an error if they mis-spell an argument name, e.g., I guess cfgrib is an example of a backend with its own CF decoding options? This is indeed a tricky design question. I don't know if it's possible to make |
@shoyer I favour option 2. as a stable solution, possibly with all arguments moved to keyword-only ones:
I'm for using the words decode/decoding but I'm against the use of CF. What backend will do is map the on-disk representation of the data (typically optimised for space) to the in-memory representation preferred by xarray (typically optimised of easy of use). This mapping is especially tricky for time-like variables. CF is one possible way to specify the map, but it is not the only one. Both the GRIB format and all the spatial formats supported by GDAL/rasterio can encode rich data and decoding has (typically) nothing to do with the CF conventions. My preferred meaning for the
Typically when a user asks the backend not to decode they intend to insepct the content of the data file to understand why something is not mapping in the expected way. As an example: in the case of GRIB time-like values are represented as integers like |
Taking into account the comments in this issue and the calls, I would propose this solution: #4547 |
I had a similar idea in #9379 but forgot to check this issue didn't already exist 🤦♂️ . A few points made there that weren't already made here:
It is unclear to me now whether or not the role of the decoding step needs to be CF-specific. Decoding according to some conventions is something that is enshrined for the Climate Forecast community, but definitely exists for other fields of science too. The ubiquity of netCDF means that it is not actually unusual to see e.g. plasma physics data with their own plasma-specific conventions stored in a netCDF file. Even the full definition of what variables are coordinate variables is mediated by the CF conventions, but other domains mights have different ideas about that. If you're opening Zarr then clearly you don't want to be constrained to follow the CF conventions. As all these kwargs are passed on to the public Though perhaps I'm missing the point here and we are conflating "decode data from format returned by backend to useful in-memory types" with "apply domain-specific metadata-dependent transformations". One could for example use a
I had suggested a |
throwing in another idea: what if instead of boolean flags we change xr.decode(ds, decoders: Sequence[str | callable] | None = ["cf"], disable: Sequence[str] | None = None) where the order in You'd use it like this: xr.decode(ds, decoders=["cf", custom_decoder_func], disable=["mask", "scale", "datetime"])
xr.decode(ds, decoders=["mask", "scale", custom_decoder_func]) where Then all |
I'm not sure what my previous comment was referring to but the current coders are CF-specific and so should be exposed with CF in the name IMO. I agree with using Also some of these "coders" can have complicated options: Example: https://cf-xarray.readthedocs.io/en/latest/coding.html#compression-by-gathering could decode to multiindex or sparse. We could encode/decode multiindex/sparse to a number of Discrete Sampling Geometry specifications etc. So using dataclasses seems like a good idea here. How about: xr.decode(ds: Dataset, coders: Sequence[Coder] | None = CFCoders()) with a helper |
I think the point is just that if we are going to refactor this we should take the opportunity to generalize to allow supporting non-CF decoders too.
I like this, this is very neat. Then is the idea is that |
and would |
yes to be clear:
for example.
Yes. The wrinkle is that we might want to set different options for different backends: example, zarr can roundtrip datetime64 just fine without any encoding. Perhaps this is another reason to keep the separate |
👍 for the suggestion It seems a good idea to check the metadata if a certain coder is applicable or not. Another thing, currently we do not differentiate One possible reason is that we can't mask integer values without casting to float (at least atm, until there will be nullable integers). |
Backends could, in theory, filter the coders they forward to |
Is your feature request related to a problem? Please describe.
open_dataset()
currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see #4477), which might need to know how to handle all these arguments.Describe the solution you'd like
To simple the interface, I propose to group together all the decoding options into a new
DecodingOptions
class. I'm thinking something like:The signature of
open_dataset
would then become:Question: are
decode
andDecodingOptions
the right names? Maybe these should still include the name "CF", e.g.,decode_cf
andCFDecodingOptions
, given that these are specific to CF conventions?Note: the current signature is
open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)
Usage with the new interface would look like
xr.open_dataset(filename, decode=False)
orxr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False))
.This requires a little bit more typing than what we currently have, but it has a few advantages:
open_dataset
interface. For example, I separated outmask
andscale
arguments, versus the currentmask_and_scale
argument.open_dataset()
needs to handle every option supported byopen_dataset()
, this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by thenon_defaults()
method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.Describe alternatives you've considered
For the overall approach:
open_dataset()
interface, and then internally convert into theDecodingOptions()
struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?The text was updated successfully, but these errors were encountered: