Dataset.encode_cf function #4412

Open
eric-czech opened this issue Sep 8, 2020 · 3 comments

@eric-czech

I would like to be able to apply CF encoding to an existing DataArray (or multiple in a Dataset) and then store the encoded forms elsewhere. Is this already possible?

More specifically, I would like to encode a large array of 32-bit floats as 8-bit ints and then write them to a Zarr store using rechunker.
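For context, CF packing of this kind stores `round((value - add_offset) / scale_factor)` as a small integer and recovers an approximation with `packed * scale_factor + add_offset`. The following is only a rough NumPy sketch of that arithmetic, not xarray's implementation: the helper names `packing_params`, `pack`, and `unpack` are hypothetical, and the formula for choosing the parameters from the data range is an assumption (it reserves no value for `_FillValue`).

```python
import numpy as np


def packing_params(data, n_bits=8):
    """Hypothetical helper: pick scale_factor/add_offset for signed n-bit packing."""
    lo, hi = float(np.min(data)), float(np.max(data))
    if hi == lo:
        # Constant data: any non-zero scale works, so avoid dividing by zero later.
        return 1.0, lo
    scale_factor = (hi - lo) / (2**n_bits - 1)
    # Centre the range so that lo maps to -2**(n_bits - 1) and hi to 2**(n_bits - 1) - 1.
    add_offset = lo + 2 ** (n_bits - 1) * scale_factor
    return scale_factor, add_offset


def pack(data, scale_factor, add_offset):
    # CF packing: packed = round((unpacked - add_offset) / scale_factor)
    return np.round((data - add_offset) / scale_factor).astype(np.int8)


def unpack(packed, scale_factor, add_offset):
    # CF unpacking: unpacked = packed * scale_factor + add_offset
    return packed * scale_factor + add_offset


data = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
scale_factor, add_offset = packing_params(data)
packed = pack(data, scale_factor, add_offset)
restored = unpack(packed, scale_factor, add_offset)
print(packed.dtype, float(np.max(np.abs(restored - data))))
```

The packing is lossy: each restored value can differ from the original by up to about half of `scale_factor`.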

I'm essentially after pangeo-data/rechunker#45 (Xarray support in rechunker), but in the meantime I'm looking for what functionality already exists in Xarray to make this possible.

@dcherian
Contributor

dcherian commented Sep 8, 2020

Not at the moment.

I think we should add an xr.encode_cf that wraps conventions.cf_encoder (this may have already come up in the "flexible backends" discussions). This would parallel xr.decode_cf.

def cf_encoder(variables, attributes):
    """
    Encode a set of CF encoded variables and attributes.
    Takes a dicts of variables and attributes and encodes them
    to conform to CF conventions as much as possible.
    This includes masking, scaling, character array handling,
    and CF-time encoding.

    Parameters
    ----------
    variables : dict
        A dictionary mapping from variable name to xarray.Variable
    attributes : dict
        A dictionary mapping from attribute name to value

    Returns
    -------
    encoded_variables : dict
        A dictionary mapping from variable name to xarray.Variable,
    encoded_attributes : dict
        A dictionary mapping from attribute name to value

    See also
    --------
    decode_cf_variable, encode_cf_variable
    """
    # add encoding for time bounds variables if present.
    _update_bounds_encoding(variables)

    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}

    # Remove attrs from bounds variables (issue #2921)
    for var in new_vars.values():
        bounds = var.attrs["bounds"] if "bounds" in var.attrs else None
        if bounds and bounds in new_vars:
            # see http://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries
            for attr in [
                "units",
                "standard_name",
                "axis",
                "positive",
                "calendar",
                "long_name",
                "leap_month",
                "leap_year",
                "month_lengths",
            ]:
                if attr in new_vars[bounds].attrs and attr in var.attrs:
                    if new_vars[bounds].attrs[attr] == var.attrs[attr]:
                        new_vars[bounds].attrs.pop(attr)

    return new_vars, attributes
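To illustrate what the function quoted above does with a single packed variable, here is a minimal sketch that calls it directly. It assumes `conventions.cf_encoder` is importable from `xarray.conventions` with the signature shown above; it is not part of the documented public API, so its behaviour may change between releases. The variable name `t` and the encoding values are made up for the example.

```python
import numpy as np
import xarray as xr
from xarray import conventions

# A float32 variable whose .encoding asks for int8 storage with a scale factor.
var = xr.Variable(
    ("x",),
    np.array([0.0, 1.5, -2.5], dtype="float32"),
    encoding={"dtype": "int8", "scale_factor": 0.5, "_FillValue": -128},
)

# cf_encoder takes plain dicts and returns plain dicts, not Datasets.
encoded_vars, encoded_attrs = conventions.cf_encoder({"t": var}, {"title": "demo"})
encoded = encoded_vars["t"]

print(encoded.dtype)         # storage dtype after encoding
print(encoded.values)        # packed integers
print(dict(encoded.attrs))   # scale_factor and _FillValue now live in attrs
```

The encoding keys that were consumed move from `.encoding` into `.attrs`, which is the form a backend would then write to disk.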

It'll also need to wrap this logic:

xarray/xarray/backends/api.py

Lines 1113 to 1127 in 66259d1

if encoding is None:
    encoding = {}

variables, attrs = conventions.encode_dataset_coordinates(dataset)

check_encoding = set()
for k, enc in encoding.items():
    # no need to shallow copy the variable again; that already happened
    # in encode_dataset_coordinates
    variables[k].encoding = enc
    check_encoding.add(k)

if encoder:
    variables, attrs = encoder(variables, attrs)

For simple use cases, you could write a small wrapper around conventions.cf_encoder that takes a Dataset and returns a Dataset, and it should work just fine (see conventions.decode_cf).
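A wrapper along those lines might look roughly like the sketch below. The name `encode_cf` and its `encoding` argument are hypothetical (no such function exists in xarray at the time of this thread), and it relies on `conventions.encode_dataset_coordinates` and `conventions.cf_encoder`, which are internal helpers rather than documented API. It covers only the simple case and skips the encoding validation that the backend writers perform.

```python
import numpy as np
import xarray as xr
from xarray import conventions


def encode_cf(ds, encoding=None):
    """Hypothetical wrapper (not an xarray API): Dataset in, CF-encoded Dataset out."""
    # Returns shallow copies of the variables, so `ds` itself is left untouched.
    variables, attrs = conventions.encode_dataset_coordinates(ds)
    # Optional per-variable overrides, mirroring the api.py logic quoted earlier.
    for name, enc in (encoding or {}).items():
        variables[name].encoding = enc
    variables, attrs = conventions.cf_encoder(variables, attrs)
    return xr.Dataset(variables, attrs=attrs)


ds = xr.Dataset({"t": ("x", np.array([0.0, 1.5, -2.5], dtype="float32"))})
encoded = encode_cf(
    ds,
    encoding={"t": {"dtype": "int8", "scale_factor": 0.5, "_FillValue": -128}},
)
# xr.decode_cf is the existing inverse and should recover the original values.
roundtrip = xr.decode_cf(encoded)
print(encoded["t"].dtype, roundtrip["t"].values)
```

One caveat with this approach: writing `encoded` through `to_zarr` or `to_netcdf` runs the encoding step again, which is the motivation for the `encode_cf=False` request mentioned later in the thread.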

@eric-czech
Author

Ok thanks @dcherian! I'll try that (feel free to close this).

@dcherian
Contributor

dcherian commented May 10, 2023

Related request for to_zarr(..., encode_cf=False): #5405

This came up in the discussion today.

cc @tom-white @kmuehlbauer
