Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

TomNicholas · 2024-06-24T20:44:37Z

We might be able to replace some logic (especially that implemented in #156) with a call to xarray.decode_cf. This function can accept a Dataset and return a new dataset with different attributes and encoding etc. I think it's used internally in some of xarray's backends.

Also there is an idea to expose a corresponding xarray.encode_cf function in xarray, which we might also be able to use (see pydata/xarray#4412).

The text was updated successfully, but these errors were encountered:

TomNicholas · 2024-08-17T23:50:12Z

I've been looking at this more, in the context of fixing #189.

I think we could probably almost use xarray.decode_cf out of the box, with a few subtleties:

Most of the actual operations of decode_cf can be turned off via flags, which is what we want, except for the BooleanCoder usage here
decode_cf includes a bunch of more complicated logic for deciding which variables are coords, and we really need that to be the same as in xarray to guard against issues like open_virtual_dataset returns some coordinates as data variables #189.
We don't want the wrapping by LazilyIndexedArray here

Aligning what we're doing in virtualizarr with this would be another reason to implement a virtualizarr xarray backend engine, see #35.

cc @ayushnag

ayushnag · 2024-08-19T20:47:20Z

I think the idea of an virtualizarr backend is appealing. One idea for how it can be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. This way the dataset structure creation and data loading is handled by xarray, then the bytes are just an add on. This all hinges on if data loading and reference creation both doesn't take much extra time compared to doing just one.

As a side note, this might make the reference creation process (#87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc the "chunk reader" only needs to create a low level chunk manifest (bytes, offset, path). The rest of the information is retrieved from the netcdf by xarray. Not sure if that is an actually time consuming part of reference creation however.

It would also add the ability to easily inline data (#62)

Basically the idea is that xr.open_dataset("data.nc", engine="virtualizarr") loads the netcdf file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays will be replaced with ManifestArrays unless the top level params ask for data such as loadable_variables and cftime_variables

TomNicholas · 2024-08-20T17:27:26Z

Thanks @ayushnag . I think your suggestion is almost the opposite of what I'm suggesting doing in this issue, so I moved it to #221 to discuss there.

(Opposite in that I'm suggesting calling just part of xarray's open_dataset decoding logic from within open_virtual_dataset, whereas IIUC you are suggesting basically using all of xr.open_dataset and making it virtual as a final step.)

ayushnag · 2024-08-20T18:38:36Z

I think your suggestion makes sense and yes my suggestion was more related to the backend issue.

My only concern was that if the solution is to add this to open_virtual_dataset it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes. That's why I was suggesting opening the file and then adding references.

Although if decode_cf is the main portion left then the reimplementations are limited and your approach definitely makes more sense. Is the plan to add a modified copy of this function to virtualizarr?

TomNicholas · 2024-08-20T18:54:47Z

it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes.

I completely agree, which is why I've been looking into which parts of xarray we need. I think it's just decode_cf though (and actually only very limited parts of that).

Is the plan to add a modified copy of this function to virtualizarr?

Either that or make some small modifications to xarray upstream so that we can import and use xr.decode_cf directly. See also pydata/xarray#4490, which talks about making xarray's interface for decoding clearer.

TomNicholas added xarray Requires changes to xarray upstream encoding labels Jun 24, 2024

This was referenced Jun 24, 2024

Identifying coordinate variables #105

Closed

Identify non dimension coords #156

Merged

TomNicholas added the CF conventions label Jun 24, 2024

This was referenced Jun 24, 2024

How to handle encoding #68

Open

Option to interpret variables using cftime #117

Closed

TomNicholas mentioned this issue Aug 20, 2024

Xarray backend which loads data by default #221

Open

TomNicholas mentioned this issue Aug 23, 2024

Fix determination of which variables are coordinates #224

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

TomNicholas commented Jun 24, 2024

TomNicholas commented Aug 17, 2024

ayushnag commented Aug 19, 2024

TomNicholas commented Aug 20, 2024

ayushnag commented Aug 20, 2024

TomNicholas commented Aug 20, 2024

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Comments

TomNicholas commented Jun 24, 2024

TomNicholas commented Aug 17, 2024

ayushnag commented Aug 19, 2024

TomNicholas commented Aug 20, 2024

ayushnag commented Aug 20, 2024

TomNicholas commented Aug 20, 2024