This issue, but for icechunk: zarr-developers/VirtualiZarr#132

I was originally planning to virtualize this [C]Worthy dataset and save the references using the kerchunk parquet format, but the timelines have now changed such that both icechunk and the [C]Worthy OAE atlas are planned to release on the same day (Oct 15th 2024)! So I could use icechunk's format instead (or just write both)...

I think it's pretty unlikely that virtualizing with icechunk happens by then (I have enough work to do just releasing the un-virtualized version of the dataset), but I need to do all of this by December anyway, because I submitted it as a talk to AGU 🙃 Regardless of the timing, this dataset is a good real-world test case for icechunk; as I said in zarr-developers/VirtualiZarr#132:
If we can virtualize this we should be able to virtualize most things 💪
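For context, the kerchunk parquet route mentioned above is already expressible via VirtualiZarr's accessor. A minimal sketch (the input and output paths are hypothetical placeholders, and `format="parquet"` assumes the accessor's parquet writer):

```python
# Minimal sketch: open one file's metadata as a virtual dataset and
# serialize its chunk references in kerchunk's parquet format.
# Paths are hypothetical placeholders.
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset("/data/cworthy/member_0001.nc", indexes={})
vds.virtualize.to_kerchunk("cworthy_refs.parquet", format="parquet")
```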
Wishlist:

- Writing virtual references efficiently in bulk (I have ~10 arrays of ~500k virtual chunks each to write, and another ~30 arrays with fewer chunks. No nested groups; everything lives in a single group.)
- Doing the big metadata extraction at scale over all ~500k files. There are clever ways we could do this as a parallel tree-reduction (dask.bag, cubed; see the sketch below), but getting it done at all is the prerequisite.
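A minimal sketch of that tree-reduction idea with dask.bag, assuming the files concatenate along a "time" dimension; the glob pattern, partition count, and `split_every` are all placeholder choices:

```python
# Sketch: extract references from many files in parallel, then combine
# them with a tree reduction rather than one giant serial concat.
import glob

import dask.bag as db
import xarray as xr
from virtualizarr import open_virtual_dataset

paths = sorted(glob.glob("/data/cworthy/*.nc"))  # hypothetical paths

# One virtual dataset (references + metadata only, no chunk data) per file
bag = db.from_sequence(paths, npartitions=100)
virtual = bag.map(lambda p: open_virtual_dataset(p, indexes={}))

# fold with split_every gives a tree reduction: datasets are concatenated
# pairwise up a tree instead of all at once on a single worker
combined = virtual.fold(
    binop=lambda a, b: xr.concat(
        [a, b], dim="time", coords="minimal", compat="override"
    ),
    split_every=8,
).compute()
```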
Should also work with virtual data. Typical CF datasets use an int as the raw array dtype plus attributes like units: "days since X", which Xarray / cftime decode into Python datetimes; there is no native datetime type in netCDF.
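A toy sketch of that decoding step, using made-up values:

```python
# Minimal sketch of CF time decoding: a raw int array plus a
# "days since ..." units attribute becomes datetimes on decode.
import numpy as np
import xarray as xr

raw = xr.Dataset(
    coords={
        "time": (
            "time",
            np.arange(3, dtype=np.int64),
            {"units": "days since 2000-01-01"},
        )
    }
)
decoded = xr.decode_cf(raw)
print(decoded.time.values)
# np.datetime64 values for the standard calendar; non-standard
# calendars (e.g. "noleap") decode to cftime objects instead
```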