I think the idea of a virtualizarr backend is appealing. One way to implement it would be to load the actual data file while also creating and storing the byte references in the virtual dataset. That way xarray handles the dataset structure creation and data loading, and the byte references are just an add-on. This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.
As a side note, this might also simplify the reference creation process (#87). Instead of searching for attrs, encoding, dimension names, etc., the "chunk reader" would only need to create a low-level chunk manifest (path, offset, length); the rest of the information is retrieved from the netCDF file by xarray. I'm not sure whether that is actually a time-consuming part of reference creation, however.
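For illustration, the low-level manifest such a "chunk reader" might emit could look like this. The key format and field names here are assumptions (loosely following Zarr chunk keys), not a confirmed virtualizarr API:

```python
# Hypothetical sketch of the minimal per-chunk output of a "chunk reader".
# Attrs, encoding, and dimension names are deliberately NOT here --
# xarray would recover those by opening the netCDF file normally.
minimal_manifest = {
    "temp/0.0": {"path": "data.nc", "offset": 8192, "length": 1048576},
    "temp/0.1": {"path": "data.nc", "offset": 1056768, "length": 1048576},
}

# Every entry carries just the three byte-range fields and nothing else:
assert all(entry.keys() == {"path", "offset", "length"}
           for entry in minimal_manifest.values())
```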
It would also make it easy to inline data (#62).
Basically the idea is that `xr.open_dataset("data.nc", engine="virtualizarr")` loads the netCDF file normally but then also reads byte ranges and creates `ManifestArray`s. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays would be replaced with `ManifestArray`s unless top-level params such as `loadable_variables` and `cftime_variables` ask for the data.
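A minimal sketch of the variable-swapping step described above, using plain Python stand-ins (`ManifestEntry` and `virtualize` are hypothetical names for illustration, not virtualizarr APIs):

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    """Stand-in for a ManifestArray: just a byte-range reference."""
    path: str
    offset: int
    length: int

def virtualize(variables, manifests, loadable_variables=()):
    """Swap each variable's in-memory data for its chunk manifest,
    except variables the caller explicitly asked to load."""
    return {
        name: data if name in loadable_variables else manifests[name]
        for name, data in variables.items()
    }

variables = {"time": [0, 1, 2], "temp": [[1.0, 2.0], [3.0, 4.0]]}
manifests = {"time": ManifestEntry("data.nc", 100, 24),
             "temp": ManifestEntry("data.nc", 124, 32)}
ds = virtualize(variables, manifests, loadable_variables=("time",))
# "time" keeps its real values; "temp" becomes metadata-only.
```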
> This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.
It's not just about time; it's also about memory usage. Loading all the data up front would use ~1e6x as much RAM (assuming each chunk is 1MB). That's very wasteful if all we want to do is write out the metadata in a new form.
We also cannot do this when opening metadata-only representations (e.g. DMR++, or existing kerchunk JSON) without incurring a big performance hit, because we would have to also GET the original files in addition to the metadata files.
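To make the memory argument concrete, here is a back-of-the-envelope comparison; the reference size below is an illustrative assumption, and the exact factor depends on how references are actually stored:

```python
# A loaded chunk holds the actual bytes; a reference holds only a path
# plus two integers. Both sizes below are illustrative assumptions.
chunk_nbytes = 2**20                    # ~1 MiB of real chunk data
ref_nbytes = len(b"data.nc") + 8 + 8    # path string + offset + length

blowup = chunk_nbytes / ref_nbytes
# Several orders of magnitude more RAM per chunk when data is loaded,
# multiplied across every chunk of every variable in the dataset.
```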
> Since I don't think it's possible to have two data arrays within one variable,
It's not, by definition.
> perhaps all the data arrays would be replaced with `ManifestArray`s unless top-level params such as `loadable_variables` and `cftime_variables` ask for the data
If the reader creates metadata-only `ManifestArray`s but has the option to materialize some of them via `loadable_variables`, then that's what we have already; and if there is an additional way to materialize the `ManifestArray`s afterwards, then that's just the suggestion in #124.
I feel like your suggestion amounts to having a virtualizarr function with the signature `xr.Dataset[np.ndarray] -> xr.Dataset[ManifestArray]`. But that isn't possible, because the `np.ndarray` contains no knowledge of the file path it was loaded from.
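The point about `np.ndarray` can be seen directly: a plain numpy array records shape, dtype, and strides, but nothing about where its bytes came from. The attribute names probed below are illustrative guesses at what provenance metadata might be called, to show none of it exists:

```python
import numpy as np

arr = np.zeros((4, 4))

# An ndarray has no concept of an originating file, so there is nothing
# a Dataset[np.ndarray] -> Dataset[ManifestArray] function could read
# the byte ranges back out of.
provenance = [a for a in ("filename", "filepath", "source", "url")
              if hasattr(arr, a)]
# provenance == []
```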
Originally posted by @ayushnag in #157 (comment)