Description
Hi,
I have a use case where I need to insert a dataset that does not contain all the variables of the zcollection. This works properly most of the time, but the insertion fails when the inserted dataset does not carry all the dimensions stored in the collection.
When the insertion tries to add the missing variables to the inserted dataset, it fails because zcollection.Dataset does not support adding a variable whose dimensions are not already known to the dataset. Should we add dimension-extension support to zcollection.Dataset to solve this?
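Something along these lines, purely illustrative (add_dimension is not an existing zcollection method; the name, signature and dimension name below are mine):

# Hypothetical API sketch: declare a dimension on the dataset without a
# backing variable, so that variables using it can be added back later
# during insertion.
zds.add_dimension('num_pixels', 512)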
Code for reproducing the error
from __future__ import annotations
from typing import Iterator
import datetime
import pprint
import dask.distributed
import fsspec
import numpy
import zcollection
import zcollection.tests.data
def create_dataset() -> zcollection.Dataset:
    """Create a dataset to record."""
    generator: Iterator[zcollection.Dataset] = \
        zcollection.tests.data.create_test_dataset_with_fillvalue()
    return next(generator)

zds: zcollection.Dataset | None = create_dataset()
assert zds is not None
zds.to_xarray()

# In-memory filesystem and local Dask cluster for the example.
fs: fsspec.AbstractFileSystem = fsspec.filesystem('memory')
cluster = dask.distributed.LocalCluster(processes=False)
client = dask.distributed.Client(cluster)

# Collection partitioned by month on the 'time' variable.
partition_handler = zcollection.partitioning.Date(('time', ), resolution='M')
collection: zcollection.Collection = zcollection.create_collection(
    'time', zds, partition_handler, '/my_collection', filesystem=fs)

# Inserting a dataset reduced to 'time' alone raises the error: the
# missing variables cannot be added back because their dimensions are
# unknown to the inserted dataset.
collection.insert(zds.select_vars(['time']))
Workaround
For now, I preprocess the dataset by rebuilding it from scratch, adding carefully selected variables from the zcollection that carry the missing dimensions, and then dropping these variables to recover the original dataset with its new dimensions. This is not satisfactory for a non-delayed dataset, because creating the extra arrays adds unnecessary memory usage.
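A minimal sketch of that preprocessing step, under a few assumptions: the helper name insert_with_donor_dims and its donor_vars argument are mine, donor_vars must together cover every missing dimension, and Collection.load(selected_variables=...), Dataset.from_xarray, Dataset.variables and Dataset.select_vars behave as I understand them.

import xarray

def insert_with_donor_dims(
    collection: zcollection.Collection,
    zds: zcollection.Dataset,
    donor_vars: list[str],
) -> None:
    """Hypothetical helper: temporarily borrow variables that carry the
    dimensions missing from `zds`, then drop them again before insertion."""
    # Load only the donor variables from the collection; this is where
    # the extra arrays (and the extra memory usage) come from.
    donor = collection.load(selected_variables=donor_vars)
    assert donor is not None
    # Rebuild a dataset holding both the original and the donor variables,
    # so that every dimension of the collection is known.
    merged = zcollection.Dataset.from_xarray(
        xarray.merge([zds.to_xarray(), donor.to_xarray()]))
    # Keep only the original variables (the same select_vars call as in
    # the reproduction above); the rebuilt dataset retains the dimension
    # definitions, which is all the insertion needs.
    collection.insert(merged.select_vars(list(zds.variables)))

In the example above, the failing call would become insert_with_donor_dims(collection, zds.select_vars(['time']), donor_vars=[...]) with a donor variable defined on the missing dimensions.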