Description
Hi,
I have a use case where I need to insert a dataset that does not contain all the variables of the zcollection. This works properly most of the time, but the insertion fails when the inserted dataset does not carry all the dimensions stored in the collection.
When the insertion tries to add the missing variables to the inserted dataset, it fails because zcollection.Dataset does not support adding a variable whose dimensions are not already known to the dataset. Should we add dimension-extension support to zcollection.Dataset to solve this?
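Something along these lines, purely illustrative (add_dimension is not an existing zcollection method; the name, signature and dimension name below are mine):

# Hypothetical API sketch: declare a dimension on the dataset without a
# backing variable, so that variables using it can be added back later
# during insertion.
zds.add_dimension('num_pixels', 512)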
Code for reproducing the error
from __future__ import annotations
from typing import Iterator
import datetime
import pprint
import dask.distributed
import fsspec
import numpy
import zcollection
import zcollection.tests.data
def create_dataset() -> zcollection.Dataset:
    """Create a dataset to record."""
    generator: Iterator[zcollection.Dataset] = \
        zcollection.tests.data.create_test_dataset_with_fillvalue()
    return next(generator)

zds: zcollection.Dataset | None = create_dataset()
assert zds is not None
zds.to_xarray()

# In-memory filesystem and local Dask cluster for the example.
fs: fsspec.AbstractFileSystem = fsspec.filesystem('memory')
cluster = dask.distributed.LocalCluster(processes=False)
client = dask.distributed.Client(cluster)

# Collection partitioned by month on the 'time' variable.
partition_handler = zcollection.partitioning.Date(('time', ), resolution='M')
collection: zcollection.Collection = zcollection.create_collection(
    'time', zds, partition_handler, '/my_collection', filesystem=fs)

# Inserting a dataset reduced to 'time' alone raises the error: the
# missing variables cannot be added back because their dimensions are
# unknown to the inserted dataset.
collection.insert(zds.select_vars(['time']))
Workaround
For now, I preprocess the dataset by rebuilding it from scratch, adding carefully selected variables from the zcollection that carry the missing dimensions, and then dropping these variables to recover the original dataset with its new dimensions. This is not satisfactory for a non-delayed dataset, because creating the extra arrays adds unnecessary memory usage.
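A minimal sketch of that preprocessing step, under a few assumptions: the helper name insert_with_donor_dims and its donor_vars argument are mine, donor_vars must together cover every missing dimension, and Collection.load(selected_variables=...), Dataset.from_xarray, Dataset.variables and Dataset.select_vars behave as I understand them.

import xarray

def insert_with_donor_dims(
    collection: zcollection.Collection,
    zds: zcollection.Dataset,
    donor_vars: list[str],
) -> None:
    """Hypothetical helper: temporarily borrow variables that carry the
    dimensions missing from `zds`, then drop them again before insertion."""
    # Load only the donor variables from the collection; this is where
    # the extra arrays (and the extra memory usage) come from.
    donor = collection.load(selected_variables=donor_vars)
    assert donor is not None
    # Rebuild a dataset holding both the original and the donor variables,
    # so that every dimension of the collection is known.
    merged = zcollection.Dataset.from_xarray(
        xarray.merge([zds.to_xarray(), donor.to_xarray()]))
    # Keep only the original variables (the same select_vars call as in
    # the reproduction above); the rebuilt dataset retains the dimension
    # definitions, which is all the insertion needs.
    collection.insert(merged.select_vars(list(zds.variables)))

In the example above, the failing call would become insert_with_donor_dims(collection, zds.select_vars(['time']), donor_vars=[...]) with a donor variable defined on the missing dimensions.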