insert() a partial dataset -> auto add dimension ? #11

Open
@robin-cls

Description

Hi,

I need to insert a dataset that does not contain all the variables of the zcollection. This works most of the time, but when the inserted dataset lacks some of the dimensions stored in the collection, I get the following error:

(Image: screenshot of the error message)

When the insertion tries to add the variables that are missing from the inserted dataset, it fails because zcollection.Dataset does not support adding a variable that uses an unknown dimension. Should we add dimension extension to zcollection.Dataset to solve this?
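To make the question concrete, here is a minimal sketch of the behaviour being discussed. It does not use zcollection at all: `MiniDataset`, `add_variable` and `add_dimension` are hypothetical names for a toy model, and the dimension names `num_lines` and `num_pixels` are only illustrative. It shows a dataset rejecting a variable with an unknown dimension, then accepting it once the dimension has been declared.

```python
from __future__ import annotations


class MiniDataset:
    """Toy stand-in for a dataset that tracks named dimensions (hypothetical)."""

    def __init__(self, dimensions: dict[str, int]) -> None:
        self.dimensions = dict(dimensions)
        self.variables: dict[str, tuple[str, ...]] = {}

    def add_variable(self, name: str, dims: tuple[str, ...]) -> None:
        # Reject any variable that refers to a dimension the dataset ignores.
        unknown = [dim for dim in dims if dim not in self.dimensions]
        if unknown:
            raise ValueError(
                f"variable {name!r} uses unknown dimensions: {unknown}")
        self.variables[name] = dims

    def add_dimension(self, name: str, size: int) -> None:
        # Hypothetical "dimension extension": declare a dimension up front,
        # refusing to silently change the size of an existing one.
        existing = self.dimensions.get(name)
        if existing is not None and existing != size:
            raise ValueError(
                f"dimension {name!r} already has size {existing}, not {size}")
        self.dimensions[name] = size


# A partial dataset that only knows about one dimension.
partial = MiniDataset({"num_lines": 10})
partial.add_variable("time", ("num_lines",))

# Adding a variable that needs a second dimension fails at first ...
try:
    partial.add_variable("var1", ("num_lines", "num_pixels"))
    failed = False
except ValueError:
    failed = True

# ... and succeeds once the missing dimension has been declared.
partial.add_dimension("num_pixels", 20)
partial.add_variable("var1", ("num_lines", "num_pixels"))
```

With something along these lines, the insertion could declare the collection's missing dimensions on the inserted dataset before adding the missing variables.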

Code for reproducing the error

from __future__ import annotations

from typing import Iterator
import datetime
import pprint

import dask.distributed
import fsspec
import numpy

import zcollection
import zcollection.tests.data

def create_dataset() -> zcollection.Dataset:
    """Create a dataset to record."""
    generator: Iterator[zcollection.Dataset] = \
        zcollection.tests.data.create_test_dataset_with_fillvalue()
    return next(generator)


zds: zcollection.Dataset | None = create_dataset()
assert zds is not None
zds.to_xarray()

fs: fsspec.AbstractFileSystem = fsspec.filesystem('memory')
cluster = dask.distributed.LocalCluster(processes=False)
client = dask.distributed.Client(cluster)

partition_handler = zcollection.partitioning.Date(('time', ), resolution='M')
collection: zcollection.Collection = zcollection.create_collection(
    'time', zds, partition_handler, '/my_collection', filesystem=fs)

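# Inserting only the 'time' variable triggers the error: the partial dataset
# does not carry all the dimensions stored in the collection.
collection.insert(zds.select_vars(['time']))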

Workaround

For now, I preprocess the dataset by rebuilding it from scratch and adding carefully selected variables from the zcollection that carry the missing dimensions. I then drop these variables to recover the original dataset with its new dimensions. This is unsatisfactory for a non-delayed dataset, because creating the new arrays uses memory unnecessarily.
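The selection step of this workaround can be sketched in plain Python. This is not my actual preprocessing code and it does not call zcollection: `missing_dimensions` and `pick_carrier_variables` are hypothetical helpers, and the variable layout below is an assumed example. Given the dimensions of each variable in the collection, it finds which dimensions the partial dataset lacks and picks a small set of variables to add temporarily so that every dimension is covered.

```python
from __future__ import annotations


def missing_dimensions(
    collection_vars: dict[str, tuple[str, ...]],
    inserted_names: list[str],
) -> set[str]:
    """Return the collection dimensions that the inserted variables lack."""
    present = {dim for name in inserted_names for dim in collection_vars[name]}
    every = {dim for dims in collection_vars.values() for dim in dims}
    return every - present


def pick_carrier_variables(
    collection_vars: dict[str, tuple[str, ...]],
    inserted_names: list[str],
) -> list[str]:
    """Greedily pick variables whose dimensions cover the missing ones."""
    remaining = missing_dimensions(collection_vars, inserted_names)
    candidates = sorted(set(collection_vars) - set(inserted_names))
    carriers: list[str] = []
    while remaining:
        # Take the candidate covering the most still-missing dimensions;
        # ties go to the first name in alphabetical order.
        best = max(
            candidates,
            key=lambda name: len(remaining & set(collection_vars[name])))
        carriers.append(best)
        remaining -= set(collection_vars[best])
    return carriers


# Assumed layout, for illustration only.
layout = {
    "time": ("num_lines",),
    "var1": ("num_lines", "num_pixels"),
    "var2": ("num_lines", "num_pixels"),
}
lacking = missing_dimensions(layout, ["time"])
carriers = pick_carrier_variables(layout, ["time"])
```

Each carrier variable still has to be materialised as an array before being dropped again, which is where the extra memory goes for a non-delayed dataset.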
