I was reviewing the Pipeline and have some observations. A key function is `_merge_chunk_array`:
zarr-python/src/zarr/codecs/pipeline.py
Lines 298 to 332 in b8baa68
There are two issues with this implementation:

1. `if existing_chunk_array is None:` this code path always allocates a new empty array and then copies data from the original selection into it. Why? This seems like a great opportunity to avoid an extra memory copy. Can't we just use a view of the original data (`values`) when appropriate? This happens, e.g., when writing to chunks which are smaller than the selection.
2. The same code path always creates data of shape `chunk_spec.shape`, even for the last chunk of an array, which might be much smaller than the chunk shape.
An example which highlights both of these inefficiencies is as follows:
import numpy as np
import zarr
zarr.array(np.ones(101), chunks=(100,))
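
To make the two points concrete without relying on zarr internals, here is a rough NumPy-only sketch. It is not the actual `_merge_chunk_array` code: `merge_with_copy` and `merge_with_view` are hypothetical helpers that only mimic the allocate-then-copy pattern described above and a possible view-based shortcut, assuming a 1-D array and a zero fill value.

```python
import numpy as np


def merge_with_copy(value, chunk_shape, fill_value=0.0):
    """Mimic the described pattern: allocate a full-size chunk, then copy into it."""
    chunk = np.full(chunk_shape, fill_value, dtype=value.dtype)
    # Copy the incoming data into the leading corner of the new chunk.
    chunk[tuple(slice(0, n) for n in value.shape)] = value
    return chunk


def merge_with_view(value, chunk_shape):
    """Hypothetical alternative: reuse the input when it already covers the chunk."""
    if value.shape == tuple(chunk_shape):
        return value  # no allocation, no copy
    return merge_with_copy(value, chunk_shape)


data = np.ones(101)
chunk_shape = (100,)

first = data[0:100]   # full first chunk; basic slicing returns a view
last = data[100:101]  # final chunk holds a single element

copied = merge_with_copy(first, chunk_shape)
viewed = merge_with_view(first, chunk_shape)
padded_last = merge_with_copy(last, chunk_shape)

print(np.shares_memory(copied, data))  # False: an extra copy was made
print(np.shares_memory(viewed, data))  # True: the original buffer is reused
print(padded_last.shape, last.shape)   # (100,) (1,): 100 slots for 1 element
```

In this sketch the first chunk is copied even though a view would suffice, and the last chunk is padded to 100 elements although only one carries data. Whether a view is safe in the real pipeline depends on whether downstream codecs mutate the buffer, which this sketch does not address.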