_merge_chunk_array could allocate memory more carefully #2038

Open
@rabernat

I was reviewing the Pipeline and have some observations. A key function is _merge_chunk_array:

def _merge_chunk_array(
    self,
    existing_chunk_array: NDBuffer | None,
    value: NDBuffer,
    out_selection: SelectorTuple,
    chunk_spec: ArraySpec,
    chunk_selection: SelectorTuple,
    drop_axes: tuple[int, ...],
) -> NDBuffer:
    if is_total_slice(chunk_selection, chunk_spec.shape) and value.shape == chunk_spec.shape:
        return value
    if existing_chunk_array is None:
        chunk_array = chunk_spec.prototype.nd_buffer.create(
            shape=chunk_spec.shape,
            dtype=chunk_spec.dtype,
            order=chunk_spec.order,
            fill_value=chunk_spec.fill_value,
        )
    else:
        chunk_array = existing_chunk_array.copy()  # make a writable copy
    if chunk_selection == () or is_scalar(value.as_ndarray_like(), chunk_spec.dtype):
        chunk_value = value
    else:
        chunk_value = value[out_selection]
        # handle missing singleton dimensions
        if drop_axes != ():
            item = tuple(
                None  # equivalent to np.newaxis
                if idx in drop_axes
                else slice(None)
                for idx in range(chunk_spec.ndim)
            )
            chunk_value = chunk_value[item]
    chunk_array[chunk_selection] = chunk_value
    return chunk_array

There are two issues with this implementation

  1. if existing_chunk_array is None: this code path always allocates a new empty array and then copies data from the original selection into it. Why? This seems like a great opportunity to avoid an extra memory copy. Can't we just use a view of the original data (value) when appropriate? This happens e.g. when writing to chunks which are smaller than the selection.
  2. The same code path always creates data of shape chunk_spec.shape, even for the last chunk of an array, which might be much smaller than the chunk shape.
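To make the first point concrete, here is a minimal sketch in plain NumPy rather than zarr's NDBuffer. The helper names merge_by_copy and merge_by_view are hypothetical and not part of zarr; they only contrast "allocate a full chunk, then copy into it" with "return a view of the incoming data".

```python
import numpy as np


def merge_by_copy(value, out_selection, chunk_shape, fill_value=0):
    # Roughly mimics the `existing_chunk_array is None` path:
    # allocate a fresh chunk-shaped array, then copy the selection into it.
    chunk = np.full(chunk_shape, fill_value, dtype=value.dtype)
    chunk[...] = value[out_selection]
    return chunk


def merge_by_view(value, out_selection):
    # Hypothetical alternative: basic slicing returns a view, so no new
    # buffer is allocated and no data is copied.
    return value[out_selection]


value = np.arange(200.0)      # incoming data spanning two 100-element chunks
sel = (slice(0, 100),)        # the part of `value` destined for chunk 0

copied = merge_by_copy(value, sel, (100,))
viewed = merge_by_view(value, sel)

print(np.shares_memory(value, copied))  # False: a second buffer was allocated
print(np.shares_memory(value, viewed))  # True: same memory as `value`
```

A view aliases the caller's array, so this shortcut would presumably only be safe where nothing downstream mutates the chunk in place; whether that holds in the codec pipeline is an open question here, not something this sketch establishes.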

An example which highlights both of these inefficiencies is as follows:

import numpy as np
import zarr
zarr.array(np.ones(101), chunks=(100,))
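
The arithmetic behind this example can be sketched without zarr. The helper chunk_usage below is hypothetical and written only for illustration; it assumes a 1-D array split into regular chunks and compares how many elements each chunk actually holds with the full chunk length that, per point 2 above, gets allocated.

```python
import math


def chunk_usage(array_len, chunk_len):
    # For each chunk, report (index, valid elements, elements allocated
    # if the buffer always has the full chunk length).
    n_chunks = math.ceil(array_len / chunk_len)
    rows = []
    for i in range(n_chunks):
        start = i * chunk_len
        stop = min(start + chunk_len, array_len)
        rows.append((i, stop - start, chunk_len))
    return rows


for index, valid, allocated in chunk_usage(101, 100):
    print(f"chunk {index}: {valid} valid of {allocated} allocated")
# chunk 0: 100 valid of 100 allocated
# chunk 1: 1 valid of 100 allocated
```

Assuming value is passed in as the whole 101-element buffer, its shape never equals the chunk shape, so neither chunk would take the early return: chunk 0 is allocated and copied even though a view would do, and chunk 1 gets a 100-element buffer for a single valid element.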
