_merge_chunk_array could allocate memory more carefully

I was reviewing the codec pipeline and have some observations. A key function is `_merge_chunk_array`:

https://github.com/zarr-developers/zarr-python/blob/b8baa6868c8fa95e6b2948c9fd9c725331ca23ec/src/zarr/codecs/pipeline.py#L298-L332

There are two issues with this implementation:

1. `if existing_chunk_array is None:` this code path always allocates a new empty array and then copies data from the original selection into it. Why? This seems like a great opportunity to avoid an extra memory copy. Can't we just use a view of the original data (`values`) when appropriate? This happens e.g. when writing to chunks which are smaller than the selection.
2. The same code path always creates data of shape `chunk_spec.shape`, _even for the last chunk of an array_, which might be much smaller than the chunk shape.
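
To make point 1 concrete, here is a minimal NumPy-only sketch of the difference between allocate-and-copy and taking a view. It does not use zarr at all; the names `values`, `chunk_len`, `copied` and `view` are made up for illustration, and plain `ndarray`s stand in for `NDBuffer`.

```python
import numpy as np

values = np.ones(250)   # stand-in for the data being written
chunk_len = 100         # hypothetical chunk length

# Allocate-and-copy: roughly the pattern of the `existing_chunk_array is None` path.
copied = np.empty(chunk_len, dtype=values.dtype)
copied[:] = values[0:chunk_len]

# Zero-copy alternative: basic slicing returns a view onto `values`.
view = values[0:chunk_len]

print(np.shares_memory(values, copied))  # False: a second buffer was allocated
print(np.shares_memory(values, view))    # True: no new data buffer
```

A view is only safe if nothing downstream mutates the chunk in place, so whether this applies to the real pipeline depends on how the codecs treat their input.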

An example which highlights both of these inefficiencies is as follows:

```python
import numpy as np
import zarr
zarr.array(np.ones(101), chunks=(100,))
```
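
For reference, the chunk arithmetic behind this example can be sketched in plain Python. The helper `chunk_extents` is hypothetical (it is not part of zarr) and only handles one axis; it shows that the second chunk holds a single valid element, while a full-size chunk buffer would hold 100.

```python
import math


def chunk_extents(array_len: int, chunk_len: int) -> list[tuple[int, int, int]]:
    """Return (start, stop, n_valid) for each chunk along one axis."""
    n_chunks = math.ceil(array_len / chunk_len)
    extents = []
    for i in range(n_chunks):
        start = i * chunk_len
        stop = min(start + chunk_len, array_len)  # clamp the last chunk to the array edge
        extents.append((start, stop, stop - start))
    return extents


for start, stop, n_valid in chunk_extents(101, 100):
    padding = 100 - n_valid  # elements of a full-size buffer that hold no array data
    print(f"chunk [{start}:{stop}] valid={n_valid} padding={padding}")
# chunk [0:100] valid=100 padding=0
# chunk [100:101] valid=1 padding=99
```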



	def _merge_chunk_array(
	    self,
	    existing_chunk_array: NDBuffer | None,
	    value: NDBuffer,
	    out_selection: SelectorTuple,
	    chunk_spec: ArraySpec,
	    chunk_selection: SelectorTuple,
	    drop_axes: tuple[int, ...],
	) -> NDBuffer:
	    if is_total_slice(chunk_selection, chunk_spec.shape) and value.shape == chunk_spec.shape:
	        return value
	    if existing_chunk_array is None:
	        chunk_array = chunk_spec.prototype.nd_buffer.create(
	            shape=chunk_spec.shape,
	            dtype=chunk_spec.dtype,
	            order=chunk_spec.order,
	            fill_value=chunk_spec.fill_value,
	        )
	    else:
	        chunk_array = existing_chunk_array.copy()  # make a writable copy
	    if chunk_selection == () or is_scalar(value.as_ndarray_like(), chunk_spec.dtype):
	        chunk_value = value
	    else:
	        chunk_value = value[out_selection]
	        # handle missing singleton dimensions
	        if drop_axes != ():
	            item = tuple(
	                None  # equivalent to np.newaxis
	                if idx in drop_axes
	                else slice(None)
	                for idx in range(chunk_spec.ndim)
	            )
	            chunk_value = chunk_value[item]
	    chunk_array[chunk_selection] = chunk_value
	    return chunk_array
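
One possible direction, sketched with plain NumPy arrays rather than `NDBuffer`: return a view of the input whenever there is no existing chunk and the selection covers the whole (possibly clamped) chunk. Everything here is an assumption for illustration, not a proposed patch: `covers_whole_chunk` and `merge_chunk_sketch` are hypothetical names, `drop_axes` and scalar handling are left out, and the caller is assumed to pass the clamped shape for edge chunks.

```python
import numpy as np


def covers_whole_chunk(chunk_selection, chunk_shape):
    """True if every entry is a slice spanning its full chunk dimension."""
    return len(chunk_selection) == len(chunk_shape) and all(
        isinstance(s, slice) and s.indices(n) == (0, n, 1)
        for s, n in zip(chunk_selection, chunk_shape)
    )


def merge_chunk_sketch(existing, value, out_selection, chunk_shape,
                       chunk_selection, fill_value=0):
    if existing is None and covers_whole_chunk(chunk_selection, chunk_shape):
        # Nothing to merge with: hand back a view, no allocation, no copy.
        return value[out_selection]
    if existing is None:
        chunk = np.full(chunk_shape, fill_value, dtype=value.dtype)
    else:
        chunk = existing.copy()  # writable copy, as in the original
    chunk[chunk_selection] = value[out_selection]
    return chunk


values = np.ones(101)

# First chunk: fully covered, so the result is a view of `values`.
first = merge_chunk_sketch(None, values, (slice(0, 100),), (100,), (slice(0, 100),))

# Last chunk with the full chunk shape: a 100-element buffer for 1 valid element.
last_full = merge_chunk_sketch(None, values, (slice(100, 101),), (100,), (slice(0, 1),))

# Last chunk with the shape clamped to the array edge: a 1-element view.
last_clamped = merge_chunk_sketch(None, values, (slice(100, 101),), (1,), (slice(0, 1),))
```

Whether returning a view is acceptable depends on downstream code never writing into the chunk in place, and whether edge chunks can be written at a clamped shape depends on what the store and codecs expect; both are open questions here.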

_merge_chunk_array could allocate memory more carefully #2038
