Description
Describe the issue linked to the documentation
I am very confused about the data argument of create_array. A common use case is simply to serialize an in-memory array, in which case I tend to pass it as data=in_memory_array. However, I cannot find the data argument in the documentation.
Using IPython, on the other hand, I can see that zarr.create_array clearly has the argument, while zarr.Group.create_array doesn't seem to expose it. I am quite confused about the discrepancy. If this is intentional, please document it.
An LLM also suggests that
zarr.create_array("store.zarr", data=in_memory_data)
is more efficient than
arr = zarr.create_array("store.zarr", shape=in_memory_data.shape, dtype=in_memory_data.dtype)
arr[...] = in_memory_data
I have no idea whether this is true. zarr.create_array(..., data=in_memory_data) might indeed be more efficient, as the data seems to be written asynchronously, but the documentation does not say what the best practice is.
This might be a bit out of scope for this issue, so please tell me if it is. From the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU bound) and persisted using zarr (probably disk bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.
Suggested fix for documentation
No response