DOC: create_array(..., data=,...) #2809

Open
@DerWeh

Description

Describe the issue linked to the documentation

I am very confused about the data argument of create_array. A common use case is to simply serialize an in-memory array, in which case I tend to pass it as data=in_memory_array. However, I cannot find the data argument in the documentation.

Inspecting the signatures in IPython, on the other hand, zarr.create_array clearly has the argument, while zarr.Group.create_array doesn't seem to expose it. I am quite confused about the discrepancy. If this is intentional, please document it.
LLMs also suggest that

zarr.create_array("store.zarr", data=in_memory_data)

is more efficient than

arr = zarr.create_array("store.zarr", shape=in_memory_data.shape, dtype=in_memory_data.dtype)
arr[...] = in_memory_data

I have no idea whether this is true or not. zarr.create_array(..., data=in_memory_data) might indeed be more efficient, as it seems to write the data asynchronously. But the documentation is quite lacking on what the best practice is.


This might be a bit out of scope for this issue, so please tell me if it is. But from the documentation, I don't really see how to leverage the asynchronous nature of the zarr implementation. A common pattern I encounter is that data is generated in parallel using multiprocessing (as it is CPU-bound) and persisted using zarr (probably disk-bound). Is there a preferred pattern for using zarr as an asynchronous sink for the generated data? If so, it would be great to include it in the docs.
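
For reference, the pattern I mean looks roughly like the sketch below. This is only one possible arrangement, not a pattern confirmed by the zarr documentation. make_block and fill_from_workers are hypothetical names, the random data stands in for the real CPU-bound work, and the sink can be any object with NumPy-style slice assignment, such as a zarr array. All writes happen in the parent process, so there is only ever one writer.

```python
from concurrent.futures import Executor, as_completed

import numpy as np


def make_block(index: int, block_len: int) -> tuple[int, np.ndarray]:
    """Hypothetical CPU-bound generator: returns its index and one block of data."""
    rng = np.random.default_rng(seed=index)
    return index, rng.random(block_len)


def fill_from_workers(sink, executor: Executor, n_blocks: int, block_len: int) -> None:
    """Generate blocks on `executor` and write each one into `sink` as it completes.

    `sink` is anything supporting NumPy-style slice assignment (e.g. a zarr array).
    Writes are issued one at a time from the calling process.
    """
    futures = [executor.submit(make_block, i, block_len) for i in range(n_blocks)]
    for future in as_completed(futures):
        index, block = future.result()
        start = index * block_len
        sink[start:start + block_len] = block


# Usage sketch with worker processes (run as a script so workers can import make_block):
#
#     import zarr
#     from concurrent.futures import ProcessPoolExecutor
#
#     if __name__ == "__main__":
#         arr = zarr.create_array("store.zarr", shape=(8 * 1000,), dtype="float64")
#         with ProcessPoolExecutor() as pool:
#             fill_from_workers(arr, pool, n_blocks=8, block_len=1000)
```

Each block is written with a plain synchronous assignment, so this does not use zarr's asynchronous internals directly; whether there is a better way is exactly the question.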

Suggested fix for documentation

No response

Labels: documentation, help wanted
