In Memory performance compared to NumPy much slower #1395

Open
@nuric

Zarr version

2.14.2

Numcodecs version

0.11.0

Python Version

3.10

Operating System

Linux

Installation

Pip with virtualenv

Description

I was using Zarr arrays as a grouped set of related NumPy arrays. After switching over from plain NumPy arrays, I noticed that in-memory performance dropped significantly. I disabled the compressor and chunking to remove any overhead I could find.

I attached a short snippet, profiled with line_profiler, to demonstrate the basic case of just writing elements to an array. Over 99 percent of the time in each function is spent on the write to the Zarr array (about 65 ms per write), while the write to the NumPy array of the same size and shape takes microseconds.

Having looked at the source code for MemoryStore, I can see that the chunk is serialised as bytes and stored in a dictionary under the key 0.0 with a bytes value. I presume this mirrors the filesystem layout, but it is perhaps where things become really slow compared to NumPy.
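To illustrate the presumption above, here is a toy model using only the standard library. It is not Zarr's implementation: the `store` dictionary, the `write_row_via_store` and `write_row_in_place` helpers, and the reduced array size are all assumptions made for the sketch. It only shows why patching a few bytes inside one large immutable value can cost far more than writing into a mutable buffer.

```python
import time

# Smaller than the array in this issue so the demo finishes quickly (assumption).
ROWS, COLS, ITEMSIZE = 20_000, 100, 4
ROW_BYTES = COLS * ITEMSIZE

# Hypothetical stand-in for a chunk store: one key, one immutable bytes value.
store = {"0.0": bytes(ROWS * ROW_BYTES)}

# Stand-in for the memory backing an in-place array such as a NumPy array.
buffer = bytearray(ROWS * ROW_BYTES)


def write_row_via_store(i, row):
    """Patch one row of a single-chunk array: copy out, modify, copy back."""
    chunk = bytearray(store["0.0"])  # full copy of the whole chunk
    chunk[i * ROW_BYTES:(i + 1) * ROW_BYTES] = row
    store["0.0"] = bytes(chunk)  # second full copy to get an immutable value


def write_row_in_place(i, row):
    """Patch one row directly in a mutable buffer: touches ROW_BYTES only."""
    buffer[i * ROW_BYTES:(i + 1) * ROW_BYTES] = row


def timed(write, n=100):
    """Return the seconds taken to write n rows with the given function."""
    row = b"\x01" * ROW_BYTES
    start = time.perf_counter()
    for i in range(n):
        write(i, row)
    return time.perf_counter() - start


row = b"\x01" * ROW_BYTES
store_seconds = timed(write_row_via_store)
in_place_seconds = timed(write_row_in_place)
print(f"via store: {store_seconds:.6f} s, in place: {in_place_seconds:.6f} s")
```

In this model the cost of `write_row_via_store` scales with the size of the whole chunk rather than with the size of the row being written, which would match the observation that one row and 200 rows take about the same time.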

Is this not a use case for Zarr? Is it optimised for reads instead? I understand if this is out of scope for Zarr arrays. Thank you for your time.

Steps to reproduce

import numcodecs
import numpy as np
import tqdm
import zarr

print(zarr.__version__)
print(numcodecs.__version__)

mem_store = zarr.storage.MemoryStore()
z_array = zarr.zeros(
    (200000, 100), chunks=False, store=mem_store, compressor=None, dtype=np.float32, write_empty_chunks=False
)
np_array = np.zeros((200000, 100), dtype=np.float32)

print(z_array.info)


@profile
def row_by_row():
    """Row by row."""
    for i in tqdm.trange(100):
        r_array = np.random.random(100)
        np_array[i] = r_array
        z_array[i] = r_array


@profile
def in_chunks():
    """In chunks."""
    for i in tqdm.trange(100):
        r_array = np.random.random((200, 100))
        np_array[:200] = r_array
        z_array[:200] = r_array


def main():
    """Run the main function."""
    row_by_row()
    in_chunks()


if __name__ == "__main__":
    main()

Additional output

2.14.2
0.11.0
Type               : zarr.core.Array
Data type          : float32
Shape              : (200000, 100)
Chunk shape        : (200000, 100)
Order              : C
Read-only          : False
Compressor         : None
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000000 (76.3M)
No. bytes stored   : 231
Storage ratio      : 346320.3
Chunks initialized : 0/1

Wrote profile results to scribble.py.lprof
Timer unit: 1e-06 s

Total time: 6.58158 s
File: scribble.py
Function: row_by_row at line 19

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           @profile
    20                                           def row_by_row():
    21                                               """Row by row."""
    22       100      14604.7    146.0      0.2      for i in tqdm.trange(100):
    23       100       1267.1     12.7      0.0          r_array = np.random.random(100)
    24       100        456.9      4.6      0.0          np_array[i] = r_array
    25       100    6565249.6  65652.5     99.8          z_array[i] = r_array

Total time: 6.55283 s
File: scribble.py
Function: in_chunks at line 28

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           @profile
    29                                           def in_chunks():
    30                                               """In chunks."""
    31       100      13821.0    138.2      0.2      for i in tqdm.trange(100):
    32       100      13546.1    135.5      0.2          r_array = np.random.random((200, 100))
    33       100       1375.9     13.8      0.0          np_array[:200] = r_array
    34       100    6524084.0  65240.8     99.6          z_array[:200] = r_array
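
As a rough sanity check on the numbers above, the following sketch relates the per-write time to the size of the single chunk. It assumes (this is not established by the profile) that each write moves data on the order of the whole chunk; the variable names are illustrative only.

```python
# Figures taken from the array info and profile output in this issue.
rows, cols, itemsize = 200_000, 100, 4  # float32 is 4 bytes per element
chunk_bytes = rows * cols * itemsize  # 80_000_000, matches "No. bytes"
row_bytes = cols * itemsize  # 400 bytes actually changed per row write

per_write_seconds = 65652.5e-6  # "Per Hit" for z_array[i] = r_array

# Throughput implied if each write touches the whole chunk once (assumption).
implied_bytes_per_s = chunk_bytes / per_write_seconds

print(f"chunk size: {chunk_bytes} bytes")
print(f"row size: {row_bytes} bytes")
print(f"implied throughput: {implied_bytes_per_s / 1e9:.2f} GB/s")
```

A throughput of roughly a gigabyte per second is in the range of a plain memory copy, which would be consistent with every write, whether of one row or 200 rows, processing the full 80 MB chunk.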

Labels: performance (potential issues with Zarr performance: I/O, memory, etc.)
