Zarr version
2.14.2
Numcodecs version
0.11.0
Python Version
3.10
Operating System
Linux
Installation
Pip with virtualenv
Description
I was using Zarr arrays as a grouped set of related NumPy arrays. I noticed that when I switched from plain NumPy, the in-memory write performance dropped significantly, so I disabled the compressor and chunking to remove any overhead I could find.
I have attached a short snippet profiled with line_profiler to demonstrate the basic case of just writing elements to an array: 99 percent of the time is spent writing to the Zarr array rather than to a NumPy array of the same size and shape.
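For a quicker check without line_profiler, here is a minimal timeit sketch of the same comparison (same shapes, compressor disabled; I omit the numbers since they vary by machine):

import timeit

import numpy as np
import zarr

z = zarr.zeros((200000, 100), chunks=False, compressor=None, dtype=np.float32)
n = np.zeros((200000, 100), dtype=np.float32)
row = np.random.random(100)

# cost of 100 single-row assignments to each array
print("numpy:", timeit.timeit(lambda: n.__setitem__(0, row), number=100))
print("zarr: ", timeit.timeit(lambda: z.__setitem__(0, row), number=100))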
Having looked at the source code for MemoryStore, I can see that the chunk is serialised to bytes and stored in a dictionary under the key "0.0", with the bytes as the value. I presume this mirrors the on-disk layout, but perhaps this round-trip through bytes is where it becomes so slow compared to NumPy.
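This is easy to see by poking at the store directly; a small sketch (the "0.0" key is my reading of the v2 chunk-key convention):

import numpy as np
import zarr

store = zarr.storage.MemoryStore()
z = zarr.zeros((200000, 100), chunks=False, store=store, compressor=None, dtype=np.float32)
z[0] = np.random.random(100)

print(list(store))         # e.g. ['.zarray', '0.0']: metadata plus the single chunk key
print(type(store["0.0"]))  # the whole chunk sits under one key as a serialised value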
Is this not a use case for Zarr? Is it optimised for reads instead? I understand if this is out of scope for Zarr arrays. Thank you for your time.
Steps to reproduce
import numcodecs
import numpy as np
import tqdm
import zarr

print(zarr.__version__)
print(numcodecs.__version__)

# One uncompressed chunk spanning the whole array, held in memory.
mem_store = zarr.storage.MemoryStore()
z_array = zarr.zeros(
    (200000, 100),
    chunks=False,
    store=mem_store,
    compressor=None,
    dtype=np.float32,
    write_empty_chunks=False,
)
np_array = np.zeros((200000, 100), dtype=np.float32)
print(z_array.info)

# @profile is injected by line_profiler; run with `kernprof -l scribble.py`.

@profile
def row_by_row():
    """Row by row."""
    for i in tqdm.trange(100):
        r_array = np.random.random(100)
        np_array[i] = r_array
        z_array[i] = r_array

@profile
def in_chunks():
    """In chunks."""
    for i in tqdm.trange(100):
        r_array = np.random.random((200, 100))
        np_array[:200] = r_array
        z_array[:200] = r_array

def main():
    """Run the main function."""
    row_by_row()
    in_chunks()

if __name__ == "__main__":
    main()
Additional output
2.14.2
0.11.0
Type               : zarr.core.Array
Data type          : float32
Shape              : (200000, 100)
Chunk shape        : (200000, 100)
Order              : C
Read-only          : False
Compressor         : None
Store type         : zarr.storage.MemoryStore
No. bytes          : 80000000 (76.3M)
No. bytes stored   : 231
Storage ratio      : 346320.3
Chunks initialized : 0/1
Wrote profile results to scribble.py.lprof
Timer unit: 1e-06 s
Total time: 6.58158 s
File: scribble.py
Function: row_by_row at line 19
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           @profile
    20                                           def row_by_row():
    21                                               """Row by row."""
    22       100      14604.7    146.0      0.2      for i in tqdm.trange(100):
    23       100       1267.1     12.7      0.0          r_array = np.random.random(100)
    24       100        456.9      4.6      0.0          np_array[i] = r_array
    25       100    6565249.6  65652.5     99.8          z_array[i] = r_array
Total time: 6.55283 s
File: scribble.py
Function: in_chunks at line 28
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           @profile
    29                                           def in_chunks():
    30                                               """In chunks."""
    31       100      13821.0    138.2      0.2      for i in tqdm.trange(100):
    32       100      13546.1    135.5      0.2          r_array = np.random.random((200, 100))
    33       100       1375.9     13.8      0.0          np_array[:200] = r_array
    34       100    6524084.0  65240.8     99.6          z_array[:200] = r_array
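For completeness, the best workaround I can think of is to stage writes in a plain NumPy buffer and assign to the Zarr array once, so the chunk round-trip through the store happens a single time. A minimal sketch (my assumption of the idiomatic fix, not something I profiled above):

import numpy as np
import zarr

z = zarr.zeros((200000, 100), chunks=False, compressor=None, dtype=np.float32)
buf = np.empty((200000, 100), dtype=np.float32)
for i in range(200000):
    buf[i] = np.random.random(100)
z[:] = buf  # one assignment, so the chunk is serialised once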