Add example to docs of using memory-mapping #1245
@jakirkham I believe this can go under "usage tips".
@DON-BRAN, agree "usage tips" sounds like a good place to put this. Though feedback from anyone else interested is welcome 🙂
Playing around with this, I realized that each indexing statement on the zarr array causes the file to be opened, as opposed to creating one memory view that I could index into multiple times. This is an issue for my use case, where I want to grab an irregularly spaced set of contiguous chunks, which zarr does not expose an API for. It could also be useful to have an example of just getting the memory-mapped array out of a store for this case.
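For reference, here is a minimal sketch of that last idea, i.e. pulling one memory-mapped view of a chunk straight out of the store and indexing it many times. It assumes an uncompressed 1-D array in an on-disk `DirectoryStore`; the `example.zarr` path and the `X/data` array name are made up for illustration.

```python
import mmap
import os

import numpy as np
import zarr

# Assumed layout: an uncompressed 1-D array at "X/data" inside a DirectoryStore
# at "example.zarr". With no compressor, each chunk file is just the raw bytes
# of that chunk, so it can be mapped once and sliced many times.
store_path = "example.zarr"
z = zarr.open(os.path.join(store_path, "X", "data"), mode="r")  # only used for its dtype here

chunk_path = os.path.join(store_path, "X", "data", "0")  # first (here: only) chunk
with open(chunk_path, "rb") as fh:
    buf = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

# One lazily-paged view over the whole chunk; repeated indexing does not reopen the file.
mapped = np.frombuffer(buf, dtype=z.dtype)
rows = [mapped[10:20], mapped[100:150]]  # irregularly spaced contiguous slices
```

The mapping stays valid after the `with` block closes the file handle, since the `mmap` object keeps the mapping alive on its own.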
What about putting an `LRUStoreCache` in front?
I think the entire array gets loaded into memory during chunk processing (since it's just one chunk) on the first read there. Maybe this is addressable with a cache like `LRUStoreCache`? If it didn't hold everything in memory, would I have to be concerned about the file closing before the data is read? Especially if I used the stdlib `mmap`?

I've gotten something pretty performant using the mmap "chunk": https://gist.github.com/ivirshup/5c7df5ed10517abf6567a6a9af6c7eaa

Using the same case from the notebook (but using Python for both cases, instead of numba), I see about a 10x difference between going through the `zarr.Array` and grabbing the chunk directly:

```python
from itertools import accumulate, chain

import numpy as np
import zarr


def get_compressed_vectors_old(data, indices, indptr, row_idxs):
    # Gather the CSR rows selected by row_idxs into new data/indices/indptr buffers
    slices = [slice(indptr[i], indptr[i + 1]) for i in row_idxs]
    out_data = np.concatenate([data[s] for s in slices])
    out_indices = np.concatenate([indices[s] for s in slices])
    out_indptr = list(accumulate(chain((0,), (s.stop - s.start for s in slices))))
    return out_data, out_indices, out_indptr


# mmap_store, mmap_data, mmap_indices and indptr are defined in the linked gist
cache_store = zarr.storage.LRUStoreCache(mmap_store, max_size=None)
cache_group = zarr.group(cache_store)
```

```python
%%timeit
results = get_compressed_vectors_old(
    cache_group["X/data"],
    cache_group["X/indices"],
    cache_group["X/indptr"][:],
    np.arange(1000),
)
# 73.6 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit
results = get_compressed_vectors_old(
    mmap_data,
    mmap_indices,
    indptr,
    np.arange(1000),
)
# 6.61 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Was compression disabled? Otherwise it may need to load the whole array into memory to decompress it.
Ah whoops, posted my notebook, not the gist: https://gist.github.com/ivirshup/5c7df5ed10517abf6567a6a9af6c7eaa

Yes, compression was turned off.
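For context, the memory-mapping approach only pays off when the chunk files on disk are raw, uncompressed bytes. A minimal sketch of writing such data with the zarr v2 API used elsewhere in this thread; the array contents and the `example.zarr` path are made up:

```python
import numpy as np
import zarr

# Write the CSR components with compressor=None so each chunk file on disk is a
# plain contiguous buffer that can be memory-mapped directly.
store = zarr.DirectoryStore("example.zarr")
g = zarr.group(store=store, overwrite=True)
g.create_dataset("X/data", data=np.random.random(10_000), compressor=None)
g.create_dataset("X/indices", data=np.random.randint(0, 100, size=10_000), compressor=None)
g.create_dataset("X/indptr", data=np.arange(0, 10_001, 10), compressor=None)
```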
Right, so am suggesting using `LRUStoreCache`.
Ah yeah, I'd added that in the comment above, but I'll add it to the gist too. It's now under the "trying a cache" header. |
There are a couple avenues one could explore:

- One could use …
- One can create a memory-mapped store by creating a subclass like this (a minimal sketch follows below). We may want to add this to the docs. We may also want to graduate `_fromfile` to `fromfile`.

xref: #377 (comment)
xref: #377 (comment)
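For the second avenue, a minimal sketch of such a subclass, based on the approach referenced in the xrefs above (hence the interest in graduating `_fromfile`, which is currently a private hook of `DirectoryStore`):

```python
import mmap

import zarr


class MemoryMappedDirectoryStore(zarr.DirectoryStore):
    # DirectoryStore reads each chunk file fully into memory via _fromfile;
    # overriding it to return a memory-mapped buffer makes chunk reads lazy.
    def _fromfile(self, fn):
        with open(fn, "rb") as fh:
            return memoryview(mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ))


# Usage sketch (the path is hypothetical); only uncompressed arrays benefit,
# since compressed chunks still have to be decompressed into memory.
group = zarr.open_group(MemoryMappedDirectoryStore("example.zarr"), mode="r")
```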