
Add example to docs of using memory-mapping #1245

Open
jakirkham opened this issue Nov 2, 2022 · 11 comments
Labels
documentation Improvements to the documentation good-first-issue Good place to get started as a new contributor.

Comments


jakirkham commented Nov 2, 2022

One can create a memory-mapped store by subclassing DirectoryStore like this. We may want to add this to the docs. We may also want to graduate _fromfile to a public fromfile.

import mmap

from zarr.storage import DirectoryStore

class MemoryMappedDirectoryStore(DirectoryStore):
    def _fromfile(self, fn):
        # Return a memory-mapped view of the chunk file instead of reading it into memory
        with open(fn, "rb") as fh:
            return memoryview(mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ))

xref: #377 (comment)
xref: #377 (comment)

@jakirkham jakirkham added documentation Improvements to the documentation good-first-issue Good place to get started as a new contributor. labels Nov 2, 2022

DON-BRAN commented Nov 2, 2022


@jakirkham I believe this can go under usage tips in the tutorials.

@jakirkham

cc @ivirshup @joshmoore

@jakirkham

@DON-BRAN, agree "usage tips" sounds like a good place to put this. Though feedback from anyone else interested is welcome 🙂


ivirshup commented Nov 3, 2022

Playing around with this, I realized that each indexing statement on the zarr array causes the file to be opened again, as opposed to creating one memory view that I could index into multiple times.

This is an issue for my use case, where I want to grab an irregularly spaced set of contiguous chunks, which zarr does not expose an API for.

It could also be useful to have an example of getting the memory-mapped array out of a store directly for this case.

@jakirkham

What about putting an LRUStoreCache in-between?


ivirshup commented Nov 3, 2022

I think the entire array is loaded into memory on the first read there (since it's just one chunk). Maybe this is addressable with meta_array?

If it weren't, would I have to be concerned about the file closing before the data is read? Especially if I used the stdlib mmap?

I've gotten something pretty performant using the mmap "chunk": https://gist.github.com/ivirshup/5c7df5ed10517abf6567a6a9af6c7eaa

Using the same case from the notebook (but using Python for both cases, instead of numba), I see about a 10x difference between going through the zarr.Array and grabbing the chunk directly:

from itertools import accumulate, chain

import numpy as np
import zarr

def get_compressed_vectors_old(
    data, indices, indptr, row_idxs
):
    # Gather the CSR rows given by row_idxs from (data, indices, indptr)
    slices = [slice(indptr[i], indptr[i + 1]) for i in row_idxs]
    out_data = np.concatenate([data[s] for s in slices])
    out_indices = np.concatenate([indices[s] for s in slices])
    out_indptr = list(accumulate(chain((0,), (s.stop - s.start for s in slices))))
    return out_data, out_indices, out_indptr

# mmap_store is the MemoryMappedDirectoryStore from the gist above
cache_store = zarr.storage.LRUStoreCache(mmap_store, max_size=None)
cache_group = zarr.group(cache_store)

%%timeit
results = get_compressed_vectors_old(
    cache_group["X/data"],
    cache_group["X/indices"],
    cache_group["X/indptr"][:],
    np.arange(1000)
)
# 73.6 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
# mmap_data, mmap_indices, indptr are the raw memory-mapped chunks from the gist
results = get_compressed_vectors_old(
    mmap_data,
    mmap_indices,
    indptr,
    np.arange(1000)
)
# 6.61 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jakirkham

Was compression disabled? Otherwise it may need to load the whole array into memory to decompress it.


ivirshup commented Nov 7, 2022

Ah whoops, posted my notebook, not the gist:

https://gist.github.com/ivirshup/5c7df5ed10517abf6567a6a9af6c7eaa

Yes, compression was turned off.

@jakirkham

Right, so I'm suggesting using LRUStoreCache(MemoryMappedDirectoryStore("mmap_store.zarr")) as mmap_store.


ivirshup commented Nov 8, 2022

Ah yeah, I'd added that in the comment above, but I'll add it to the gist too. It's now under the "trying a cache" header.

@jakirkham

There are a couple of avenues one could explore:

- One could use tracemalloc with the numpy.lib.tracemalloc_domain to catch any memory usage by NumPy and tie it back to the specific code that caused it
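A sketch of that approach (numpy.lib.tracemalloc_domain is NumPy's registered tracemalloc domain; the allocation here is just an illustration):

```python
import tracemalloc

import numpy as np

tracemalloc.start()
a = np.ones((1000, 1000))  # ~8 MB allocated through NumPy

# Keep only allocations made in NumPy's tracemalloc domain
snapshot = tracemalloc.take_snapshot().filter_traces(
    [tracemalloc.DomainFilter(inclusive=True, domain=np.lib.tracemalloc_domain)]
)
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)  # ties each allocation back to the line that caused it
```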
