memmap reads from directory store #265
Hi @artttt, thanks a lot for sharing.

This is a neat idea. It would only work where you have an array with no compressor and no filters: as soon as you introduce a compressor or any filters, then in general an entire chunk needs to be read in order to obtain any data item within the chunk, because the data need to be decoded. There is some related discussion that may be of interest, regarding possible optimisations where you want to extract only a few items from each chunk: #40.

Out of interest, if this only works without compression, what do you gain from using zarr over a plain numpy memmap?
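(For concreteness, such a store would only apply to arrays created without a compressor or filters, along the lines of the following sketch; the path, shape, and chunking are illustrative.)

import zarr

# No compressor and no filters, so each chunk file holds the raw array
# bytes, which is what makes memory-mapping them possible.
z = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000),
              chunks=(1000, 1000), dtype='f8', compressor=None, filters=None)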
…On Monday, 4 June 2018, artttt wrote:

I've only recently started using zarr, but I'm impressed. Well done.

I want to share an experience and a possible enhancement.

In one of my use cases I use vindex heavily across the whole array. I know this is likely a worst-case scenario, as zarr is reading many, many chunks for a small amount of data in each one.

I was previously using numpy memmap arrays for a similar purpose and it was much faster, so I wondered whether an uncompressed DirectoryStore would read chunks as a memmap. No luck; it was still reading full chunks. So I had a go at subclassing DirectoryStore to do this:
import os

import numpy as np
import zarr

class MemMapReadStore(zarr.DirectoryStore):
    """Directory store using memmap for reading chunks."""

    def __getitem__(self, key):
        filepath = os.path.join(self.path, key)
        if not os.path.isfile(filepath):
            raise KeyError(key)
        # Are there only 2 types of files, .zarray and the chunks?
        # (.zattrs would presumably need the same treatment.)
        if key == '.zarray':
            # Metadata must be returned as plain bytes.
            with open(filepath, 'rb') as f:
                return f.read()
        # Everything else is assumed to be an uncompressed chunk.
        return np.memmap(filepath, mode='r')
It's working well for me, but I don't really know the inner workings of zarr, so who knows what I might have broken or which other features it won't play well with. I thought the idea might be a basis for an enhancement though. Worth sharing at least.

The speed-up depends on access pattern, compression, etc., but for the example I'm testing I'm seeing a 22x speed-up versus a compressed zarr array of the same dimensions and chunking.

It's only working for reads, as that was all I needed, and I see that writing replaces the whole chunk, so memmap writes are not doable.
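For context, a store like the MemMapReadStore above would be used along these lines (a sketch; the path is illustrative, and the array must have been created with compressor=None and no filters):

import zarr

store = MemMapReadStore('data/uncompressed.zarr')
z = zarr.Array(store, read_only=True)
# Coordinate indexing touches many chunks for only a few items, which is
# exactly where memory-mapping avoids reading each full chunk.
values = z.vindex[[0, 10, 20], [5, 5, 5]]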
|
Thanks for the pointer to #40. It looks like the possibility of a blosc_getitem type of access into a compressed chunk would be a great feature.

My data is dense in parts and sparse in others, so it has many empty chunks; if I stored the full array it would be too big for my disk. Zarr deals with this nicely. So does a plain numpy memmap on platforms where a sparse file can be bigger than the disk, but on Windows it seems sparse files can't be bigger than the space available. So at a minimum zarr gives me cross-platform compatibility. Also, most of my use cases will benefit from compressed data, so it's nice to have both options in one, even if I have to decide at creation time which storage is going to be best.

Additionally, I've been trying the LRU cache over the DirectoryStore and seeing some benefits, particularly for the uncompressed zarr (not using the memmapped store). I then made an LRU cache that takes a compressed store and caches the uncompressed data. It gives about a 3x speed-up for me when I get cache hits, at the expense of more RAM usage.

Anyway, I feel I'm prematurely optimising as a distraction. I should just get on with using zarr for my work.
|
Thanks @artttt for the follow-up, it's great to get some feedback on how zarr is being used.

I think there is a legitimate case for implementing a version of DirectoryStore that returns mmap instead of bytes from __getitem__. Python's mmap module looks like it fits the bill nicely. As discussed, it would only provide benefit for arrays without a compressor or filters, but it sounds like this would still be useful.
Regarding making use of something like blosc_getitem, that would be cool to explore, but it is not a trivial amount of work. I'm unlikely to have time to explore that myself, but it would make a fun/challenging project.

The caching layer for decompressed/decoded data is another interesting idea, and one that has come up in other discussions. I'd be open to discussion of how to provide the right hooks in the API so that it could be implemented without having to hack into the metadata.
…On Tue, 5 Jun 2018, 03:03 artttt wrote:
import zarr
from zarr.codecs import get_codec
from zarr.compat import OrderedDict_move_to_end
from zarr.meta import decode_array_metadata, encode_array_metadata

class LRUStoreCacheDecoded(zarr.LRUStoreCache):
    """Same as LRUStoreCache, but caches chunks decoded, for faster access
    at the expense of higher RAM usage.
    NOTE: not fully tested. Will likely break for use beyond reading a zarr.
    """

    _compressor = None  # set once '.zarray' has been read

    def __getitem__(self, key):
        # This is not the right place for this, but for trying it out it
        # will do: grab the compressor from the metadata, then tell the
        # next user in the chain that the data are uncompressed.
        if key == '.zarray':
            value = self._store[key]
            meta = decode_array_metadata(value)
            config = meta['compressor']
            self._compressor = None if config is None else get_codec(config)
            meta['compressor'] = None
            return encode_array_metadata(meta)
        try:
            # first try to obtain the value from the cache
            with self._mutex:
                value = self._values_cache[key]
                # cache hit if no KeyError is raised
                self.hits += 1
                # treat the end as most recently used
                OrderedDict_move_to_end(self._values_cache, key)
        except KeyError:
            # cache miss, retrieve value from the store
            value = self._store[key]
            # decode anything read from the store straight away, so the
            # cache only ever holds decoded chunks
            if self._compressor:
                value = self._compressor.decode(value)
            with self._mutex:
                self.misses += 1
                # check the key is still absent, as it may have been cached
                # while we were retrieving the value from the store
                if key not in self._values_cache:
                    self._cache_value(key, value)
        return value
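For reference, the cache above wraps a store the same way the stock LRUStoreCache does (a sketch; the path and size are illustrative):

import zarr

store = zarr.DirectoryStore('data/compressed.zarr')
# max_size now bounds bytes of *decoded* chunks, so the same budget holds
# fewer chunks than it would in the stock LRUStoreCache.
cache = LRUStoreCacheDecoded(store, max_size=2**30)
z = zarr.open_array(cache, mode='r')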
|
I have a related use case that I believe would benefit from introducing a chunk-based decompressed cache. I use zarr for storing data that will be used for training neural networks. For this use case, you often want to sample random (or almost random) rows from the dataset. If the sampling is mostly localized within a chunk, it would be great if the LRU cache could cache an entire chunk so we can take advantage of spatial locality. For example, I would like to sample data points [1, 5, 8, 3, 2]; because these all reside in the same compressed chunk (cached by the LRU), only reading the first sample should be slow, and the rest should already be cached in memory.
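For what it's worth, the stock LRUStoreCache already caches at chunk granularity, albeit holding the compressed bytes; a sketch of the access pattern described, with an illustrative path and cache size:

import zarr

store = zarr.DirectoryStore('data/train.zarr')
cache = zarr.LRUStoreCache(store, max_size=2**28)  # ~256 MB of chunk bytes
z = zarr.open_array(cache, mode='r')

for i in [1, 5, 8, 3, 2]:
    # The first read pulls the whole compressed chunk into the cache; the
    # rest skip the disk read, though each still pays the decode cost
    # (which is what the decoded cache above avoids).
    row = z[i]
|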
FWIW have been thinking about this more lately. Am wondering if we shouldn't try to use memory-mapped files in all directory store cases; Python's mmap module makes this straightforward. Admittedly there are some exotic filesystems where we will need to fall back to reading the whole chunk into bytes.

Admittedly this is less relevant for this issue, but with PR ( zarr-developers/numcodecs#121 ) we should be able to leverage memory-mapped files nicely in our compressors. The result being we can stream data from disk (optionally) through compressors into NumPy arrays returned to the user. Should be useful for performance and generally handling large chunks.
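As a rough sketch of that streaming idea (the file name is hypothetical, and mmap.PROT_READ is POSIX-only):

import mmap

import numpy as np
from numcodecs import Blosc

codec = Blosc()
with open('chunk.blosc', 'rb') as f:
    # mmap exposes the buffer protocol, so the codec can read straight
    # from the page cache without an intermediate bytes copy.
    buf = memoryview(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
decoded = np.frombuffer(codec.decode(buf), dtype='f8')
|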
@jakirkham nice ideas. Would be worth some simple benchmarking to verify using mmap does indeed improve performance. It may sound like it should reduce memory copies, but there could be subtleties that make it not so obvious.
|
Good point. What would we consider a fair dataset to use for these benchmarks?
|
I guess to notice an improvement from removing a memory copy you'd want something big enough, maybe at least 1 MB? Also you'd want to feed it to a very fast codec, e.g., LZ4. For many codecs the overhead of a memory copy is very small compared to the time spent encoding or decoding.
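A crude benchmark along those lines might look like the following sketch (the sizes, codec, and file name are illustrative, and the OS page cache will keep both variants warm):

import mmap
import timeit

import numpy as np
from numcodecs import LZ4

codec = LZ4()
data = np.arange(2**21, dtype='u1')  # ~2 MB, mildly compressible
with open('chunk.lz4', 'wb') as f:
    f.write(codec.encode(data))

def decode_from_bytes():
    with open('chunk.lz4', 'rb') as f:
        return codec.decode(f.read())

def decode_from_mmap():
    with open('chunk.lz4', 'rb') as f:
        return codec.decode(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))

print('bytes:', timeit.timeit(decode_from_bytes, number=100))
print('mmap: ', timeit.timeit(decode_from_mmap, number=100))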
|
To be fair, the gain of avoiding a memory copy may be negligible given the typical overhead of reading data from disk. And benchmarking this kind of thing can be complicated, as the OS will do its own memory-mapping, so benchmark results will not be representative of real-world results when reading cold from disk. That shouldn't stop us investigating this though; there may be reasons other than bare performance for switching to use mmap (e.g., the original use case on this issue).

FWIW as long as using mmap doesn't hurt performance, I'd be tempted to use it.
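(On Linux, one way to approximate reading cold from disk between benchmark runs is to drop the page cache; a sketch, which needs root:)

import subprocess

def drop_page_cache():
    # Flush dirty pages, then ask the kernel to drop clean page-cache
    # pages so the next read really comes from disk.
    subprocess.run(['sync'], check=True)
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')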
|
One downside of memory-mapping is that you cannot really control the amount of memory used by the OS for caching (see e.g. numpy/numpy#7732). In a real use case on an HPC system where there seems to be sufficient memory available, the library might use all of it and the job gets killed by the scheduler for using more memory than reserved. So keeping the possibility to opt out in certain cases might be desirable.
|
What if we just added a flag to DirectoryStore to enable memory-mapping?
|
FWIW I'd be happy with that if the implementation is straightforward. I'd also be happy with adding a separate store class if that's easier/simpler.
|
Partly suggesting a flag as memory-mapping feels like a user optimization (perhaps one that different users want or don't want), as opposed to a fundamentally different way of storing the data (e.g. a different store class).
|
Put together PR ( #377 ), which adds the memory-mapping option discussed above.
|
Since PR ( #377 ) was opened, we added PR ( #503 ), which allows users to customize how reading occurs by overriding the _fromfile method. For example:

import mmap

from zarr import DirectoryStore

class MemoryMappedDirectoryStore(DirectoryStore):

    def _fromfile(self, fn):
        with open(fn, "rb") as fh:
            return memoryview(mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ))

This store can then be used with Zarr like any other store (see the sketch below). Given a user can do this on their own easily, have turned this into a doc issue ( #1245 ). Closing this out.
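A minimal usage sketch (the path is illustrative; note that mmap.PROT_READ is POSIX-only, so on Windows one would pass access=mmap.ACCESS_READ instead):

import zarr

store = MemoryMappedDirectoryStore('data/example.zarr')
z = zarr.open_array(store, mode='r')
|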