-
-
Notifications
You must be signed in to change notification settings - Fork 331
Added Store.getsize #2426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Store.getsize #2426
Changes from all commits
5e0ffe8
1926e19
12963ab
384d323
c39e03c
87d2a9e
8ba85ec
1cdfd6d
7cbc500
ade17d2
7231d7c
81c4b7e
ce548e2
4350e53
5f1d036
a688296
783cfe3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,10 @@ | |
from itertools import starmap | ||
from typing import TYPE_CHECKING, Protocol, runtime_checkable | ||
|
||
from zarr.core.buffer.core import default_buffer_prototype | ||
from zarr.core.common import concurrent_map | ||
from zarr.core.config import config | ||
|
||
if TYPE_CHECKING: | ||
from collections.abc import AsyncGenerator, AsyncIterator, Iterable | ||
from types import TracebackType | ||
|
@@ -344,6 +348,70 @@ async def _get_many( | |
for req in requests: | ||
yield (req[0], await self.get(*req)) | ||
|
||
async def getsize(self, key: str) -> int: | ||
""" | ||
Return the size, in bytes, of a value in a Store. | ||
|
||
Parameters | ||
---------- | ||
key : str | ||
|
||
Returns | ||
------- | ||
nbytes : int | ||
The size of the value (in bytes). | ||
|
||
Raises | ||
------ | ||
FileNotFoundError | ||
When the given key does not exist in the store. | ||
""" | ||
# Note to implementers: this default implementation is very inefficient since | ||
# it requires reading the entire object. Many systems will have ways to get the | ||
# size of an object without reading it. | ||
value = await self.get(key, prototype=default_buffer_prototype()) | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if value is None: | ||
raise FileNotFoundError(key) | ||
return len(value) | ||
|
||
async def getsize_prefix(self, prefix: str) -> int: | ||
""" | ||
Return the size, in bytes, of all values under a prefix. | ||
|
||
Parameters | ||
---------- | ||
prefix : str | ||
The prefix of the directory to measure. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we offer implementers the following in documentation?: This function will be called by zarr using a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure... I was hoping we could somehow ensure that we don't call it with anything other than a group / array / root path, but users can directly use LMK if you want any more specific guidance on what to do (e.g. raise a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And now I'm noticing that I've done exactly that in |
||
|
||
Returns | ||
------- | ||
nbytes : int | ||
The sum of the sizes of the values in the directory (in bytes). | ||
|
||
See Also | ||
-------- | ||
zarr.Array.nbytes_stored | ||
Store.getsize | ||
|
||
Notes | ||
----- | ||
``getsize_prefix`` is just provided as a potentially faster alternative to | ||
listing all the keys under a prefix calling :meth:`Store.getsize` on each. | ||
|
||
In general, ``prefix`` should be the path of an Array or Group in the Store. | ||
Implementations may differ on the behavior when some other ``prefix`` | ||
is provided. | ||
""" | ||
# TODO: Overlap listing keys with getsize calls. | ||
# Currently, we load the list of keys into memory and only then move | ||
# on to getting sizes. Ideally we would overlap those two, which should | ||
# improve tail latency and might reduce memory pressure (since not all keys | ||
# would be in memory at once). | ||
keys = [(x,) async for x in self.list_prefix(prefix)] | ||
limit = config.get("async.concurrency") | ||
sizes = await concurrent_map(keys, self.getsize, limit=limit) | ||
return sum(sizes) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This materializes the full list of keys in memory, can we maintain the generator longer to avoid that? Also, this has unlimited concurrency, for a potentially very large number of keys. It could easily create millions of async tasks. We should probably run in chunks limited by the value of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't immediately see how that's possible. The best I'm coming up with is a fold-like function that asynchronously iterates through keys from FWIW, it looks like
In 7cbc500 I've hacked in some support for AsyncIterable there. I haven't had enough coffee to figure out what the flow of
is. I'm a bit worried the
Fixed. We should probably replace all instances of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 5f1d036 removed support for I think there's some discussion around improving our use of asyncio to handle cases like this (using queues to mediate task producers like list_prefix and consumers like The unbounded concurrency issue you raised, is still fixed. It's just the loading of keys into memory that's not yet addressed. |
||
|
||
|
||
@runtime_checkable | ||
class ByteGetter(Protocol): | ||
|
Uh oh!
There was an error while loading. Please reload this page.