-
-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Store.getsize #2426
base: main
Are you sure you want to change the base?
Added Store.getsize #2426
Changes from 7 commits
5e0ffe8
1926e19
12963ab
384d323
c39e03c
87d2a9e
8ba85ec
1cdfd6d
7cbc500
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,8 @@ | |
from itertools import starmap | ||
from typing import TYPE_CHECKING, NamedTuple, Protocol, runtime_checkable | ||
|
||
from zarr.core.buffer.core import default_buffer_prototype | ||
|
||
if TYPE_CHECKING: | ||
from collections.abc import AsyncGenerator, Iterable | ||
from types import TracebackType | ||
|
@@ -197,14 +199,17 @@ def __eq__(self, value: object) -> bool: | |
async def get( | ||
self, | ||
key: str, | ||
prototype: BufferPrototype, | ||
prototype: BufferPrototype | None = None, | ||
byte_range: ByteRangeRequest | None = None, | ||
) -> Buffer | None: | ||
"""Retrieve the value associated with a given key. | ||
|
||
Parameters | ||
---------- | ||
key : str | ||
prototype : BufferPrototype, optional | ||
The prototype giving the buffer classes to use for buffers and nbuffers. | ||
By default, :func:`zarr.buffer.default_buffer_prototype` is used. | ||
byte_range : tuple[int | None, int | None], optional | ||
|
||
Returns | ||
|
@@ -386,6 +391,60 @@ async def _get_many( | |
for req in requests: | ||
yield (req[0], await self.get(*req)) | ||
|
||
async def getsize(self, key: str) -> int: | ||
""" | ||
Return the size, in bytes, of a value in a Store. | ||
|
||
Parameters | ||
---------- | ||
key : str | ||
|
||
Returns | ||
------- | ||
nbytes : int | ||
The size of the value (in bytes). | ||
|
||
Raises | ||
------ | ||
FileNotFoundError | ||
When the given key does not exist in the store. | ||
""" | ||
# Note to implementers: this default implementation is very inefficient since | ||
# it requires reading the entire object. Many systems will have ways to get the | ||
# size of an object without reading it. | ||
value = await self.get(key, prototype=default_buffer_prototype()) | ||
if value is None: | ||
raise FileNotFoundError(key) | ||
return len(value) | ||
|
||
async def getsize_prefix(self, prefix: str) -> int: | ||
""" | ||
Return the size, in bytes, of all values under a prefix. | ||
|
||
Parameters | ||
---------- | ||
prefix : str | ||
The prefix of the directory to measure. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we offer implementers the following in documentation?: This function will be called by zarr using a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure... I was hoping we could somehow ensure that we don't call it with anything other than a group / array / root path, but users can directly use LMK if you want any more specific guidance on what to do (e.g. raise a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And now I'm noticing that I've done exactly that in |
||
|
||
Returns | ||
------- | ||
nbytes : int | ||
The sum of the sizes of the values in the directory (in bytes). | ||
|
||
See Also | ||
-------- | ||
zarr.Array.nbytes_stored | ||
Store.getsize | ||
|
||
Notes | ||
----- | ||
``getsize_prefix`` is just provided as a potentially faster alternative to | ||
listing all the keys under a prefix calling :meth:`Store.getsize` on each. | ||
""" | ||
keys = [x async for x in self.list_prefix(prefix)] | ||
sizes = await gather(*[self.getsize(key) for key in keys]) | ||
return sum(sizes) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This materializes the full list of keys in memory, can we maintain the generator longer to avoid that? Also, this has unlimited concurrency, for a potentially very large number of keys. It could easily create millions of async tasks. We should probably run in chunks limited by the value of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't immediately see how that's possible. The best I'm coming up with is a fold-like function that asynchronously iterates through keys from FWIW, it looks like
In 7cbc5003eb4d397a7cf0f019bad85cd1c3d0927a I've hacked in some support for AsyncIterable there. I haven't had enough coffee to figure out what the flow of
is. I'm a bit worried the
Fixed. We should probably replace all instances of |
||
|
||
|
||
@runtime_checkable | ||
class ByteGetter(Protocol): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
from zarr.core.buffer import default_buffer_prototype | ||
|
||
__all__ = [ | ||
"default_buffer_prototype", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason
default_buffer_prototype()
isn't the default toget
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a good reason to avoid dynamic default arguments, e.g. function invocations (and I think there's a linting rule that complains about such things). But that doesn't change the fact that
default_buffer_prototype()
is the default, which makes it annoying to specify. One idea that came up in the past was to statically associate a store with a buffer prototype, e.g. by making the store class generic, but this would reduce the flexibility of the store... although whether we realistically need the same store instance to return multiple different types of buffers is yet to be seen.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be OK with
defualt_buffer_prototype
being an optional argument (| None
, with a default of None, which means usedefault_buffer_prototype()
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, some value like
None
or the literal string"default"
would be perfect for this, at least in my opinion.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small bits of awkwardness:
None / "default"
forprototype
means when they're implementingget
. I think that's fine, but merits documenting.zarr.core
is private, butzarr.core.buffer
isn't re-exported from anywhere else. I'll add it tozarr.buffer
unless anyone objects.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thread came up in #1661, where I was struggling to understand the purpose of the prototype argument.
I think it would be nice to either statically associate stores with buffer prototypes or remove the prototypes altogether and let a store return a protocol instead of a specific class.