Added Store.getsize #2426

TomAugspurger · 2024-10-21T12:36:50Z

One difference from Zarr v2: its getsize seemed to return -1 if the concrete backend didn't provide a getsize method. I think returning a "bad" integer like -1 from a function that returns integers is dangerous. I've implemented a slow but correct default that just reads the object and calls len on the bytes.
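
For illustration, the slow-but-correct default could look roughly like the sketch below. This is a simplified stand-in, not the code from this PR: `SketchStore`, `DictStore`, and the `get` signature returning `bytes | None` are assumptions made for the example, and the real `Store.get` takes more arguments and returns a buffer object.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod


class SketchStore(ABC):
    """Hypothetical, simplified stand-in for the Store ABC."""

    @abstractmethod
    async def get(self, key: str) -> bytes | None:
        """Return the stored bytes, or None if the key is absent."""

    async def getsize(self, key: str) -> int:
        # Slow but correct default: read the whole object and measure it.
        # Backends that can look up the size cheaply should override this.
        value = await self.get(key)
        if value is None:
            # Raise instead of returning a sentinel such as -1.
            raise FileNotFoundError(key)
        return len(value)


class DictStore(SketchStore):
    """In-memory store used only to exercise the default."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    async def get(self, key: str) -> bytes | None:
        return self._data.get(key)


store = DictStore({"a/zarr.json": b"{}", "a/c/0": b"\x00" * 16})
size = asyncio.run(store.getsize("a/c/0"))
```

A caller that forgets to check the result of this version gets an exception for a missing key, whereas a -1 sentinel would silently flow into sums and comparisons.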

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Closes zarr-developers#2420

src/zarr/abc/store.py

d-v-b

this looks good, and I like the safe, slow default over returning -1

src/zarr/storage/remote.py

TomAugspurger · 2024-10-23T20:21:51Z

I might be confusing myself, but I think this implementation might not be what we want... I think what users want (like us in #2400) is the size of an Array in storage, not the size of a particular key. I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

So maybe we do need this, since the store knows (or can figure out) what bytes are actually stored for a given array. But we also need a bit on top of it to bring it to the array level.

d-v-b · 2024-10-23T20:24:23Z

> I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

In case you want to go this direction, this method is designed for exactly such a use case

jhamman · 2024-10-23T21:18:47Z

I'll throw an idea into the mix. We probably want two things:

Store.get_size(key: str) -> int
Store.get_size_dir(path: str) -> int

Of course, list_dir + get_size would be the same as get_size_dir but some stores will be able to provide a fast path for the dir size.
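
A rough sketch of that split, under stated assumptions: `BaseStore`, `IndexedStore`, and the `list_keys` helper are hypothetical names invented for this example, and only `get_size` / `get_size_dir` follow the proposal above. The base class provides the generic "list, then measure each key" fallback, and a store that already tracks object sizes overrides it with a fast path.

```python
import asyncio


class BaseStore:
    """Hypothetical minimal store; method names follow the proposal above."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    async def get_size(self, key: str) -> int:
        return len(self._data[key])

    async def list_keys(self, path: str) -> list[str]:
        # Assumed helper: every key stored anywhere under ``path``.
        prefix = path.rstrip("/") + "/"
        return [k for k in self._data if k.startswith(prefix)]

    async def get_size_dir(self, path: str) -> int:
        # Generic fallback: list the keys, then ask for each size in turn.
        keys = await self.list_keys(path)
        return sum([await self.get_size(k) for k in keys])


class IndexedStore(BaseStore):
    """Store that already knows object sizes, e.g. from a manifest."""

    def __init__(self, data: dict[str, bytes]) -> None:
        super().__init__(data)
        self._sizes = {k: len(v) for k, v in data.items()}

    async def get_size_dir(self, path: str) -> int:
        # Fast path: answer from the size index without touching the objects.
        prefix = path.rstrip("/") + "/"
        return sum(n for k, n in self._sizes.items() if k.startswith(prefix))


data = {
    "arr/zarr.json": b"{}",
    "arr/c/0": b"abcd",
    "arr/c/1": b"efgh",
    "other/c/0": b"zz",
}
slow = asyncio.run(BaseStore(data).get_size_dir("arr"))
fast = asyncio.run(IndexedStore(data).get_size_dir("arr"))
```

Both calls should agree. In this sketch the metadata document `arr/zarr.json` is counted in the total, which is one possible answer to the question of whether metadata belongs in a directory size.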

TomAugspurger · 2024-10-24T12:31:37Z

Thanks. Looking at how Icechunk would implement getsize is what prompted my question, so I can see how a getsize_dir makes sense there.

Would you expect the size of metadata documents to show up in total for getsize_dir?

TomAugspurger · 2024-10-24T16:07:07Z

I've pushed an update that adds a getsize_prefix, but am having second thoughts about whether this is worth adding to the API. It's not clear to me that a Store will always have a well-defined size for an array (things like references or store-level sharding complicate things), and so maybe it doesn't make sense to add it to the store interface.

paraseba · 2024-11-04T20:06:00Z

src/zarr/abc/store.py

+        Parameters
+        ----------
+        prefix : str
+            The prefix of the directory to measure.


Can we offer implementers the following in the documentation?

This function will be called by zarr using a prefix that is the path of a group, an array, or the root. Implementations may exhibit undefined behavior when that is not the case.

Sure... I was hoping we could somehow ensure that we don't call it with anything other than a group / array / root path, but users can directly use Store.getsize_prefix and they can do whatever.

LMK if you want any more specific guidance on what to do (e.g. raise a ValueError). I'm hesitant about trying to force required exceptions into an ABC / interface.

> I'm hesitant about trying to force required exceptions into an ABC / interface.

And now I'm noticing that I've done exactly that in getsize, with requiring implementations to raise FileNotFoundError if the key isn't found :)

paraseba · 2024-11-04T20:11:30Z

src/zarr/abc/store.py

+        """
+        keys = [x async for x in self.list_prefix(prefix)]
+        sizes = await gather(*[self.getsize(key) for key in keys])
+        return sum(sizes)


This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?

Also, this has unlimited concurrency, for a potentially very large number of keys. It could easily create millions of async tasks. We should probably run in chunks limited by the value of the concurrency setting.

See concurrent_map for an example
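
To make the suggestion concrete, here is a minimal sketch of a concurrency-limited gather built on `asyncio.Semaphore`. It is not zarr's `concurrent_map`: `bounded_map` and `fake_getsize` are hypothetical names used only for this example.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    func: Callable[[T], Awaitable[R]], items: Iterable[T], limit: int
) -> list[R]:
    """Apply ``func`` to every item with at most ``limit`` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with semaphore:
            return await func(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run(item) for item in items))


in_flight = 0
peak = 0


async def fake_getsize(key: str) -> int:
    # Stand-in for a store call; tracks how many calls overlap.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0)  # pretend to do I/O
    in_flight -= 1
    return len(key)


keys = [f"c/{i}" for i in range(20)]
sizes = asyncio.run(bounded_map(fake_getsize, keys, limit=4))
```

Note that this bounds how many calls do work at once, but it still creates one task per key, so it only partly addresses the concern about millions of tasks.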

> This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?

I don't immediately see how that's possible.

The best I'm coming up with is a fold-like function that asynchronously iterates through keys from list_prefix and (asynchronously) calls self.getsize to update the size. Sounds kinda complicated.
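
A sketch of what such a fold could look like, assuming an async iterator of keys and an async size function. `sum_sizes`, `fake_keys`, and `fake_getsize` are hypothetical names for this example and are not part of zarr. The listing is paused whenever `limit` lookups are outstanding, so neither the keys nor the tasks are all held in memory at once.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable


async def sum_sizes(
    keys: AsyncIterator[str],
    getsize: Callable[[str], Awaitable[int]],
    limit: int,
) -> int:
    """Sum ``getsize(key)`` over ``keys`` with at most ``limit`` tasks alive."""
    total = 0
    pending: set[asyncio.Task[int]] = set()
    async for key in keys:
        pending.add(asyncio.ensure_future(getsize(key)))
        if len(pending) >= limit:
            # Pause the listing until at least one size lookup finishes.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            total += sum(task.result() for task in done)
    if pending:
        # Drain whatever is still outstanding once the listing ends.
        done, _ = await asyncio.wait(pending)
        total += sum(task.result() for task in done)
    return total


async def fake_keys(n: int) -> AsyncIterator[str]:
    for i in range(n):
        yield f"c/{i}"


async def fake_getsize(key: str) -> int:
    await asyncio.sleep(0)  # pretend to do I/O
    return 10


total = asyncio.run(sum_sizes(fake_keys(25), fake_getsize, limit=4))
```

It is more code than a single `gather`, which matches the "kinda complicated" assessment, but the moving parts are all standard `asyncio` primitives.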

FWIW, it looks like concurrent_map wants an iterable of items:

```
>       return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
E       TypeError: 'async_generator' object is not iterable
```

In 7cbc500 I've hacked in some support for AsyncIterable there. I haven't had enough coffee to figure out what the flow of

return await asyncio.gather(*[asyncio.ensure_future(run(item)) async for item in items])

is. I'm a bit worried the async for item in items is happening immediately, so we end up building that list of keys in memory anyway.
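
That worry can be checked with a small standalone experiment. The names `keys`, `run`, and `events` below are made up for this demo and have nothing to do with the zarr code. Python evaluates the whole list comprehension before `gather` is called, so the async generator is drained first and the complete list of tasks exists in memory.

```python
import asyncio

events: list[str] = []


async def keys():
    # Async generator standing in for a key listing.
    for i in range(3):
        events.append(f"listed {i}")
        yield i


async def run(item: int) -> int:
    events.append(f"ran {item}")
    return item


async def main() -> list[int]:
    tasks = [asyncio.ensure_future(run(item)) async for item in keys()]
    # By this point the comprehension has already exhausted ``keys()``.
    events.append(f"list built: {len(tasks)} tasks")
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
```

In `events`, every "listed" entry comes before the "list built" entry. If the generator awaited real I/O between keys, some tasks could start running during the listing, but the list would still be complete before `gather` receives it.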

> We should probably run in chunks limited by the value of the concurrency setting.

Fixed. We should probably replace all instances of asyncio.gather with a concurrency-limited version. I'll make a separate issue for that.

5f1d036 removed support for AsyncIterable in concurrent_map, replacing it with a TODO.

I think there's some discussion around improving our use of asyncio to handle cases like this (using queues to mediate task producers like list_prefix and consumers like getsize) that will address this.

The unbounded concurrency issue you raised is still fixed. It's just the loading of keys into memory that's not yet addressed.

TomNicholas · 2024-11-05T17:28:11Z

> am having second thoughts about whether this is worth adding to the API.

Something like this interface

Store.get_size(key: str) -> int

would be very useful for virtualizarr, as then we can easily and efficiently learn the byte range lengths of all objects in a store, in order to ingest existing zarr as virtual zarr.

EDIT: xref zarr-developers/VirtualiZarr#262 (comment)

jhamman

LGTM. Thanks @TomAugspurger :)

src/zarr/abc/store.py

d-v-b

looks good!

tests/test_array.py

d-v-b · 2024-11-14T15:45:20Z

ah, we are getting some test failures after bringing in the latest changes from main

TomAugspurger · 2024-11-14T16:29:29Z

Should be all set now.

Added Store.getsize

5e0ffe8

Closes zarr-developers#2420

TomAugspurger added the V3 label Oct 21, 2024

TomAugspurger commented Oct 21, 2024

d-v-b approved these changes Oct 21, 2024

jhamman reviewed Oct 21, 2024

TomAugspurger added 2 commits October 22, 2024 07:55

fixups

1926e19

lint

12963ab

TomNicholas mentioned this pull request Oct 23, 2024

Add Zarr Reader(s) zarr-developers/VirtualiZarr#262

Closed

2 tasks

TomAugspurger added 4 commits October 24, 2024 07:32

wip

384d323

Use prefix

c39e03c

fixup

87d2a9e

Merge remote-tracking branch 'upstream/main' into tom/feature/object-…

8ba85ec

…size

norlandrhagen mentioned this pull request Oct 24, 2024

Zarr reader zarr-developers/VirtualiZarr#271

Merged

22 tasks

paraseba reviewed Nov 4, 2024

TomAugspurger added 2 commits November 5, 2024 08:52

Merge remote-tracking branch 'upstream/main' into tom/feature/object-…

1cdfd6d

…size

Maybe fixup

7cbc500

TomAugspurger added 6 commits November 8, 2024 06:44

lint

ade17d2

Merge remote-tracking branch 'upstream/main' into tom/feature/object-…

7231d7c

…size

revert buffer chnages

81c4b7e

fixup

ce548e2

fixup

4350e53

Remove AsyncIterable support

5f1d036

jhamman approved these changes Nov 13, 2024

jhamman requested changes Nov 14, 2024

d-v-b approved these changes Nov 14, 2024

d-v-b and others added 2 commits November 14, 2024 16:33

Merge branch 'main' into tom/feature/object-size

a688296

fixup

783cfe3

jhamman approved these changes Nov 14, 2024

jhamman merged commit f74e53a into zarr-developers:main Nov 14, 2024
26 checks passed
