Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Store.getsize #2426

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Contributor

Closes #2420

One difference from Zarr v2, its getsize seemed to return -1 if the concrete backend didn't provide a getsize method. I think returning a "bad" integer like from a function that returns integers is dangerous. I've implemented a slow but correct default that just reads the object and calls len on the bytes.

[Description of PR]

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@TomAugspurger TomAugspurger added the V3 Affects the v3 branch label Oct 21, 2024
# Note to implementers: this default implementation is very inefficient since
# it requires reading the entire object. Many systems will have ways to get the
# size of an object without reading it.
value = await self.get(key, prototype=default_buffer_prototype())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason default_buffer_prototype() isn't the default to get?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a good reason to avoid dynamic default arguments, e.g. function invocations (and I think there's a linting rule that complains about such things). But that doesn't change the fact that default_buffer_prototype() is the default, which makes it annoying to specify. One idea that came up in the past was to statically associate a store with a buffer prototype, e.g. by making the store class generic, but this would reduce the flexibility of the store... although whether we realistically need the same store instance to return multiple different types of buffers is yet to be seen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be OK with defualt_buffer_prototype being an optional argument (| None, with a default of None, which means use default_buffer_prototype()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, some value like None or the literal string "default" would be perfect for this, at least in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small bits of awkwardness:

  1. We'll now require Store implementors to read the documentation to understand what the value of None / "default" for prototype means when they're implementing get. I think that's fine, but merits documenting.
  2. I might have missed it, but IIUC zarr.core is private, but zarr.core.buffer isn't re-exported from anywhere else. I'll add it to zarr.buffer unless anyone objects.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea that came up in the past was to statically associate a store with a buffer prototype

This thread came up in #1661, where I was struggling to understand the purpose of the prototype argument.

I think it would be nice to either statically associate stores with buffer prototypes or remove the prototypes altogether and let a store return a protocol instead of a specific class.

Copy link
Contributor

@d-v-b d-v-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good, and I like the safe, slow default over returning -1

src/zarr/storage/remote.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

I might be confusing myself, but I think this implementation might not be what we want... I think what users want (like us in #2400) is the size of an Array in storage, not the size of a particular key. I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

So maybe we do need this, since the store is knows (or can figure out) what bytes are actually stored for a given array. But we also need a bit on top of it to bring it to the array level.

@d-v-b
Copy link
Contributor

d-v-b commented Oct 23, 2024

I guess we could do something like generate all the keys for a given array and then call store.getsize with each of those keys...

In case you want to go this direction, this method is designed for exactly such a use case

@jhamman
Copy link
Member

jhamman commented Oct 23, 2024

I'll throw an idea into the mix. We probably want two things:

  • Store.get_size(key: str) -> int
  • Store.get_size_dir(path: str) -> int

Of course, list_dir + get_size would be the same as get_size_dir but some stores will be able to provide a fast path for the dir size.

@TomAugspurger
Copy link
Contributor Author

Thanks. Looking at how Icechunk would implement getsize is what prompted my question so I can se how a getsize_dir makes sense there.

Would you expect the size of metadata documents to show up in total for getsize_dir?

@TomAugspurger
Copy link
Contributor Author

I've pushed an update that adds a getsize_prefix, but am having second thoughts about whether this is worth adding to the API. It's not clear to me that a Store will always have a well-defined size for an array (things like references or store-level sharding complicate things), and so maybe it doesn't make sense to add it to the store interface.

Parameters
----------
prefix : str
The prefix of the directory to measure.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we offer implementers the following in documentation?:

This function will be called by zarr using a prefix that is the path of a group, an array, or the root. Implementations can choose to do undefined behavior when that is not the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure... I was hoping we could somehow ensure that we don't call it with anything other than a group / array / root path, but users can directly use Store.getsize_prefix and they can do whatever.

LMK if you want any more specific guidance on what to do (e.g. raise a ValueError). I'm hesitant about trying to force required exceptions into an ABC / interface.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant about trying to force required exceptions into an ABC / interface.

And now I'm noticing that I've done exactly that in getsize, with requiring implementations to raise FileNotFoundError if the key isn't found :)

"""
keys = [x async for x in self.list_prefix(prefix)]
sizes = await gather(*[self.getsize(key) for key in keys])
return sum(sizes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?

Also, this has unlimited concurrency, for a potentially very large number of keys. It could easily create millions of async tasks. We should probably run in chunks limited by the value of the concurrency setting.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See concurrent_map for an example

Copy link
Contributor Author

@TomAugspurger TomAugspurger Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This materializes the full list of keys in memory, can we maintain the generator longer to avoid that?

I don't immediately see how that's possible.

The best I'm coming up with is a fold-like function that asynchronously iterates through keys from list_prefix and (asynchronously) calls self.getsize to update the size. Sounds kinda complicated.

FWIW, it looks like concurrent_map wants an iterable of items:

>           return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
E           TypeError: 'async_generator' object is not iterable```

In 7cbc5003eb4d397a7cf0f019bad85cd1c3d0927a I've hacked in some support for AsyncIterable there. I haven't had enough coffee to figure out what the flow of

return await asyncio.gather(*[asyncio.ensure_future(run(item)) async for item in items])

is. I'm a bit worried the async for item in items is happening immediately, so we end up building that list of keys in memory anyway.

We should probably run in chunks limited by the value of the concurrency setting.

Fixed. We should probably replace all instances of asyncio.gather with a concurrency-limited version. I'll make a separate issue for that.

@TomNicholas
Copy link
Member

TomNicholas commented Nov 5, 2024

am having second thoughts about whether this is worth adding to the API.

Something like this interface

Store.get_size(key: str) -> int

would be very useful for virtualizarr, as then we can easily and efficiently learn the byte range lengths of all objects in a store, in order to ingest existing zarr as virtual zarr.

EDIT: xref zarr-developers/VirtualiZarr#262 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Affects the v3 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v3]: Add Store method for getting the size of an item.
6 participants