Multi-threading meta-data lookups #575
cc @rabernat
It seems this problem could be solved independently of the …
Actually, the bigger problem is generating the list of all the keys, which can be slow if there are a lot of keys. Is there a faster way to generate the list of just the metadata keys? This is pretty easy to do for a file-system or using …
Hi @nbren12, this is one of the issues we're trying to address in the v3 spec, where you can list all keys with the prefix "meta/" to get just the metadata keys. Many stores have a fast native implementation of listing keys with a given prefix, so this should generally perform well. For v2 implementations this is harder to achieve. The only option I'm aware of is to use separate metadata and chunk stores, i.e., you could do something like this with the current library:
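A minimal sketch of that approach (the store paths are placeholders; it assumes zarr v2's `chunk_store` argument to `open_group`):

```python
# Sketch, assuming zarr v2: keep metadata and chunks in two separate
# stores, so listing the metadata store never touches chunk keys.
import zarr

meta_store = zarr.DirectoryStore('example_meta')     # placeholder: holds .zgroup/.zarray/.zattrs
chunk_store = zarr.DirectoryStore('example_chunks')  # placeholder: holds chunk data only

root = zarr.open_group(store=meta_store, chunk_store=chunk_store, mode='a')
z = root.create_dataset('x', shape=(1000, 1000), chunks=(100, 100), dtype='f8')
z[:] = 1.0

# meta_store now contains only metadata keys, so listing it is cheap.
print(list(meta_store))
```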
Thanks for the reply. It's good to know you are working on a solution for v3. For a flat zarr group, it is very fast to look up all the metadata files, provided you can use a filesystem abstraction. E.g.:
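Something along these lines (a sketch; the bucket path is hypothetical, and `gcs` stands in for any fsspec backend):

```python
# For a flat group (all arrays directly under the root), a couple of
# globs find every metadata file without listing any chunk keys.
import fsspec

fs = fsspec.filesystem('gcs')  # requires gcsfs; 's3' or 'file' work the same way
meta_keys = list(fs.glob('my-bucket/store.zarr/.z*'))        # root-level metadata
for fname in ('.zarray', '.zattrs'):
    meta_keys += fs.glob(f'my-bucket/store.zarr/*/{fname}')  # per-array metadata
```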
Using this trick plus multi-threading, I was able to consolidate the metadata of a zarr with millions of chunks in about 1 second.
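A sketch of that combination (same hypothetical bucket as above; the worker count is an arbitrary choice, not a tuned value):

```python
# Discover metadata keys with a glob, then fetch them concurrently.
from concurrent.futures import ThreadPoolExecutor
import fsspec

fs = fsspec.filesystem('gcs')
meta_keys = fs.glob('my-bucket/store.zarr/*/.z*')  # flat group, one level deep

with ThreadPoolExecutor(max_workers=32) as pool:
    blobs = pool.map(fs.cat, meta_keys)  # fs.cat reads one object as bytes

consolidated = dict(zip(meta_keys, blobs))  # key -> raw metadata bytes
```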
Separating the metadata and data in this way does seem like a simple approach. However, the ZarrStore abstraction seems to be in a somewhat awkward middle ground between a flat key-value mapping and a hierarchical file system.

Perhaps it would make sense to formalize the concept of an "Index" (which could have a tree-like structure) which would manage keys and provide fast look-ups for groups of keys (e.g. metadata). This would also decouple the naming of the chunks in object storage from how they are interpreted by zarr, which can have performance implications. (I remember reading somewhere that naming objects with random UIDs improves performance in object storage.) Just some food for thought.
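A hypothetical sketch of what such an interface could look like (the names are invented for illustration, not an existing zarr API):

```python
# An "Index" owns key enumeration and logical->physical key mapping,
# independently of the store that holds the values.
from typing import Iterator, Protocol

class Index(Protocol):
    def list_prefix(self, prefix: str) -> Iterator[str]:
        """Yield all logical keys under a prefix, e.g. 'meta/'."""
        ...

    def resolve(self, key: str) -> str:
        """Map a logical zarr key to a physical storage name, which could
        be randomized to spread load across an object store."""
        ...
```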
Thanks, very good to know this works!
FWIW this is the same middle ground where object stores like GCS, S3, etc. are. IIUC they are effectively a flat mapping of keys (object names) to values (objects), but they do provide efficient functions for querying keys "hierarchically" based on a prefix and a delimiter. It seems reasonable to assume that a range of different store types could offer this same functionality. I.e., stores don't need to support the full file-system abstraction, they just need the ability to support the mapping abstraction plus querying keys based on prefix.
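For concreteness, this is the prefix-plus-delimiter query that S3 exposes natively (the bucket name is hypothetical; `list_objects_v2` is boto3's real API):

```python
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(
    Bucket='my-bucket',    # hypothetical bucket
    Prefix='store.zarr/',  # only keys under this pseudo-directory
    Delimiter='/',         # roll deeper keys up into CommonPrefixes
)
objects = [o['Key'] for o in resp.get('Contents', [])]           # direct "files"
subdirs = [p['Prefix'] for p in resp.get('CommonPrefixes', [])]  # "directories"
```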
FWIW I think there are two separate questions here. One is how to get fast listing of metadata keys. The other is how to get the best I/O performance for reading and writing chunks.

WRT the first question, store types like cloud object stores, local file systems and key-value databases should all natively support efficient key lookups based on a prefix query, so I don't think decoupling of keys is necessary. I.e., those stores already implement their own indexing internally.

WRT the second question, I have also heard that I/O on cloud object stores is distributed based on the object name, and so objects likely to be accessed concurrently shouldn't have similar names, but I haven't seen or heard any benchmarking showing that this makes a difference in practice. If it does turn out to be true, I think protocol extensions could be designed that decouple zarr keys from storage locations. E.g., zarr-developers/zarr-specs#82.
Much appreciated 😄
Note that fsspec-based stores are starting to support concurrent fetches of multiple keys (i.e., `getitems`).
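For example (the URL is hypothetical; `getitems` is the fsspec mapper's batched-fetch method):

```python
import fsspec

m = fsspec.get_mapper('gcs://my-bucket/store.zarr')
# One call fetches all the keys, concurrently where the backend allows;
# on_error='omit' silently skips keys that don't exist.
values = m.getitems(['.zgroup', 'x/.zarray', 'x/.zattrs'], on_error='omit')
```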
The main bottleneck here was listing the keys (which is quite costly for millions of chunks). If the fsspec-based stores supported a `listdir`, that would help here.
The FSStore in zarr v2 does support listdir, but using consolidated stores (all the metadata in a single file) would still be better, avoiding any listing at all. A clarification on the behaviour of object stores: …
Thanks. I fully agree that consolidated metadata is the way to go. I'm just proposing a faster way to consolidate metadata on existing stores. Currently, the main performance problem is the key listing.
This is good.
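For reference, the consolidated-metadata workflow with zarr v2's public API (the store path is a placeholder):

```python
import zarr

store = zarr.DirectoryStore('example.zarr')  # placeholder path
root = zarr.group(store=store)               # some existing hierarchy
root.create_dataset('x', shape=(10,), dtype='f8')

zarr.consolidate_metadata(store)      # gathers all metadata into one '.zmetadata' key
root = zarr.open_consolidated(store)  # later opens read '.zmetadata' only, no listing
```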
OK, got it. Yes, you could do listdir to get the directory layout (without listing the bottom-level directories, which may contain many files), collect all the .z* files, and then fetch them all in a single call to `getitems`.
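A sketch of that recipe (hypothetical URL; assumes an fsspec mapper with `getitems`):

```python
# Walk group directories only, never listing array directories (which may
# hold millions of chunk files), then batch-fetch all metadata keys.
import fsspec

m = fsspec.get_mapper('gcs://my-bucket/store.zarr')
fs, root = m.fs, m.root

meta_keys = ['.zgroup', '.zattrs']  # root candidates; missing ones are dropped below
todo = ['']                         # group prefixes still to scan ('' = root group)
while todo:
    prefix = todo.pop()
    for child in fs.ls(f'{root}/{prefix}'.rstrip('/'), detail=True):
        if child['type'] != 'directory':
            continue
        tail = child['name'].rstrip('/').rsplit('/', 1)[-1]
        if fs.exists(f"{child['name']}/.zarray"):
            # Array directory: take its metadata keys directly, don't list it.
            meta_keys += [f'{prefix}{tail}/.zarray', f'{prefix}{tail}/.zattrs']
        else:
            # Sub-group: collect its metadata keys and scan it too.
            meta_keys += [f'{prefix}{tail}/.zgroup', f'{prefix}{tail}/.zattrs']
            todo.append(f'{prefix}{tail}/')

metadata = m.getitems(meta_keys, on_error='omit')  # one batched fetch
```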
Closing this now that we've implemented asynchronous metadata reading in the …
Coming from the xarray world, one of the main difficulties with working with zarr in the cloud is the slow read-times of the metadata. `consolidate_metadata` is a good solution, but it does take a long time to run. It would be straightforward to add some threading here: `zarr-python/zarr/convenience.py`, line 1121 at commit 27f0726.
It might also be nice to use this threaded metadata look-up without mutating the underlying store, which is problematic for clients without write privileges.
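A hedged sketch of what that could look like (not zarr's actual implementation; the worker count is arbitrary, and the key listing itself remains serial):

```python
# Thread the metadata fetches and return the result as a plain dict
# instead of writing it back into the store, so read-only clients work.
from concurrent.futures import ThreadPoolExecutor

def consolidate_threaded(store, max_workers=32):
    # .zgroup/.zarray/.zattrs sit at the end of each metadata key path.
    keys = [k for k in store if k.rsplit('/', 1)[-1].startswith('.z')]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = pool.map(store.__getitem__, keys)
    return dict(zip(keys, values))  # the underlying store is not mutated
```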