Consolidate zarr metadata into single key #268
Conversation
Nice proof of concept, simple and elegant.
A couple of initial questions come to mind:
@jakirkham, there is no mechanism here to update the consolidated metadata; it would need to be rescanned from the rest of the metadata files and rewritten - but the use case here is meant for write-once only. Of course, my approach is very simple.
Yes, the use case is for static data: once you know it's complete, then you consolidate all metadata into a single file. Two small thoughts...

I had wondered about compressing the consolidated metadata file, but then thought in a cloud setting this is unlikely to make a difference unless the total size of consolidated metadata is above 10 MB, which is unlikely unless people are cramming lots into user attributes. Typical size of a .zarray file is ~400 bytes.

Ultimately we'd need to think of some way that users are prevented from attempting to modify a group when using consolidated metadata. Under the current design further modifications would be permitted, because the consolidated metadata has been read into a dict which allows modification, but these would obviously not get written back to the metadata file.
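As a rough sanity check on that 10 MB figure (simple arithmetic only, using the ~400-byte estimate quoted above):

```python
# back-of-envelope: how many ~400-byte metadata documents fit in 10 MB?
typical_doc_size = 400        # bytes, the typical .zarray size quoted above
threshold = 10 * 1024 * 1024  # 10 MB, the point where compression might pay off

docs = threshold // typical_doc_size
print(docs)  # 26214, i.e. ~26k metadata documents before compression matters
```

So unless a hierarchy contains tens of thousands of arrays and groups (or very large user attributes), compressing the consolidated object is unlikely to be worth it.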
There seem to be several possibilities; here are some thoughts.
Having some read-only option(s) at some level(s) makes sense. Knowing a bit more about where you plan to use this would be helpful. |
@jakirkham the original motivation for this comes from pangeo-data/pangeo#309. I think this is likely to be a common issue for other pangeo use cases, where existing data are converted to zarr format on some local system then uploaded to cloud object storage for broader (read-only) consumption. It affects pangeo particularly because they use xarray, and xarray needs to read in all metadata for a given hierarchy up front. The latency in listing directories and reading all metadata files from cloud object storage is causing considerable delays.
...so the proposed solution is that data are converted to zarr on a local system as before, then an additional command is run to consolidate all metadata into a single file, then everything is uploaded into cloud storage, then pangeo users somehow configure their code to read metadata from the consolidated metadata object to speed up opening a dataset via xarray.
@martindurant thanks for the thoughts. I don't have a concrete suggestion at the moment, but as we discuss options I think it could be useful to have in mind one of the design goals for zarr, which is that in general everything should work well in a setting where multiple concurrent threads or processes may be making modifications simultaneously. I think this is basically the point @jakirkham was making when he asked "What if multiple pieces of metadata change at the same time?" I think the answer is, when using consolidated metadata, we raise some kind of exception on any attempt to make metadata changes.
Thanks for the context, @alimanfoo. Will think about this a bit more.
@jakirkham, eager to hear your thoughts. This kind of metadata shortcut could be put on the read-only path only, I suppose, or explicitly opt-in.
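One way to picture the read-only opt-in discussed here: a thin wrapper store that serves reads but refuses any modification. This is purely illustrative (the class name and the use of `PermissionError` are my assumptions, not zarr's API):

```python
from collections.abc import MutableMapping


class ReadOnlyStore(MutableMapping):
    """Illustrative sketch: delegate reads to an underlying store,
    raise on any attempt to modify it."""

    def __init__(self, store):
        self._store = store

    def __getitem__(self, key):
        return self._store[key]

    def __setitem__(self, key, value):
        raise PermissionError('store is read-only under consolidated metadata')

    def __delitem__(self, key):
        raise PermissionError('store is read-only under consolidated metadata')

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)
```

Any attempt to write through such a wrapper fails loudly, which is the behaviour suggested above for groups opened against consolidated metadata.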
Again, this is for example only, not the intended final structure.
After some time has passed, the conversation here has run dry. A brief summary.

The situation remains that it would be convenient to be able to store zarr metadata in a single top-level entity within a directory structure, to avoid expensive metadata lookups when investigating a zarr group's structure - a problem for xarray data on cloud services. The scenario here is the write-once, read-many situation, although the prospect of having to re-sync metadata following changes to the data structure is one to consider.

In the ongoing conversations around conventions and metadata, I feel there is a wish to make any changes optional and so compatible. An extra file, as in this WIP, would work, but feels very ad hoc. Adding to the attrs would work very similarly, but the metadata of group contents doesn't feel like an attr. Adding something to the .zgroup would break compatibility. None of these by themselves would solve the sync problem.

The sync problem can be partially solved by simple means, such as: checking for and reading from consolidated metadata can only happen when read-only, and opening in any write mode deletes the metadata. This does not prevent changes to lower levels in the hierarchy, though, since zarr can access them directly; xarray cannot do that, and so there is an argument that this logic belongs in xarray.
Thanks Martin. FWIW I think there are or will be people wanting to use zarr in the cloud but not via xarray, so something to consider. (E.g., we've just today got our own pangeo-but-for-malaria-genomics up on GKE; we use zarr and dask but not currently xarray and expect to hit the metadata issue at some point.)
What about something like the following...

Zarr implements a function to consolidate metadata and store it, pretty much just as you have implemented. E.g., calling:

    zarr.consolidate_metadata(store, key='.zmetadata', path=None)

...will consolidate all zarr metadata found in store, optionally under path, and put the consolidated metadata back into the store under the given key.

Zarr then implements a store class that understands consolidated metadata. E.g.:

    base_store = zarr.DirectoryStore('/path/to/data')  # or could be any underlying mapping class
    store = zarr.StoreWithConsolidatedMetadata(base_store, key='.zmetadata', path=None)

(Class name is obviously horribly too long, but just a placeholder for the moment.)

...then uses this to open a group, e.g.:

    root = zarr.Group(store=store)

I.e., all the logic of handling the consolidated metadata is encapsulated within the StoreWithConsolidatedMetadata class. Internally it could load the consolidated metadata, then implement some kind of fall-back whereby keys are first looked up in consolidated metadata, but if not found are then looked up in the underlying base store.

If a package like xarray wants to make this even easier for the user, they could implement some check for the presence of the .zmetadata key and do this setup for the user. But the basic functionality is available without xarray.
Also just to note, IMO this solution does not require any change to the storage spec, as the storage spec only requires that a key/value (i.e., mapping) interface is presented to zarr. The details of how keys and values are stored behind the mapping interface are entirely up to the implementation. I.e., if using a file system, keys and values do not have to correspond to file names and file contents. Similarly for cloud storage.
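The fall-back lookup described above might be sketched like this, reusing the placeholder class name; the internals (JSON layout, error handling) are assumptions for illustration, not the PR's actual code:

```python
import json
from collections.abc import Mapping


class StoreWithConsolidatedMetadata(Mapping):
    """Sketch: serve metadata keys from the consolidated object loaded
    up front; fall back to the base store for anything else (e.g. chunks)."""

    def __init__(self, base_store, key='.zmetadata'):
        self._base = base_store
        # assume the consolidated object maps metadata keys to their contents
        self._meta = json.loads(base_store[key])

    def __getitem__(self, key):
        try:
            return self._meta[key]   # fast path: no request to the base store
        except KeyError:
            return self._base[key]   # fall back, e.g. for chunk keys

    def __iter__(self):
        yield from self._meta
        for k in self._base:
            if k not in self._meta:
                yield k

    def __len__(self):
        return sum(1 for _ in self)
```

A group opened via `zarr.Group(store=...)` over such a wrapper would then read all metadata with a single request to the underlying storage.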
Are you suggesting that the new Consolidated class should live in zarr, and that calling an open function with some keyword would activate its usage? I think that makes sense.
I haven't worked this fully through yet, but I was thinking something like the following (naming of new functions, classes and arguments subject to discussion)...

User A wants to use zarr in the cloud in a write-once-read-many style. They are not using xarray. First they create the data in the cloud, e.g.:

    base_store = gcsfs.GCSMap(...)
    root = zarr.group(store=base_store)
    # create sub-groups and arrays under root, put data into arrays, etc.

When they're finished writing to root, they consolidate the metadata with an explicit call, e.g.:

    zarr.consolidate_metadata(base_store, key='.zmetadata')

Later, when they want to read the data, they do e.g.: ...

User B is using xarray, and is copying data from some NetCDF4 files into zarr on local disk first, then copying files up to the cloud, then using xarray to read the data. E.g., first copy from NetCDF4 to zarr locally:

    root = xarray.open_dataset('/local/path/to/data.nc')
    root.to_zarr('/local/path/to/data.zarr', consolidate=True, metadata_key='.zmetadata')

...then copy files up to GCS, then to read from GCS do:

    store = gcsfs.GCSMap(...)
    root = xarray.open_zarr(store, consolidated=True, metadata_key='.zmetadata')

There's probably also a use-case to account for involving the user making dask API calls via from_zarr() and to_zarr(); haven't thought that through yet. What do you think about the basic approach?
On second thoughts, what if the zarr public API is just like this. One function to explicitly consolidate metadata:

    zarr.consolidate_metadata(store=base_store, key='.zmetadata')

...and one function to open a group with consolidated metadata:

    root = zarr.open_consolidated(store=base_store, key='.zmetadata')

All other details of how consolidation is handled are hidden, i.e., not part of the public API.
Perhaps even simpler?
Then again, if a change is required in xarray (and elsewhere) to use the consolidated store, then we could as well have the separate function. However, we would want some way to "use consolidated if available", and I'm assuming you wouldn't want to pile extra keywords into the base open function.

For the implementation, as far as the wrapper is concerned and the read-only question, I think I agree with you.
@alimanfoo, I implemented your suggested over-layer class. This is optional.
In addition, I could imagine enabling writing in the class, by starting with an empty dict if the metadata key doesn't exist yet, having metadata writes affect both that dict and the backend store, and having some "flush" method to write the current state of the metadata dict. Then, maybe you wouldn't need to call the consolidate function explicitly.
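That write-through idea might look roughly like this; an illustrative sketch only, with invented class and method names, not code from this PR:

```python
import json


class WriteThroughMetadata:
    """Sketch: metadata writes hit both an in-memory dict and the backend
    store; flush() persists the consolidated view in one object."""

    def __init__(self, store, key='.zmetadata'):
        self._store = store
        self._key = key
        if key in store:
            self._meta = json.loads(store[key])
        else:
            self._meta = {}  # start empty if nothing consolidated yet

    def set_meta(self, key, value):
        # write-through: keep the dict and the backend in sync
        self._meta[key] = value
        self._store[key] = value

    def flush(self):
        # persist the current state of the metadata dict in one object
        self._store[self._key] = json.dumps(self._meta)
```

With something like this, consolidation happens as a side effect of flushing rather than as a separate explicit call.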
Thanks @martindurant for moving this forward. Unfortunately I'm offline now for 3 weeks and have run out of time to give any feedback, but hopefully others can comment, and I'll be very happy to push on getting this into master when I'm back.
OK, this looks all good to me. Any objections to merging?
green!
cc @jacobtomlinson as well
I just tried this out. The one weird thing is that the .zmetadata file was encoded with escaped newline characters \n, so that it is all one single really long line. This makes it hard to view and edit with a text editor. This did not affect the actual functionality, but I feel it should be fixed to preserve human readability of the metadata.
Really interesting you brought that up. I had thought the same, e.g., the consolidated metadata file could include the consolidated files as JSON objects, rather than as strings of escaped JSON (hope that makes sense, happy to clarify if not). The only reason I didn't immediately suggest that was that a quirk of the way the architecture is currently structured would mean that the JSON objects would need to go through an extra serialisation and deserialisation, which is somewhat anathema to the spirit of efficiency and might have a performance impact (although it could be negligible; I haven't measured it).

Bottom line: if you and others think it would be valuable to make the consolidated metadata file more human readable/editable - which is very much in the spirit of zarr data being very "hackable" - I'd be happy to unpack what I said above and explore ways of working around current architecture limitations.
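To make the trade-off concrete, here is a small self-contained comparison (illustrative only, not zarr code) of embedding child metadata as escaped JSON strings versus as parsed JSON objects:

```python
import json

# a child metadata document exactly as stored, with real newlines
zgroup = '{\n    "zarr_format": 2\n}'

# Variant 1: embed the document as a string -- its newlines are escaped
# to \n, so the value becomes one very long line in the output
as_string = json.dumps({'metadata': {'.zgroup': zgroup}}, indent=4)

# Variant 2: parse first and embed as a nested JSON object -- the output
# keeps real newlines and indentation (human readable/editable), at the
# cost of one extra deserialisation/serialisation round trip
as_object = json.dumps({'metadata': {'.zgroup': json.loads(zgroup)}}, indent=4)

assert '\\n' in as_string       # escape sequences present in variant 1
assert '\\n' not in as_object   # nested object stays readable
```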
Thinking a bit more about @rabernat's comment, I realised there was a fairly straightforward way to work around the technical issues I mentioned above and implement a format for the consolidated metadata that's a bit easier to read/edit. I've pushed the changes in commit 9c0c621 but am very happy to discuss and revert if anything looks off. Here's an example of the new format:

>>> import zarr
>>> store = dict()
>>> z = zarr.group(store)
>>> z.create_group('g1')
<zarr.hierarchy.Group '/g1'>
>>> g2 = z.create_group('g2')
>>> g2.attrs['hello'] = 'world'
>>> arr = g2.create_dataset('arr', shape=(20, 20), chunks=(5, 5), dtype='f8')
>>> arr.attrs['data'] = 1
>>> arr[:] = 1.0
>>> zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
>>> print(store['.zmetadata'].decode())
{
"metadata": {
".zgroup": {
"zarr_format": 2
},
"g1/.zgroup": {
"zarr_format": 2
},
"g2/.zattrs": {
"hello": "world"
},
"g2/.zgroup": {
"zarr_format": 2
},
"g2/arr/.zarray": {
"chunks": [
5,
5
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<f8",
"fill_value": 0.0,
"filters": null,
"order": "C",
"shape": [
20,
20
],
"zarr_format": 2
},
"g2/arr/.zattrs": {
"data": 1
}
},
"zarr_consolidated_format": 1
}
I'm in favour, it looks nice.
zarr/convenience.py (outdated):

    return (key.endswith('.zarray') or key.endswith('.zgroup') or
            key.endswith('.zattrs'))

    # out = {key: store[key].decode() for key in store if is_zarr_key(key)}
Should we drop this line?
Yep, thanks for the catch.
lgtm?
Alright, this one is going in! |
Does anyone have an example of using this new feature with xarray? I'm not able to get it to work. What I'm doing (not sure if this is the intended usage):

    import xarray as xr
    import gcsfs
    import zarr

    path = 'pangeo-data/newman-met-ensemble'
    store = gcsfs.GCSMap(path)
    zs_cons = zarr.storage.ConsolidatedMetadataStore(store)
    ds_orig = xr.open_zarr(store)
    ds_cons = xr.open_zarr(zs_cons, decode_times=False)

The array data in the consolidated metadata is mangled compared to the original. Also, possibly related: what is the recommended way to open my newly consolidated store from xarray?
I think there are a few possible ways to wire this up in xarray:

(1) Low-level solution. Add support for a ...

(2) Higher-level solution. Add a ...

(3) Auto-detect. Don't change ...

Also not sure how this all interacts with the possibility to add support for consolidated metadata in intake, @martindurant?
I prefer scenario (2), where it is the user's choice (or an argument in an intake catalog), since this is still an experimental feature, but it needs no extra lines of code.
FYI, the consolidated API for xarray is being discussed here: pydata/xarray#2559 (comment). Would welcome input.
@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
        if config is None:
            self._compressor = None
        else:
            # temporary workaround for
            # https://github.com/zarr-developers/numcodecs/issues/78
            config = dict(config)
Reverting in PR ( #361 ) as this was fixed in Numcodecs 0.6.0 with PR ( zarr-developers/numcodecs#79 ). As we now require Numcodecs 0.6.0+ in Zarr, we get the fix and thus no longer need the workaround.
A simple possible way of scanning all the metadata keys ('.zgroup', ...) in a dataset and copying them into a single key, so that on systems where there is a substantial overhead to reading small files, everything can be grabbed in a single read. This is important in the context of xarray, which traverses all groups while opening the dataset, to find the various sub-groups and arrays.

The test shows how you could use the generated key. We could contemplate automatically looking for the metadata key when opening.
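The scan-and-copy step can be sketched as follows; the key predicate and output layout follow this PR's examples, while the function internals here are an illustrative assumption rather than the actual implementation:

```python
import json


def is_zarr_key(key):
    # the metadata keys gathered up by consolidation
    return key.endswith(('.zarray', '.zgroup', '.zattrs'))


def consolidate_metadata(store, key='.zmetadata'):
    """Copy every metadata document into one JSON object under `key`,
    so a single read retrieves metadata for the whole hierarchy."""
    meta = {k: json.loads(store[k]) for k in store if is_zarr_key(k)}
    store[key] = json.dumps(
        {'zarr_consolidated_format': 1, 'metadata': meta}, indent=4
    ).encode()
```

Chunk keys are skipped by the predicate, so only the small metadata documents are duplicated into the consolidated object.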
REF: pangeo-data/pangeo#309
TODO:

- Unit tests pass under Python 3.6 (tox -e py36 or pytest -v --doctest-modules zarr)
- Unit tests pass under Python 2.7 (tox -e py27 or pytest -v zarr)
- PEP8 checks pass (tox -e py36 or flake8 --max-line-length=100 zarr)
- Doctests in the tutorial pass (tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
- Docs build locally (tox -e docs)