
Consolidate zarr metadata into single key #268

Merged · 39 commits · Nov 14, 2018

Conversation

@martindurant martindurant commented Jun 26, 2018

A simple possible way of scanning all the metadata keys ('.zgroup'...) in a dataset and copying them into a single key, so that on systems where there is a substantial overhead to reading small files, everything can be grabbed in a single read. This is important in the context of xarray, which traverses all groups when opening a dataset to find the various sub-groups and arrays.

The test shows how you could use the generated key. We could contemplate automatically looking for the metadata key when opening.
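Sketched in code, the scan-and-copy step might look roughly like this (the function name and the '.zmetadata' key are illustrative here, not the final API):

```python
import json

def consolidate_metadata(store, key='.zmetadata'):
    """Copy every zarr metadata object ('.zgroup', '.zarray', '.zattrs')
    found in the store into a single JSON document under ``key``."""
    def is_zarr_key(k):
        return (k.endswith('.zarray') or k.endswith('.zgroup')
                or k.endswith('.zattrs'))

    # one small read per metadata object -- exactly the per-key cost
    # that readers of the consolidated key get to skip later
    metadata = {k: json.loads(store[k]) for k in store if is_zarr_key(k)}
    store[key] = json.dumps(metadata, indent=4).encode('utf-8')
    return metadata
```

A reader can then load the single consolidated key and resolve all group/array metadata from the resulting dict, with one read instead of one per object.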

REF: pangeo-data/pangeo#309

TODO:

  • Add unit tests and/or doctests in docstrings
  • Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or
    pytest -v --doctest-modules zarr)
  • Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or
    pytest -v zarr)
  • PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@alimanfoo (Member)

Nice proof of concept, simple and elegant.

@jakirkham (Member)

A couple of initial questions that come to mind:

  • What happens if the metadata changes?
  • What if multiple pieces of metadata change at the same time?

@martindurant (Member, Author)

@jakirkham, there is no mechanism here to update the consolidated metadata; it would need to be rescanned from the rest of the metadata files and rewritten. But the use case here is meant to be write-once only. Of course, my approach is very simple.

@alimanfoo (Member)

Yes, the use case is for static data: once you know it's complete, you consolidate all metadata into a single file.

Two small thoughts...

I had wondered about compressing the consolidated metadata file, but then thought that in a cloud setting this is unlikely to make a difference unless the total size of the consolidated metadata is above 10 MB, which is unlikely unless people are cramming lots into user attributes. The typical size of a .zarray file is ~400 bytes.

Ultimately we'd need to think of some way to prevent users from attempting to modify a group when using consolidated metadata. Under the current design, further modifications would be permitted, because the consolidated metadata has been read into a dict which allows modification, but those changes would obviously not get written back to the metadata file.
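One possible shape for such a guard (a sketch only, with assumed names; this is not the PR's actual implementation) is a mapping wrapper that serves reads from the parsed consolidated dict and refuses any mutation:

```python
from collections.abc import MutableMapping

class ReadOnlyConsolidatedStore(MutableMapping):
    """Hypothetical read-only view over consolidated metadata."""

    def __init__(self, meta):
        self._meta = dict(meta)  # parsed from the consolidated key

    def __getitem__(self, key):
        return self._meta[key]

    def __setitem__(self, key, value):
        raise PermissionError('consolidated metadata is read-only')

    def __delitem__(self, key):
        raise PermissionError('consolidated metadata is read-only')

    def __iter__(self):
        return iter(self._meta)

    def __len__(self):
        return len(self._meta)
```

Because zarr stores only need to be MutableMapping-like, a wrapper of this shape slots in wherever a plain dict store would.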

@martindurant (Member, Author)

There seem to be several possibilities; here are some thoughts.
It could be argued that implicitly loading from a metadata file should instantiate a dict store that doesn't allow mutation at all (but data would still be writable), and that there would be a flag at load time on whether to consider consolidated metadata or not. Or it could be a metadata store that needs to have sync called, to persist both itself and update the individual files.
Changes to the underlying directory structures would not, as things stand, get written to the consolidated store, and would be out of sync until consolidate is explicitly called.
The presence of consolidated metadata could be used to indicate that the whole dataset hierarchy is read-only. Maybe then, opening for writing would only be possible with some force parameter that deletes the consolidated metadata up front.

@jakirkham (Member)

Having some read-only option(s) at some level(s) makes sense.

Knowing a bit more about where you plan to use this would be helpful.

@alimanfoo (Member)

@jakirkham the original motivation for this comes from pangeo-data/pangeo#309. I think this is likely to be a common issue for other pangeo use cases, where existing data are converted to zarr format on some local system then uploaded to cloud object storage for broader (read-only) consumption. It affects pangeo particularly because they use xarray, and xarray needs to read in all metadata for a given hierarchy up front. The latency in listing directories and reading all metadata files from cloud object storage is causing considerable delays.

@alimanfoo (Member)

...so the proposed solution is that data are converted to zarr on a local system as before, then an additional command is run to consolidate all metadata into a single file, then everything is uploaded into cloud storage, then pangeo users somehow configure their code to read metadata from the consolidated metadata object to speedup opening a dataset via xarray.

@alimanfoo (Member)

@martindurant thanks for the thoughts. I don't have a concrete suggestion at the moment, but as we discuss options I think it could be useful to have in mind one of the design goals for zarr, which is that in general everything should work well in a setting where multiple concurrent threads or processes may be making modifications simultaneously.

I think this is basically the point @jakirkham was making when he asked "What if multiple pieces of metadata change at the same time?" I think the answer is, when using consolidated metadata, we raise some kind of exception on any attempt to make metadata changes.

@jakirkham (Member)

Thanks for the context, @alimanfoo. Will think about this a bit more.

@martindurant (Member, Author)

@jakirkham , eager to hear your thoughts. This kind of metadata shortcut could be put on the read-only path only, I suppose, or explicitly opt-in.
It'd be most convenient from zarr's point of view if this can be something optional, but it'd be most convenient from xarray/user's point of view if it's automatic and no extra code elsewhere is needed.

(Commit: "Again, this is for example only, not intended final structure")
@martindurant (Member, Author)

After some time has passed, the conversation here has run dry. A brief summary.

The situation remains that it would be convenient to be able to store zarr metadata in a single top-level entity within a directory structure, to avoid expensive metadata lookups when investigating a zarr group's structure; this is a problem for xarray data on cloud services. The scenario here is the write-once, read-many situation, although the prospect of having to re-sync metadata following changes to the data structure is one to consider.

In the ongoing conversations around conventions and metadata, I feel there is a wish to make any changes optional and therefore backwards-compatible. An extra file, as in this WIP, would work, but feels very ad hoc. Adding to the attrs would work very similarly, but the metadata of group contents doesn't feel like an attr. Adding something to the .zgroup would break compatibility. None of these by themselves would solve the sync problem.

The sync problem can be partially solved by simple means, such as: checking for and reading from consolidated metadata only happens when opening read-only, and opening in any write mode deletes the consolidated metadata. This does not prevent changes to lower levels in the hierarchy, though, since zarr can access them directly; xarray cannot do that, so there is an argument that this logic belongs in xarray.

@alimanfoo (Member)

alimanfoo commented Jul 30, 2018 via email

@martindurant (Member, Author)

Are you suggesting that the new Consolidated class should live in zarr, and that calling an open function with some keyword would activate its usage? I think that makes sense.
I wouldn't like to go the other way and have to force external libraries to choose how they will do their mapping, or have something like the gcsfs mapper subclass from zarr to get the right behaviour (currently they only need to be MutableMappings, or any dict). I'm still not sure any of this solves the sync problem, because you can always open a directory without the consolidated class in the way, or simply point to a lower level in the structure. Caveat emptor, I suppose... we don't want to be putting datestamps or checksums in here!

@alimanfoo (Member)

I haven't worked this fully through yet. But I was thinking something like the following (naming of new functions, classes and arguments subject to discussion) ...

User A wants to use zarr in the cloud in a write-once-read-many style. They are not using xarray. First they create the data in the cloud, e.g.:

base_store = gcsfs.GCSMap(...)
root = zarr.group(store=base_store)
# create sub-groups and arrays under root, put data into arrays, etc.

When they're finished writing to root, they consolidate the metadata with an explicit call, e.g.:

zarr.consolidate_metadata(base_store, key='.zmetadata')

Later, when they want to read the data, they do e.g.:

base_store = gcsfs.GCSMap(...)
store = zarr.StoreWithConsolidatedMetadata(base_store, key='.zmetadata')
root = zarr.Group(store=store)
baz = root['foo/bar/baz']
# read data etc.

In practice the key='.zmetadata' argument could be omitted by the user because it is the default, but showing here to be explicit.

The outer store could be read-only for simplicity, or could allow writes which are passed through to the underlying base_store, but without updating the consolidated metadata (which would require another explicit call to consolidate_metadata()).

User B is using xarray, and is copying data from some NetCDF4 files into zarr on local disk first, then copying files up to the cloud, then using xarray to read the data. E.g., first copy from NetCDF4 to zarr locally:

root = xarray.open_dataset('/local/path/to/data.nc')
root.to_zarr('/local/path/to/data.zarr', consolidate=True, metadata_key='.zmetadata')

...then copy files up to GCS, then to read from GCS do:

store = gcsfs.GCSMap(...)
root = xarray.open_zarr(store, consolidated=True, metadata_key='.zmetadata')

Again the metadata_key='.zmetadata' could be omitted because it is default, but showing for completeness.

There's probably also a use case to account for involving users making dask API calls via from_zarr() and to_zarr(); I haven't thought that through yet.

What do you think about the basic approach?

@alimanfoo (Member)

On second thoughts, what if the zarr public API is just like this. One function to explicitly consolidate metadata:

zarr.consolidate_metadata(store=base_store, key='.zmetadata')

...and one function to open a group with consolidated metadata:

root = zarr.open_consolidated(store=base_store, key='.zmetadata')

All other details of how consolidation is handled are hidden, i.e., not part of the public API.

Is root (return value of zarr.open_consolidated()) read-only? I suggest no. I.e., changes can be made, including modifying data in arrays, and creating new groups and arrays. All changes are written through to the base store. However, changes to metadata (e.g., creating new groups and arrays) require an explicit call to zarr.consolidate_metadata() to update the consolidated metadata.
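The read/write split proposed here could be sketched as follows (class and attribute names are assumed for illustration, not the actual implementation): metadata reads come from the consolidated copy, while writes pass straight through to the base store and the consolidated copy goes stale until consolidate_metadata() is re-run.

```python
class ConsolidatedView:
    """Sketch of the proposed semantics, not the merged class."""

    def __init__(self, base, consolidated):
        self.base = base                  # e.g. a gcsfs.GCSMap
        self.consolidated = consolidated  # dict parsed from '.zmetadata'

    def __getitem__(self, key):
        # metadata keys resolve with zero storage round-trips
        if key in self.consolidated:
            return self.consolidated[key]
        return self.base[key]  # chunk data still comes from storage

    def __setitem__(self, key, value):
        # written through to the base store; the consolidated copy is
        # NOT refreshed, hence the explicit consolidate_metadata() call
        self.base[key] = value
```

This makes the trade-off visible: reads are fast and cheap, but any metadata change is invisible through the view until the user explicitly re-consolidates.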

@martindurant (Member, Author)

martindurant commented Aug 1, 2018

Perhaps even simpler?

root = zarr.open(store=base_store, consolidated_key='.zmetadata')

Then again, if a change is required in xarray (and elsewhere) to use the consolidated store, then we could just as well have the separate function. However, we would want some way to say "use consolidated if available", and I'm assuming you wouldn't want to pile extra keywords into the base open function. Some of my code has suffered from this (fastparquet.write).

For the implementation as far as the wrapper is concerned and the read-only question, I think I agree with you.

@martindurant (Member, Author)

@alimanfoo, I implemented your suggested over-layer class. It is optional.
There could still be a flag in open() to attempt to use it, or it could be left to external libraries like xarray to opt in.

@martindurant (Member, Author)

In addition, I could imagine enabling writing in the class: start with an empty dict if the metadata key doesn't exist yet, have metadata writes affect both that dict and the backend store, and provide some "flush" method to write the current state of the metadata dict. Then maybe you wouldn't need to call the consolidate function explicitly.

@alimanfoo (Member)

Thanks @martindurant for moving this forward. Unfortunately I'm offline now for 3 weeks and have run out of time to give any feedback, but hopefully others can comment, and I'll be very happy to push on getting this into master when I'm back.

@alimanfoo (Member)

OK, this looks all good to me. Any objections to merging?

@jakirkham (Member)

Not from me.

It would be nice if at least one of the big users of this functionality gave it a quick look. Perhaps @rabernat or @jhamman?

@martindurant (Member, Author)

green!

@mrocklin (Contributor)

mrocklin commented Nov 2, 2018

cc @jacobtomlinson as well

@rabernat (Contributor)

rabernat commented Nov 2, 2018

I just tried consolidate_metadata on one of my existing stores and it appeared to work. Then I tried open_consolidated and it also worked.

The one weird thing is that the .zmetadata file was encoded with escaped newline characters (\n), so that it is all one single, really long line. This makes it hard to view and edit with a text editor. It did not affect the actual functionality, but I feel it should be fixed to preserve human readability of the metadata.
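For illustration, this kind of escaping typically arises when each metadata document is embedded in the outer JSON as a raw string rather than as a parsed, nested object (a minimal reproduction of the symptom, not the PR's code):

```python
import json

zgroup_text = '{\n    "zarr_format": 2\n}'

# embedding the raw text as a JSON string value escapes every newline,
# producing one very long line:
flat = json.dumps({'.zgroup': zgroup_text}, indent=4)
assert '\\n' in flat

# parsing each document first nests it as a real JSON object, which
# pretty-prints with real newlines:
nested = json.dumps({'.zgroup': json.loads(zgroup_text)}, indent=4)
assert '\\n' not in nested
```

Parsing before dumping is essentially the readability fix described in the follow-up comment below by example output, though the exact implementation in the commit may differ.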

@alimanfoo (Member)

alimanfoo commented Nov 2, 2018 via email

@alimanfoo (Member)

Thinking a bit more about @rabernat's comment, I realised there was a fairly straightforward way to work around the technical issues I mentioned above and implement a format for the consolidated metadata that's a bit easier to read/edit. I've pushed the changes in commit 9c0c621 but am very happy to discuss and revert if anything looks off. Here's an example of the new format:

>>> import zarr
>>> store = dict()
>>> z = zarr.group(store)
>>> z.create_group('g1')
<zarr.hierarchy.Group '/g1'>
>>> g2 = z.create_group('g2')
>>> g2.attrs['hello'] = 'world'
>>> arr = g2.create_dataset('arr', shape=(20, 20), chunks=(5, 5), dtype='f8')
>>> arr.attrs['data'] = 1
>>> arr[:] = 1.0
>>> zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
>>> print(store['.zmetadata'].decode())
{
    "metadata": {
        ".zgroup": {
            "zarr_format": 2
        },
        "g1/.zgroup": {
            "zarr_format": 2
        },
        "g2/.zattrs": {
            "hello": "world"
        },
        "g2/.zgroup": {
            "zarr_format": 2
        },
        "g2/arr/.zarray": {
            "chunks": [
                5,
                5
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": 0.0,
            "filters": null,
            "order": "C",
            "shape": [
                20,
                20
            ],
            "zarr_format": 2
        },
        "g2/arr/.zattrs": {
            "data": 1
        }
    },
    "zarr_consolidated_format": 1
}

@martindurant (Member, Author)

I'm in favour, it looks nice.

    return (key.endswith('.zarray') or key.endswith('.zgroup') or
            key.endswith('.zattrs'))

# out = {key: store[key].decode() for key in store if is_zarr_key(key)}
Review comment (Member):

Should we drop this line?

Reply (Member):

Yep, thanks for the catch.

@rabernat (Contributor)

lgtm?

@alimanfoo (Member)

Alright, this one is going in!

@alimanfoo alimanfoo merged commit d193a78 into zarr-developers:master Nov 14, 2018
@alimanfoo alimanfoo added this to the v2.3 milestone Nov 14, 2018
@alimanfoo alimanfoo mentioned this pull request Nov 14, 2018
@rabernat (Contributor)

Does anyone have an example of using this new feature with xarray? I'm not able to get it to work.

What I'm doing (not sure if this is the intended usage):

import xarray as xr
import gcsfs
import zarr
path = 'pangeo-data/newman-met-ensemble'
store = gcsfs.GCSMap(path)
zs_cons = zarr.storage.ConsolidatedMetadataStore(store)
ds_orig = xr.open_zarr(store)
ds_cons = xr.open_zarr(zs_cons, decode_times=False)

The array data in the consolidated metadata is mangled compared to the original. Also, possibly related, store['time/.zarray'] returns bytes while zs_cons['time/.zarray'] returns a dict.

What is the recommended way to open my newly consolidated store from xarray?

@alimanfoo (Member)

I think xarray.open_zarr() would need some minor modification to support consolidated metadata. There are at least a few options I could see...

(1) Low level solution. Add support for a chunk_store argument in xarray.open_zarr(). Then user could do:

store = gcsfs.GCSMap(path)
cons = zarr.storage.ConsolidatedMetadataStore(store)
ds = xr.open_zarr(store=cons, chunk_store=store)

(2) Higher-level solution. Add a consolidated=False argument to xarray.open_zarr(). If False get current behaviour, delegate to zarr.open_group(). If True delegate to zarr.open_consolidated(). So then user would do:

store = gcsfs.GCSMap(path)
ds = xr.open_zarr(store=store, consolidated=True)

(3) Auto-detect. Don't change the xarray.open_zarr() signature. Instead, look for the presence of a '.zmetadata' key in the store; if present, delegate to zarr.open_consolidated(). No change to current user code.

Also not sure how this all interacts with possibility to add support for consolidated metadata in intake, @martindurant?
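Option (3) amounts to a one-line dispatch. A sketch with the open functions injected as parameters so it stays self-contained (in reality the hook would live inside xarray.open_zarr() and call zarr directly):

```python
def open_zarr_autodetect(store, open_consolidated, open_group,
                         key='.zmetadata'):
    """Hypothetical dispatch: use consolidated metadata when the
    consolidated key is present, otherwise fall back to a normal open."""
    if key in store:
        return open_consolidated(store)
    return open_group(store)
```

The appeal of this option is zero user-facing API change; the downside, as discussed, is that the user loses explicit control over which path is taken.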

@martindurant (Member, Author)

I prefer scenario (2), where it is the user's choice (or an argument in an intake catalog), since this is still an experimental feature, but it requires no extra lines of user code.

@rabernat (Contributor)

See pydata/xarray#2558

@rabernat (Contributor)

FYI, the consolidated API for xarray is being discussed here: pydata/xarray#2559 (comment)

Would welcome input.

@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
        if config is None:
            self._compressor = None
        else:
            # temporary workaround for
            # https://github.com/zarr-developers/numcodecs/issues/78
            config = dict(config)
Review comment (Member):

Reverting in PR ( #361 ) as this was fixed in Numcodecs 0.6.0 with PR ( zarr-developers/numcodecs#79 ). As we now require Numcodecs 0.6.0+ in Zarr, we get the fix and thus no longer need the workaround.
