Add `metadata_only` param to `.to_zarr`? #8343

Comments
Yes, this is a great idea!
+1, this is a really nice idea. Related to this could also be a write-through cache of sorts. For high-latency stores (e.g. S3), synchronously populating the store metadata can really add up. If we knew we were only writing metadata, we could safely populate all the Zarr json objects then send them in one bulk write step. The combination of these two features would be a lightning fast Zarr initialization routine 🚀
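As a sketch of what that write-through cache could look like (everything here is hypothetical: the class, its name, and the flush strategy; `setitems` is the bulk-write method fsspec mappers provide, with a per-key fallback for plain mappings):

```python
from collections.abc import MutableMapping


class MetadataWriteCache(MutableMapping):
    """Buffer writes locally, then flush them to a high-latency store in bulk."""

    def __init__(self, store):
        self.store = store  # the real store, e.g. an fsspec mapper over S3
        self.buffer = {}    # pending (metadata) writes

    def __setitem__(self, key, value):
        self.buffer[key] = value  # no network round-trip yet

    def __getitem__(self, key):
        try:
            return self.buffer[key]
        except KeyError:
            return self.store[key]

    def __delitem__(self, key):
        self.buffer.pop(key, None)
        del self.store[key]

    def __iter__(self):
        yield from set(self.buffer) | set(self.store)

    def __len__(self):
        return len(set(self.buffer) | set(self.store))

    def flush(self):
        # fsspec mappers have a concurrent bulk setitems(); fall back to
        # one-by-one writes for plain MutableMappings
        if hasattr(self.store, "setitems"):
            self.store.setitems(self.buffer)
        else:
            self.store.update(self.buffer)
        self.buffer.clear()
```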
I came across #8343 recently — this seems to be a similar suggestion to what I was intending. Is that correct?
Is anyone more familiar with whether there is a cost to producing the dask task graph? I'm seeing …
Max, I think you're right. These days dask has a "lazy graph" (HighLevelGraph) that gets lowered down to an old-style graph expressed as dicts. That lowering is still slow, and is potentially what's happening here.
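For anyone curious, that lowering cost is easy to see in isolation (a rough sketch; the shape and chunking are only there to create many tasks):

```python
import time
import dask.array as da

# 1,000 x 1,000 chunks => one million tasks
arr = da.zeros((100_000, 100_000), chunks=(100, 100))

hlg = arr.__dask_graph__()  # a HighLevelGraph; its layers are still lazy
t0 = time.perf_counter()
flat = dict(hlg)            # "lowering": materializes every task into a dict
print(len(flat), time.perf_counter() - t0)
```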
Yeah, here's the call to optimize in …
I really think we're just taking advantage of a side effect by using …
Yes, we could either do: …
How difficult do we think this is? Is it something I can bang out in 45 mins, or is it a bigger effort that requires more context?
FYI, for the moment, …
In xarray-beam we use `zeros_like`, which I believe works for any NumPy dtype.
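For reference, that template pattern looks roughly like this (a minimal sketch; the dataset, chunking, and `out.zarr` path are placeholders):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": (("x", "y"), np.ones((4, 6), dtype="float32"))})

# lazy all-zeros template with the same dims/dtypes/attrs as ds
template = xr.zeros_like(ds).chunk({"x": 2, "y": 3})

# writes metadata and indexes only; chunk writes come back as a Delayed
template.to_zarr("out.zarr", compute=False, mode="w")
```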
Could this just be running …
I'm playing around with this piece of code. Does it make sense? There's a fair bit of complexity around the `_FillValue` handling.

```python
import xarray as xr


def make_template(ds, *, encoding=None):
    """All-zeros template for ``ds``, honouring any ``_FillValue`` settings."""
    encoding = encoding or {}
    # fill values recorded on the variables themselves...
    fillvalues = {
        name: var.encoding["_FillValue"]
        for name, var in ds._variables.items()
        if var.encoding and "_FillValue" in var.encoding
    }
    # ...overridden by any fill values in the explicit ``encoding`` argument
    fillvalues.update(
        {
            name: enc["_FillValue"]
            for name, enc in encoding.items()
            if "_FillValue" in enc
        }
    )
    template = xr.zeros_like(ds)
    for var, fill in fillvalues.items():
        template[var] = xr.full_like(ds[var], fill)
    return template


def initialize_zarr(ds, store, *, region_dims=None, append_dim=None, **kwargs):
    if "compute" in kwargs:
        raise ValueError("The ``compute`` kwarg is not supported in `initialize_zarr`.")
    if kwargs.get("mode", "w") != "w":
        raise ValueError(
            f"Only mode='w' is allowed for initialize_zarr. Received {kwargs['mode']!r}"
        )

    template = make_template(ds, encoding=kwargs.get("encoding"))

    # TODO: handle `write_empty_chunks` in init_kwargs["encoding"]
    init_kwargs = kwargs.copy()
    init_kwargs.pop("write_empty_chunks", None)
    # initialize the store: writes metadata and indexes, but no chunk data
    template.to_zarr(store, compute=False, write_empty_chunks=False, **init_kwargs)

    if region_dims:
        # At this point, the store has been initialized (and potentially
        # overwritten), so switch to append mode and eagerly write every
        # variable that has no region dimension.
        kwargs.pop("mode", None)
        dropped = ds.drop_dims(region_dims)
        new_encoding = kwargs.pop("encoding", None)
        if new_encoding:
            new_encoding = {k: v for k, v in new_encoding.items() if k in dropped}
        dropped.to_zarr(store, **kwargs, encoding=new_encoding, compute=True, mode="a")

        # Can't use drop_dims for the return value, since that would also
        # remove variables that have a region dim alongside other dims.
        # Instead, drop only variables whose dims include no region dim at all.
        dims_to_drop = set(ds.dims) - set(region_dims)
        vars_to_drop = [
            name
            for name, var in ds._variables.items()
            if set(var.dims).issubset(dims_to_drop)
        ]
        return ds.drop_vars(vars_to_drop)

    elif append_dim:
        # TODO
        pass

    else:
        return ds


# given some dataset ``ds`` and zarr store ``store``:
encoding = {"a": {"_FillValue": -1}}
initialized = initialize_zarr(ds, store, region_dims="y", mode="w", encoding=encoding)
```
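If it helps to see the intended end-to-end flow: after `initialize_zarr` returns, each worker writes its own slab with `region` (a sketch; the slice is illustrative):

```python
# `initialized` only retains variables that touch a region dim,
# so each worker can safely write its own region of "y":
initialized.isel(y=slice(0, 10)).to_zarr(store, region={"y": slice(0, 10)})
```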
n00b question: why do we need the code around `_FillValue`? If there's a dataset with two dims, and they're both `region_dims`, … (very minor point: we also want to allow ….)
🤦🏾‍♂️ I was testing with a numpy dataset, so …
Yes. But the indexes for the dims get written during the `compute=False` call to `to_zarr`.
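A quick way to see that behaviour with a toy in-memory store (a sketch; the exact key names depend on the zarr format version):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"a": (("x",), np.zeros(4))}, coords={"x": [0, 1, 2, 3]}
).chunk({"x": 2})

store = {}
ds.to_zarr(store, compute=False, mode="w")  # returns a Delayed; "a" is not written
print([k for k in store if k.startswith("x/")])  # the "x" index is already there
```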
👍
I think the code is doing the right thing. A clearer way to ask my initial question was: "If there's a dataset with one array with two dims, and both dims are region_dims, does this drop the array?" It doesn't drop anything (…).
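To make that concrete, here's a tiny reproduction of the drop logic from the snippet above (the dataset is made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.zeros((2, 3))),  # touches region dim "y" -> kept
        "b": (("x",), np.zeros(2)),           # only non-region dims -> dropped
    }
)
region_dims = {"y"}

dims_to_drop = set(ds.dims) - set(region_dims)
vars_to_drop = [
    name
    for name, var in ds._variables.items()
    if set(var.dims).issubset(dims_to_drop)
]
print(vars_to_drop)  # ['b'] -- "a" survives even though it also has dim "x"
```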
I like it! I see there's a logic branch for `append_dim` …
Is your feature request related to a problem?
A leaf from #8245, which has a bullet: …
I've also noticed that for large arrays, running `to_zarr` with `compute=False` can take several minutes, despite the indexes being very small. I think this is because it's building a dask task graph, which is then discarded, since the array is written from different machines with the `region` pattern.

Describe the solution you'd like
Would introducing a `metadata_only` parameter to `to_zarr` help here: …

Describe alternatives you've considered
No response
Additional context
No response
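For concreteness, the flag proposed in the title might look like this in use (hypothetical API, not implemented; the parameter name comes from the issue title):

```python
# write only the zarr metadata objects and the (small) indexes,
# without building or executing a task graph for the chunk data
ds.to_zarr(store, mode="w", metadata_only=True)
```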