Transform Directory storage to Zip storage. #756

Closed
eddiecong opened this issue May 21, 2021 · 8 comments · Fixed by #763

@eddiecong

eddiecong commented May 21, 2021

store = zarr.ZipStore("/mnt/test.zip", "r")

Problem description

Hi, sorry to bother you. I found this statement in the official Zarr documentation about ZipStore:
"Alternatively, use a DirectoryStore when writing the data, then manually Zip the directory and use the Zip file for subsequent reads."
I am trying to convert a Zarr dataset from DirectoryStore format to a ZipStore. I used the zip command provided by Linux:
zip -r test.zip test.zarr
Here test.zarr is a DirectoryStore dataset containing three groups. However, when I try to open it with the code above, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eddie/miniconda3/envs/train/lib/python3.8/site-packages/zarr/storage.py", line 1445, in __init__
    self.zf = zipfile.ZipFile(path, mode=mode, compression=compression,
  File "/home/eddie/miniconda3/envs/train/lib/python3.8/zipfile.py", line 1190, in __init__
    _check_compression(compression)
  File "/home/eddie/miniconda3/envs/train/lib/python3.8/zipfile.py", line 686, in _check_compression
    raise NotImplementedError("That compression method is not supported")
NotImplementedError: That compression method is not supported

I wonder whether my compression method is wrong, and whether there is a workaround for converting directory storage to zip storage or some other database format. As the number of groups grows, the directory store contains a very large number of files and is inconvenient to transfer. Thanks in advance.

Version and installation information

  • Value of zarr.__version__: 2.8.1
  • Value of numcodecs.__version__: 0.7.3
  • Version of Python interpreter: 3.8.0
  • Operating system (Linux/Windows/Mac): linux ubuntu 18.04
  • How Zarr was installed: pip
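
A note on the traceback above: the exception is raised by `_check_compression(compression)` inside `ZipFile.__init__`, which runs before any archive content is read. Given the `ZipStore.__init__` signature quoted later in this thread, where `compression` is the second positional parameter, one plausible reading is that `zarr.ZipStore("/mnt/test.zip", "r")` binds `"r"` to `compression` rather than to `mode`. The sketch below reproduces that effect with the standard library only. `open_like_zipstore` is a hypothetical stand-in that copies the quoted parameter order; it is not zarr's code.

```python
import os
import tempfile
import zipfile


def open_like_zipstore(path, compression=zipfile.ZIP_STORED, allowZip64=True, mode="a"):
    """Hypothetical stand-in that mirrors the parameter order quoted in this thread."""
    return zipfile.ZipFile(path, mode=mode, compression=compression, allowZip64=allowZip64)


def demo():
    tmp = tempfile.mkdtemp()
    path = os.path.join(tmp, "test.zip")

    # Write a deflated archive, roughly what `zip -r` produces by default.
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(".zgroup", '{"zarr_format": 2}')

    # Positional "r" lands in `compression`, so zipfile rejects it up front.
    try:
        open_like_zipstore(path, "r")
        positional_error = None
    except NotImplementedError as exc:
        positional_error = str(exc)

    # Passing the mode by keyword leaves `compression` at its default.
    # The standard library uses `compression` only when writing, so the
    # deflated member can still be read here.
    with open_like_zipstore(path, mode="r") as zf:
        data = zf.read(".zgroup")

    return positional_error, data
```

If this reading is right, passing the mode by keyword, as in `zarr.ZipStore("/mnt/test.zip", mode="r")`, should avoid this particular error. It does not address the archive layout problem discussed further down.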
@pmav99
Contributor

pmav99 commented May 27, 2021

@eddiecong I've also tried to zip the directory and open it, and I couldn't figure it out either. I guess the docs are outdated or something. Nevertheless, you should be able to convert the existing store to the Zip format using something like rechunker.

@jakirkham
Member

The issue is that the zip archive was likely compressed, whereas ZipStore expects it to be uncompressed by default. I would either configure zip so that it doesn't use compression, or tell ZipStore how the compression was done.

@pmav99
Contributor

pmav99 commented May 28, 2021

@jakirkham is right. ZipStore does indeed default to zipfile.ZIP_STORED which means uncompressed:

def __init__(self, path, compression=zipfile.ZIP_STORED, allowZip64=True, mode='a',

I guess that we need to use zip -0 when creating the zip archive. Nevertheless, I still cannot figure it out:

import numpy as np
import pandas as pd
import xarray as xr
import zarr

# create a dataset
lon = np.arange(-180, 180)
lat = np.arange(-90, 91)
timestamps = pd.date_range("2001-01-01", "2001-12-31", name="time", freq="D")
ds = xr.Dataset(
    data_vars=dict(
        aaa=(
            ["lon", "lat", "time"],
            np.random.randint(0, 101, (len(lon), len(lat), len(timestamps))),
        )
    ),
    coords=dict(
        lon=lon,
        lat=lat,
        time=timestamps,
    ),
)

# store the dataset as zarr
ds.to_zarr("foo.zarr")

Now convert to a zip archive using:

zip -0 -r foo.zarr.zip foo.zarr/

And try to open the archive:

ds = xr.open_zarr(zarr.ZipStore("foo.zarr.zip"))

but this throws a GroupNotFoundError

---------------------------------------------------------------------------
GroupNotFoundError                        Traceback (most recent call last)
<ipython-input-3-9b24aa10d19f> in <module>
----> 1 xr.open_zarr(zarr.ZipStore("foo.zarr.zip"))

/scratch/mavropa/venv/lib/python3.8/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, decode_timedelta, use_cftime, **kwargs)
    673     }
    674 
--> 675     ds = open_dataset(
    676         filename_or_obj=store,
    677         group=group,

/scratch/mavropa/venv/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    570 
    571         opener = _get_backend_cls(engine)
--> 572         store = opener(filename_or_obj, **extra_kwargs, **backend_kwargs)
    573 
    574     with close_on_error(store):

/scratch/mavropa/venv/lib/python3.8/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, append_dim, write_region)
    294             zarr_group = zarr.open_consolidated(store, **open_kwargs)
    295         else:
--> 296             zarr_group = zarr.open_group(store, **open_kwargs)
    297         return cls(zarr_group, consolidate_on_close, append_dim, write_region)
    298 

/scratch/mavropa/venv/lib/python3.8/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1166             raise ContainsArrayError(path)
   1167         elif not contains_group(store, path=path):
-> 1168             raise GroupNotFoundError(path)
   1169 
   1170     elif mode == 'w':

GroupNotFoundError: group not found at path ''

@eddiecong
Author

Just got back, thanks so much. I believe the original error occurs because the archive was compressed while ZipStore expects it to be uncompressed by default, as @jakirkham said. Here is a suggestion for reference: [https://stackoverflow.com/questions/67635491/transform-zarr-directory-storage-to-zip-storage/67675357#67675357]. @pmav99, I wonder whether a relative-path problem is what causes the GroupNotFoundError.

@pmav99
Contributor

pmav99 commented May 31, 2021

@eddiecong Not really. I just tried with absolute paths and it still throws the same error. Could someone try to run the snippet from my previous comment, just to confirm that the issue exists?

@joshmoore
Member

@pmav99, I see the same error with your code above. If I change:

zip -0 -r foo.zarr.zip foo.zarr/

to

cd foo.zarr
zip -0 -r ../within.zarr.zip .

it works for me.
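
A likely reason the second command works: as far as I understand, a Zarr v2 store looks up the root group under the key `.zgroup`, and ZipStore keys correspond to archive member names. With the first command the metadata ends up under `foo.zarr/.zgroup`, so nothing is found at path `''`. The sketch below inspects an archive with the standard library and reports the prefix under which the root metadata sits. `find_store_root` is a hypothetical helper for illustration, not part of zarr.

```python
import zipfile


def find_store_root(zip_path):
    """Return the archive prefix holding Zarr v2 root metadata.

    Returns "" when `.zgroup` or `.zarray` sits at the archive root,
    the shortest containing prefix (e.g. "foo.zarr") when it is nested,
    and None when no such metadata file is present at all.
    """
    with zipfile.ZipFile(zip_path, mode="r") as zf:
        names = zf.namelist()

    prefixes = []
    for name in names:
        # Split "a/b/.zgroup" into prefix "a/b" and file name ".zgroup".
        head, _, tail = name.rpartition("/")
        if tail in (".zgroup", ".zarray"):
            prefixes.append(head)

    if not prefixes:
        return None
    # The shortest prefix is the one closest to the archive root.
    return min(prefixes, key=len)
```

An empty string means the archive has the layout that worked in this thread. A result such as `"foo.zarr"` means the outer directory was included when zipping.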

@pmav99
Contributor

pmav99 commented Jun 1, 2021

@joshmoore thank you. I confirm that your proposal does indeed work.

To make this clearer: if the zip archive contains the outer directory, ZipStore throws an exception. If the outer directory is omitted, it works just fine.

So this fails:

$ zip -r0 foo.zarr.zip foo.zarr   
$ unzip -l foo.zarr.zip | head 
Archive:  foo.zarr.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2021-06-01 15:39   foo.zarr/
        0  2021-06-01 15:39   foo.zarr/lat/
      316  2021-06-01 15:39   foo.zarr/lat/.zarray
      345  2021-06-01 15:39   foo.zarr/lat/0
       50  2021-06-01 15:39   foo.zarr/lat/.zattrs
       24  2021-06-01 15:39   foo.zarr/.zgroup
        0  2021-06-01 15:39   foo.zarr/aaa/

while this works:

$ cd foo.zarr
$ zip -r0 ../foo.zarr.zip ./
$ cd ../
$ unzip -l foo.zarr.zip | head
Archive:  foo.zarr.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2021-06-01 15:39   lat/
      316  2021-06-01 15:39   lat/.zarray
      345  2021-06-01 15:39   lat/0
       50  2021-06-01 15:39   lat/.zattrs
       24  2021-06-01 15:39   .zgroup
        0  2021-06-01 15:39   aaa/
   184600  2021-06-01 15:39   aaa/0.3.2

# ...

AFAIK there is no way to create a suitable zip archive using zip unless you use this trick of changing the CWD. Nevertheless, according to this SO answer, it is possible to avoid cd-ing into the zarr directory by using 7z instead of zip:

$ 7z a -tzip foo.zarr.zip foo.zarr/.
$ unzip -l foo.zarr.zip | head -n20
Archive:  foo.zarr.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        2  2021-06-01 15:39   .zattrs
       24  2021-06-01 15:39   .zgroup
        0  2021-06-01 15:39   aaa/
      365  2021-06-01 15:39   aaa/.zarray
       81  2021-06-01 15:39   aaa/.zattrs
   196189  2021-06-01 15:39   aaa/0.0.0
   196189  2021-06-01 15:39   aaa/0.0.1
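
For anyone who prefers to stay in Python, the same layout can probably be produced with the standard library alone. This is a minimal sketch, assuming a Zarr v2 directory store on local disk. `zip_directory_store` is a hypothetical helper name, and the destination file should live outside the source directory so it is not picked up by the walk.

```python
import os
import zipfile


def zip_directory_store(src_dir, dest_zip):
    """Zip the contents of src_dir without the outer directory, uncompressed."""
    with zipfile.ZipFile(
        dest_zip, mode="w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Member names are relative to src_dir, so ".zgroup" ends up
                # at the archive root instead of under "foo.zarr/".
                arcname = os.path.relpath(full_path, src_dir).replace(os.sep, "/")
                zf.write(full_path, arcname)
```

Usage would look like `zip_directory_store("foo.zarr", "foo.zarr.zip")`. Unlike the zip and 7z listings above, this sketch writes only file entries and no explicit directory entries.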

@eddiecong
Author

@pmav99 Thanks for the summary. In the end, we decided to use the LMDB storage format, which supports both reads and writes with multiprocessing. That way, we do not have to run an additional command to zip the directory.
