Description
Problem: I have a file stored on S3 and I wanted to store a second copy with a different chunk structure (optimized for time-series reads instead of spatial reads). Trying to directly rechunk the data on S3 turned out to be too slow.
Question: Is there a better way to rechunk zarr files on S3 than my code below? E.g. am I missing a flag or something that will speed this up?
Code Sample: Unfortunately this code is not runnable because I don't want to share my data (and I also wrote it from memory, since I've already released the cloud server where I did the work).
>>> import s3fs
>>> import zarr
>>> import time
>>> # Point to S3 zarr array
>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='us-west-1'))
>>> store = s3fs.S3Map(root=<my_zarr_file_on_s3>, s3=s3, check=False)
>>> # Open the group with 'append' mode so that I can create new file and write to it
>>> root = zarr.open(store=store, mode="a")
>>> z = root["space_chunked"]
>>> z.shape
(1600, 3600, 1800)
>>> z.chunks
(128, 128, 16)
>>> zt = root.create_dataset(
...     "time_chunked",
...     shape=z.shape,
...     chunks=(16, 16, 128),
...     dtype=z.dtype,
...     fill_value=z.fill_value,
... )
>>> for i in range(z.shape[0] // zt.chunks[0] + 1):
...     for j in range(z.shape[1] // zt.chunks[1] + 1):
...         for k in range(z.shape[2] // zt.chunks[2] + 1):
...             slc = tuple(slice(ijk * zt.chunks[ii], (ijk + 1) * zt.chunks[ii]) for ii, ijk in enumerate([i, j, k]))
...             start = time.time()
...             zt[slc] = z[slc]
...             print("Updated in {}".format(time.time() - start))
Results: Writing the initial chunks took on the order of ~0.3 seconds each. It then slowed down to the point where it was taking ~9 seconds per chunk. At that point I stopped the operation, copied the data to a cloud server, rechunked it there, and uploaded the data to S3 again.
Discussion: I'm assuming zarr is doing a list operation or something like that during writes? That gets pretty slow on S3 as the number of stored keys grows...
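One way to test that assumption would be to wrap the store mapping in a counter before opening it, so each chunk write shows which store operations it triggers. This is only a sketch I haven't run (the CountingMap class is made up for illustration; it reuses the s3fs store and the z array from the snippet above):
>>> import collections
>>> import collections.abc
>>> class CountingMap(collections.abc.MutableMapping):
...     """Hypothetical wrapper that counts operations on the underlying store."""
...     def __init__(self, inner):
...         self.inner = inner
...         self.counts = collections.Counter()
...     def __getitem__(self, key):
...         self.counts["get"] += 1
...         return self.inner[key]
...     def __setitem__(self, key, value):
...         self.counts["set"] += 1
...         self.inner[key] = value
...     def __delitem__(self, key):
...         self.counts["del"] += 1
...         del self.inner[key]
...     def __contains__(self, key):
...         self.counts["contains"] += 1
...         return key in self.inner
...     def __iter__(self):
...         self.counts["iter"] += 1
...         return iter(self.inner)
...     def __len__(self):
...         self.counts["len"] += 1
...         return len(self.inner)
>>> counting = CountingMap(store)
>>> root2 = zarr.open(store=counting, mode="a")
>>> zt2 = root2["time_chunked"]
>>> # Time/count a single destination-chunk write
>>> zt2[0:16, 0:16, 0:128] = z[0:16, 0:16, 0:128]
>>> print(counting.counts)
If "iter" or "contains" counts grow with the number of stored chunks, that would confirm the listing suspicion; if not, the per-write latency is presumably coming from s3fs itself.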
I described my simplest use case above. I actually run into this as well when I'm appending to a dataset on S3, and when I'm building up a large processed dataset.
I'm using serverless (Lambda functions) for most of these operations, and for some use cases they end up timing out once the number of stored chunks becomes too large. My hacked solution is to copy the chunks that will be updated to local temporary storage, update them there, and then copy them back to S3... MUCH faster. It would be nice if Zarr could do something similar under the hood given the correct flags (perhaps skipping some verification).
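For reference, here is a rough sketch of that workaround for the rechunking case. The scratch path is made up, it reuses the S3 store from the snippet above, and it assumes the array (or at least the slab being rechunked) fits on local disk and in memory:
>>> # Pull the array to local disk, rechunk there, push only the new array back.
>>> local_store = zarr.DirectoryStore("/tmp/rechunk_scratch")  # made-up scratch path
>>> # 1. Copy the source array's chunks down from S3 in one pass.
>>> zarr.copy_store(store, local_store, source_path="space_chunked",
...                 dest_path="space_chunked")
>>> # 2. Rechunk locally, where per-chunk latency is negligible.
>>> local_root = zarr.open(store=local_store, mode="a")
>>> src = local_root["space_chunked"]
>>> dst = local_root.create_dataset(
...     "time_chunked",
...     shape=src.shape,
...     chunks=(16, 16, 128),
...     dtype=src.dtype,
...     fill_value=src.fill_value,
... )
>>> dst[:] = src[:]  # or a blocked loop if the array does not fit in memory
>>> # 3. Copy only the new array's chunks back up to S3.
>>> zarr.copy_store(local_store, store, source_path="time_chunked",
...                 dest_path="time_chunked")
The bulk copies in steps 1 and 3 are dominated by throughput rather than per-chunk request latency, which I assume is why this ends up so much faster than rechunking directly against S3.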
Version and installation information
Please provide the following:
- Value of zarr.__version__: 2.3.2
- Value of numcodecs.__version__: 0.6.4
- Version of Python interpreter: 3.7
- Version of s3fs: 0.4.2
- Operating system (Linux/Windows/Mac): Linux
- How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): both