Zarr version: v3
Numcodecs version: n/a
Python Version: n/a
Operating System: n/a
Installation: n/a
Description
While playing around with v3 a bit, I noticed that the blosc codec wasn't compressing my sample data as well as I expected. After multiple comparisons I have confirmed that I get different results between v3 and v2; in some cases, v3 blosc-compressed chunks end up being 10-20x larger than their v2 counterparts (see below).
Steps to reproduce
import numpy as np
import zarr
import zarr.v2 as zarr_v2
from numcodecs import Blosc

a = np.random.rand(1000000)

# v3
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()]
z3 = zarr.array(a, chunks=(10000,), codecs=codecs)
v3_size = len(z3.store_path.store._store_dict["c/0"])  # size of the first compressed chunk

# v2
compressor = Blosc("zstd")  # should be equivalent to the default settings of the v3 Blosc codec
z2 = zarr_v2.array(a, chunks=z3.chunks, compressor=compressor)
v2_size = len(z2.store["0"])

print(f"v2 compressed chunk size: {v2_size}")
print(f"v3 compressed chunk size: {v3_size}")
# v2 compressed chunk size: 70113
# v3 compressed chunk size: 75136
The difference isn't huge in this case, but it can be much more noticeable in others. For example:
b = np.arange(1000000)  # sequential integers: highly compressible once byte-shuffled
z3[:] = b
v3_size = len(z3.store_path.store._store_dict["c/0"])
z2[:] = b
v2_size = len(z2.store["0"])
print(f"v2 compressed chunk size: {v2_size}")
print(f"v3 compressed chunk size: {v3_size}")
# v2 compressed chunk size: 1383
# v3 compressed chunk size: 11348
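I suspect the size of the gap on this kind of data comes down to blosc's byte-shuffle filter, which regroups bytes by their position within each item and therefore depends on the item size blosc is told. A minimal sketch using numcodecs directly (the parameters here are my assumptions, not necessarily the exact defaults of either version) shows how much shuffle matters on a sequence like this:

import numpy as np
from numcodecs import Blosc

chunk = np.arange(10000, dtype="float64")

# Shuffle regroups bytes across 8-byte items before compression.
with_shuffle = Blosc("zstd", shuffle=Blosc.SHUFFLE)
# No shuffle is roughly what happens when the item size collapses to 1.
without_shuffle = Blosc("zstd", shuffle=Blosc.NOSHUFFLE)

print(len(with_shuffle.encode(chunk)), len(without_shuffle.encode(chunk)))
# shuffle typically wins by a wide margin on slowly varying sequences like this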
Cause and Possible Solution
In numcodecs, the blosc compressor is able to improve compression ratios by inferring the item size from the input buffer's numpy dtype. In v3, however, the blosc codec is implemented as a BytesBytesCodec and is fed each chunk as raw bytes on encode (hence the BytesCodec() required in the codec list in my example), so numcodecs infers an item size of 1.
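This is easy to check with numcodecs alone. In the following sketch (mine, independent of zarr), the same underlying bytes compress differently depending on the dtype they are handed in as:

import numpy as np
from numcodecs import Blosc

codec = Blosc("zstd")
data = np.arange(10000, dtype="float64")

# Passed as float64, numcodecs reports an item size of 8 to blosc.
encoded_as_float64 = codec.encode(data)
# Passed as a uint8 view of the same buffer, the item size becomes 1.
encoded_as_bytes = codec.encode(data.view("uint8"))

print(len(encoded_as_float64), len(encoded_as_bytes))
# the uint8 path typically compresses noticeably worse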
A simple fix for this is to make the following change in blosc.py:
async def _encode_single(
    self,
    chunk_bytes: Buffer,
    chunk_spec: ArraySpec,
) -> Buffer | None:
    # Since blosc only supports host memory, we convert the input and output
    # of the encoding between numpy array and buffer
    return await to_thread(
        lambda chunk: chunk_spec.prototype.buffer.from_bytes(
            self._blosc_codec.encode(chunk.as_numpy_array().view(chunk_spec.dtype))
        ),
        chunk_bytes,
    )
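If I am reading the current implementation correctly, the only addition here is the .view(chunk_spec.dtype) call, which reinterprets the raw chunk bytes with the chunk's dtype so that numcodecs passes the true item size through to blosc instead of 1.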
Thoughts
I am still just getting started with v3, but this has made me curious about one thing: why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above fix does work, because I need to put BytesCodec first when specifying my codecs in v3, each chunk is first encoded into bytes and then viewed back as an array of its original dtype, making the bytes codec effectively a pointless no-op in this case.
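To make the no-op concrete, this is what the round trip amounts to in plain numpy terms (my sketch, assuming native byte order):

import numpy as np

a = np.random.rand(10000)
# What BytesCodec does on encode, conceptually: serialize the array to raw bytes.
raw = a.tobytes()
# What the fix above then does: view those bytes as the original dtype again.
restored = np.frombuffer(raw, dtype=a.dtype)
assert np.array_equal(a, restored)  # an identity round trip for native-order data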