
Poor blosc compression ratios compared to v2 #2171

Closed

@agoodm

Description

Zarr version: v3
Numcodecs version: n/a
Python Version: n/a
Operating System: n/a
Installation: n/a

Description

While playing around with v3, I noticed that the blosc codec wasn't compressing my sample data as well as I expected, and after multiple comparisons I have confirmed that I am getting different results between v3 and v2; in some cases v3 blosc-compressed chunks end up 10-20x larger than in v2 (see below).

Steps to reproduce

import numpy as np
import zarr
import zarr.v2 as zarr_v2
from numcodecs import Blosc

a = np.random.rand(1000000)

# v3
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()]
z3 = zarr.array(a, chunks=(10000,), codecs=codecs)
v3_size = len(z3.store_path.store._store_dict["c/0"]) # 75136

# v2
compressor = Blosc("zstd") # Should be equivalent to default settings for v3 Blosc codec
z2 = zarr_v2.array(a, chunks=z3.chunks, compressor=compressor)
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 
# v2 compressed chunk size: 70113
# v3 compressed chunk size: 75136

The difference isn't huge in this case, but it can be much more noticeable in others. For example:

b = np.arange(1000000)
z3[:] = b
v3_size = len(z3.store_path.store._store_dict["c/0"]) 

z2[:] = b
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 

# v2 compressed chunk size: 1383
# v3 compressed chunk size: 11348

Cause and Possible Solution

In numcodecs, the blosc compressor improves compression ratios by inferring the item size from the input buffer's numpy dtype. But in v3, the blosc codec is implemented as a BytesBytesCodec and receives each chunk as raw bytes on encode (hence BytesCodec() being required in the list of codecs in my example), so numcodecs infers an item size of 1.
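As a quick side-by-side (my own sketch, not part of the zarr codebase): feeding numcodecs' Blosc the same data as a typed array versus raw bytes shows the item size effect directly.

import numpy as np
from numcodecs import Blosc

codec = Blosc("zstd")
b = np.arange(1000000)  # int64, item size 8

typed_size = len(codec.encode(b))           # item size inferred from the dtype
raw_size = len(codec.encode(b.tobytes()))   # bytes input, item size falls back to 1
print(typed_size, raw_size)                 # the typed encoding should be much smaller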

A simple fix for this is to make the following change in blosc.py:

    async def _encode_single(
        self,
        chunk_bytes: Buffer,
        chunk_spec: ArraySpec,
    ) -> Buffer | None:
        # Since blosc only supports host memory, we convert the input and output
        # of the encoding between numpy array and buffer. Viewing the bytes as the
        # chunk's dtype lets numcodecs infer the correct item size.
        return await to_thread(
            lambda chunk: chunk_spec.prototype.buffer.from_bytes(
                self._blosc_codec.encode(chunk.as_numpy_array().view(chunk_spec.dtype))
            ),
            chunk_bytes,
        )

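With this change, numcodecs can again infer the item size from the chunk's dtype (8 bytes for the float64/int64 data above), so re-running the reproduction should yield v3 chunk sizes comparable to v2's.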
Thoughts

I am still just getting started with v3, but this has made me curious about one thing: why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, because I need to include BytesCodec first when specifying my codecs in v3, each chunk is first encoded into bytes and then viewed back as an array of its original dtype, making the bytes codec effectively a pointless no-op in this case, as the sketch below illustrates.
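Here is a minimal illustration of that round trip (my own sketch; it assumes a C-contiguous, native-endian array):

import numpy as np

a = np.arange(10, dtype="float64")
# Serialize to bytes (what BytesCodec does on encode), then view the bytes back
# as the original dtype (what the patched blosc encode does): nothing changes.
roundtrip = np.frombuffer(a.tobytes(), dtype=a.dtype)
assert np.array_equal(a, roundtrip)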
