
Poor blosc compression ratios compared to v2 #2171

Closed

@agoodm

Description

Zarr version: v3
Numcodecs version: n/a
Python Version: n/a
Operating System: n/a
Installation: n/a

Description

While playing around with v3, I noticed that the blosc codec wasn't compressing my sample data as well as I expected, and after multiple comparisons I have confirmed that I am getting different results between v3 and v2; in some cases v3 blosc-compressed chunks end up 10-20x larger than in v2 (see below).

Steps to reproduce

import numpy as np
import zarr
import zarr.v2 as zarr_v2
from numcodecs import Blosc

a = np.random.rand(1000000)

# v3
codecs = [zarr.codecs.BytesCodec(), zarr.codecs.BloscCodec()]
z3 = zarr.array(a, chunks=(10000,), codecs=codecs)
v3_size = len(z3.store_path.store._store_dict["c/0"]) # 75136

# v2
compressor = Blosc("zstd") # Should be equivalent to default settings for v3 Blosc codec
z2 = zarr_v2.array(a, chunks=z3.chunks, compressor=compressor)
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 
# v2 compressed chunk size: 70113
# v3 compressed chunk size: 75136

The difference isn't huge in this case, but it can be much more noticeable in others. For example:

b = np.arange(1000000)
z3[:] = b
v3_size = len(z3.store_path.store._store_dict["c/0"]) 

z2[:] = b
v2_size = len(z2.store["0"]) 

print(f"v2 compressed chunk size: {v2_size}") 
print(f"v3 compressed chunk size: {v3_size}") 

# v2 compressed chunk size: 1383
# v3 compressed chunk size: 11348

Cause and Possible Solution

In numcodecs, the blosc compressor improves compression ratios by inferring the item size from the input buffer's numpy dtype. But in v3, the blosc codec is implemented as a BytesBytesCodec and receives each chunk as raw bytes on encode (hence BytesCodec() being required in the list of codecs in my example), so numcodecs infers an item size of 1.
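As a quick side-by-side (my own sketch, not part of the zarr codebase): feeding numcodecs' Blosc the same data as a typed array versus raw bytes shows the item size effect directly.

import numpy as np
from numcodecs import Blosc

codec = Blosc("zstd")
b = np.arange(1000000)  # int64, item size 8

typed_size = len(codec.encode(b))           # item size inferred from the dtype
raw_size = len(codec.encode(b.tobytes()))   # bytes input, item size falls back to 1
print(typed_size, raw_size)                 # the typed encoding should be much smaller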

A simple fix for this is to make the following change in blosc.py:

    async def _encode_single(
        self,
        chunk_bytes: Buffer,
        chunk_spec: ArraySpec,
    ) -> Buffer | None:
        # Since blosc only supports host memory, we convert the input and output
        # of the encoding between numpy array and buffer. Viewing the bytes as the
        # chunk's dtype lets numcodecs infer the correct item size.
        return await to_thread(
            lambda chunk: chunk_spec.prototype.buffer.from_bytes(
                self._blosc_codec.encode(chunk.as_numpy_array().view(chunk_spec.dtype))
            ),
            chunk_bytes,
        )

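With this change, numcodecs can again infer the item size from the chunk's dtype (8 bytes for the float64/int64 data above), so re-running the reproduction should yield v3 chunk sizes comparable to v2's.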
Thoughts

I am still just getting started with v3, but this has made me curious about one thing: why is the blosc codec implemented as a BytesBytesCodec rather than as an ArrayBytesCodec, considering that it can accept (and is optimized for) numpy array input? Although the above solution does work, because I need to include BytesCodec first when specifying my codecs in v3, each chunk is first encoded into bytes and then viewed back as an array of its original dtype, making the bytes codec effectively a pointless no-op in this case, as the sketch below illustrates.
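Here is a minimal illustration of that round trip (my own sketch; it assumes a C-contiguous, native-endian array):

import numpy as np

a = np.arange(10, dtype="float64")
# Serialize to bytes (what BytesCodec does on encode), then view the bytes back
# as the original dtype (what the patched blosc encode does): nothing changes.
roundtrip = np.frombuffer(a.tobytes(), dtype=a.dtype)
assert np.array_equal(a, roundtrip)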
