Description
I am debating how to define the numpy S and V data types in Zarr.

The Numpy S* dtype is used for arrays of bytes where trailing null bytes are stripped when indexing:
>>> np.array([b'a\0\0'], dtype='S3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='S3')[0]
np.bytes_(b'a')
>>> np.array([b'a\0b'], dtype='S3')[0]
np.bytes_(b'a\x00b')
>>> np.array([b'a\0b'], dtype='S3').tobytes()
b'a\x00b'
Indexing S* arrays returns a scalar whose length varies depending on what's inside the array. I don't think these indexing quirks should be specified in the Zarr spec.
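To make the point concrete, here is a small check (using the same array as above) showing that the stripping is purely an access-time behavior and never changes the stored buffer -- which is all that a Zarr data type definition needs to describe:

```python
import numpy as np

# The stripping happens only at scalar access time; the underlying
# buffer always holds the full fixed-length bytes.
arr = np.array([b'a\0\0'], dtype='S3')

assert arr.tobytes() == b'a\x00\x00'   # buffer keeps trailing nulls
assert arr[0] == b'a'                  # scalar access strips them
assert arr.itemsize == 3               # each element still occupies 3 bytes
```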
Another question is whether S* is a string or a bytes data type. On one hand, Numpy suggests that you use string-based routines for manipulating these arrays, implying that they are a string-like dtype (which they could be, via Latin-1 encoding, which uses 1 byte per character):
>>> np.array([b'aaa\0'], dtype='S4') * 2
Traceback (most recent call last):
File "<python-input-44>", line 1, in <module>
np.array([b'aaa\0'], dtype='S4') * 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
TypeError: The 'out' kwarg is necessary. Use numpy.strings.multiply without it.
>>> np.strings.multiply(np.array([b'aaa\0'], dtype='S4'), 1)
array([b'aaa'], dtype='|S3')
On the other hand, the S data type uses np.bytes_ scalars, so I think it's safe to say that it's primarily labelled as bytes, even if some numpy routines may interpret those bytes as Latin-1 encoded strings.
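As a quick sanity check of the Latin-1 claim: every byte value 0-255 maps to exactly one Latin-1 code point, so any S payload can be viewed as a string losslessly:

```python
import numpy as np

# Latin-1 is a bijection between bytes and the first 256 code points,
# so decoding can never fail and always round-trips.
raw = bytes(range(256))
assert raw.decode('latin-1').encode('latin-1') == raw

# A scalar pulled out of an S array decodes the same way
# (np.bytes_ subclasses bytes):
s = np.array([b'caf\xe9'], dtype='S4')[0]   # 0xE9 is 'é' in Latin-1
assert s.decode('latin-1') == 'café'
```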
The numpy V data type is like S, except trailing null bytes are not given special treatment when indexing:
>>> np.array([b'a\0\0'], dtype='V3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='V3')[0]
np.void(b'\x61\x00\x00')
From a Zarr POV, the differences between these two dtypes seem marginal. Both are fixed-length arrays of bytes. Although Numpy defines slightly different semantics for the two, I can't see a meaningful way to express that difference in a Zarr data type definition. I am open to feedback on this point -- maybe I have missed some key differences between numpy S* and V*.
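One way to see the layout equivalence is to reinterpret an S array as V with a zero-copy view; this uses only public numpy behavior:

```python
import numpy as np

# Both dtypes produce the same fixed-length byte buffer; only the
# Python-level scalar semantics differ.
s = np.array([b'a\0b', b'c\0\0'], dtype='S3')
v = s.view('V3')   # zero-copy reinterpretation, same itemsize

assert s.tobytes() == v.tobytes()
assert s.itemsize == v.itemsize == 3
assert v[0].tobytes() == b'a\x00b'   # V scalars keep all bytes...
assert s[1] == b'c'                  # ...while S scalars strip trailing nulls
```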
But if I'm right, then the similarity between these dtypes argues for directing numpy S* and V* users to the same as-yet-unspecified fixed-length bytes Zarr data type. The r* data type is not a great solution here because it has sub-byte resolution (edit: this is false, the raw bits dtype is restricted to multiples of 8), and is also poorly specified, as it does not have a fixed identifier.
If people broadly agree with this assessment, then I can simplify the PR over in zarr-python where I introduce zarr v3 data types for numpy S and V.