
numpy S* and V* dtypes #18

Open
@d-v-b


I am debating how to define the numpy S and V data types in Zarr.

The Numpy S* dtype is used for arrays of fixed-length byte strings, where trailing null bytes are stripped when indexing:

>>> np.array([b'a\0\0'], dtype='S3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='S3')[0]
np.bytes_(b'a')
>>> np.array([b'a\0b'], dtype='S3')[0]
np.bytes_(b'a\x00b')
>>> np.array([b'a\0b'], dtype='S3').tobytes()
b'a\x00b'

Indexing S* arrays returns a scalar with a variable length, depending on what's inside the array. I don't think these indexing quirks should be specified in the zarr spec.
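To make the variable-length behaviour concrete, here is a small sketch (the array contents are just an illustration) showing that every S3 element occupies 3 bytes in the buffer, while the scalars returned by indexing can be shorter:

```python
import numpy as np

# Two S3 elements: one padded with trailing nulls, one using all 3 bytes.
arr = np.array([b"a\0\0", b"abc"], dtype="S3")

# Storage is fixed-length: 3 bytes per element, nulls included.
assert arr.itemsize == 3
assert arr.tobytes() == b"a\x00\x00abc"

# Indexing strips trailing nulls, so the scalar length depends on the data.
lengths = [len(item) for item in arr]
print(lengths)  # [1, 3]
```

The stored bytes are unambiguous; only the scalar returned by indexing varies, which is a property of the Numpy API rather than of the data layout.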

Another question is whether S* is a string, or bytes, data type. On one hand, Numpy suggests that you use string-based routines for manipulating these arrays, implying that they are a string-like dtype (which they could be, via Latin-1 encoding, which uses 1 byte per character):

>>> np.array([b'aaa\0'], dtype='S4') * 2
Traceback (most recent call last):
  File "<python-input-44>", line 1, in <module>
    np.array([b'aaa\0'], dtype='S4') * 2
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
TypeError: The 'out' kwarg is necessary. Use numpy.strings.multiply without it.
>>> np.strings.multiply(np.array([b'aaa\0'], dtype='S4'), 1)
array([b'aaa'], dtype='|S3')

On the other hand, the S data type uses np.bytes_ scalars, so I think it's safe to say that it's primarily labelled as bytes, even if some numpy routines may interpret those bytes as Latin-1 encoded strings.
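As a rough illustration of both readings, the sketch below (example values are arbitrary) checks that S scalars are bytes objects, and that the same bytes can still be decoded as Latin-1 text, since Latin-1 maps every byte value to exactly one character:

```python
import numpy as np

arr = np.array([b"caf\xe9"], dtype="S4")

# The scalar type is np.bytes_, which subclasses Python's bytes.
item = arr[0]
assert isinstance(item, bytes)

# The same data can be read as text by decoding it as Latin-1.
decoded = np.char.decode(arr, "latin-1")
assert decoded.dtype.kind == "U"
assert decoded[0] == "caf\u00e9"

# Latin-1 round-trips every possible byte value, 1 byte per character.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The decoding step is opt-in: nothing in the array itself records an encoding.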

The numpy V data type is like S, except trailing null bytes are not given special treatment when indexing:

>>> np.array([b'a\0\0'], dtype='V3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='V3')[0]
np.void(b'\x61\x00\x00')

From a Zarr POV, the differences between these two dtypes seem marginal. Both are fixed-length arrays of bytes. Although Numpy defines slightly different semantics for the two, I can't see a meaningful way to express that difference in a Zarr data type definition. I am open to feedback on this point -- maybe I have missed some key differences between numpy S* and V*.
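One way to check this claim is to reinterpret the same buffer under both dtypes. In this sketch (example data only), the stored bytes are identical and only the scalars returned by indexing differ:

```python
import numpy as np

s_arr = np.array([b"a\0\0", b"a\0b"], dtype="S3")

# Reinterpret the same memory as V3; no bytes are copied or changed.
v_arr = s_arr.view("V3")

# Identical layout: same item size, same serialized bytes.
assert s_arr.itemsize == v_arr.itemsize == 3
assert s_arr.tobytes() == v_arr.tobytes() == b"a\x00\x00a\x00b"

# The difference only shows up in the scalars returned by indexing.
s_item = bytes(s_arr[0])      # trailing nulls stripped
v_item = v_arr[0].tobytes()   # all 3 bytes preserved
assert s_item == b"a"
assert v_item == b"a\x00\x00"
```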

But if I'm right, then the similarity between these dtypes argues for directing numpy S* and V* users to the same as-yet-unspecified fixed-length bytes Zarr data type. The r* data type is not a great solution here because it has sub-byte resolution (edit: this is false, the raw bits dtype is restricted to multiples of 8), and is also poorly specified, as it does not have a fixed identifier.
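For illustration only, mapping both dtypes onto one Zarr data type might look like the sketch below. The name `fixed_length_bytes`, the `length_bytes` key, and the helper function are all hypothetical placeholders; no such data type is specified yet.

```python
import numpy as np

def to_fixed_length_bytes_metadata(dtype):
    """Map a numpy S* or V* dtype to a HYPOTHETICAL Zarr data type dict."""
    dtype = np.dtype(dtype)
    # Structured dtypes also report kind "V", so exclude them via .fields.
    if dtype.kind not in ("S", "V") or dtype.fields is not None:
        raise ValueError(f"expected an S* or unstructured V* dtype, got {dtype}")
    return {
        "name": "fixed_length_bytes",  # placeholder identifier, not in any spec
        "configuration": {"length_bytes": dtype.itemsize},
    }

# Both dtypes collapse to the same metadata; the S/V distinction is dropped.
s_meta = to_fixed_length_bytes_metadata("S3")
v_meta = to_fixed_length_bytes_metadata("V3")
assert s_meta == v_meta
```

Under this sketch a reader could not tell from the metadata whether the array was written from S3 or V3, which is the trade-off being proposed.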

If people broadly agree with this assessment, then I can simplify the PR over in zarr-python where I introduce zarr v3 data types for numpy S and V.
