Description
I am debating how to define the numpy S and V data types in Zarr.

The Numpy S* dtype is used for arrays of bytes where trailing null bytes are stripped when indexing:
>>> np.array([b'a\0\0'], dtype='S3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='S3')[0]
np.bytes_(b'a')
>>> np.array([b'a\0b'], dtype='S3')[0]
np.bytes_(b'a\x00b')
>>> np.array([b'a\0b'], dtype='S3').tobytes()
b'a\x00b'
Indexing S* arrays returns a scalar whose length varies depending on what's inside the array. I don't think these indexing quirks should be specified in the Zarr spec.
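To make the point concrete, here is a small check (using the same array as above) showing that the stripping is purely an access-time behavior and never changes the stored buffer -- which is all that a Zarr data type definition needs to describe:

```python
import numpy as np

# The stripping happens only at scalar access time; the underlying
# buffer always holds the full fixed-length bytes.
arr = np.array([b'a\0\0'], dtype='S3')

assert arr.tobytes() == b'a\x00\x00'   # buffer keeps trailing nulls
assert arr[0] == b'a'                  # scalar access strips them
assert arr.itemsize == 3               # each element still occupies 3 bytes
```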
Another question is whether S* is a string or a bytes data type. On one hand, Numpy suggests that you use string-based routines for manipulating these arrays, implying that they are a string-like dtype (which they could be, via Latin-1 encoding, which uses 1 byte per character):
>>> np.array([b'aaa\0'], dtype='S4') * 2
Traceback (most recent call last):
File "<python-input-44>", line 1, in <module>
np.array([b'aaa\0'], dtype='S4') * 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
TypeError: The 'out' kwarg is necessary. Use numpy.strings.multiply without it.
>>> np.strings.multiply(np.array([b'aaa\0'], dtype='S4'), 1)
array([b'aaa'], dtype='|S3')
On the other hand, the S data type uses np.bytes_ scalars, so I think it's safe to say that it's primarily labelled as bytes, even if some numpy routines may interpret those bytes as Latin-1 encoded strings.
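As a quick sanity check of the Latin-1 claim: every byte value 0-255 maps to exactly one Latin-1 code point, so any S payload can be viewed as a string losslessly:

```python
import numpy as np

# Latin-1 is a bijection between bytes and the first 256 code points,
# so decoding can never fail and always round-trips.
raw = bytes(range(256))
assert raw.decode('latin-1').encode('latin-1') == raw

# A scalar pulled out of an S array decodes the same way
# (np.bytes_ subclasses bytes):
s = np.array([b'caf\xe9'], dtype='S4')[0]   # 0xE9 is 'é' in Latin-1
assert s.decode('latin-1') == 'café'
```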
The numpy V data type is like S, except trailing null bytes are not given special treatment when indexing:
>>> np.array([b'a\0\0'], dtype='V3').tobytes()
b'a\x00\x00'
>>> np.array([b'a\0\0'], dtype='V3')[0]
np.void(b'\x61\x00\x00')
From a Zarr POV, the differences between these two dtypes seem marginal. Both are fixed-length arrays of bytes. Although Numpy defines slightly different semantics for the two, I can't see a meaningful way to express that difference in a Zarr data type definition. I am open to feedback on this point -- maybe I have missed some key differences between numpy S* and V*.
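One way to see the layout equivalence is to reinterpret an S array as V with a zero-copy view; this uses only public numpy behavior:

```python
import numpy as np

# Both dtypes produce the same fixed-length byte buffer; only the
# Python-level scalar semantics differ.
s = np.array([b'a\0b', b'c\0\0'], dtype='S3')
v = s.view('V3')   # zero-copy reinterpretation, same itemsize

assert s.tobytes() == v.tobytes()
assert s.itemsize == v.itemsize == 3
assert v[0].tobytes() == b'a\x00b'   # V scalars keep all bytes...
assert s[1] == b'c'                  # ...while S scalars strip trailing nulls
```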
But if I'm right, then the similarity between these dtypes argues for directing numpy S* and V* users to the same as-yet-unspecified fixed-length bytes Zarr data type. The r* data type is not a great solution here because it has sub-byte resolution (edit: this is false, the raw bits dtype is restricted to multiples of 8), and is also poorly specified, as it does not have a fixed identifier.
If people broadly agree with this assessment, then I can simplify the PR over in zarr-python where I introduce zarr v3 data types for numpy S and V.