[v3] Fixed-width unicode string support in zarr v3 #2347

Open
@TomAugspurger

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

As mentioned in #2323 (comment), we currently can't create an array with a fixed-width string dtype in zarr v3.

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())
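For comparison, here is a small sketch of how a fixed-width unicode dtype behaves in plain NumPy, with no zarr involved. It only illustrates the semantics a `U3` array has in NumPy; it says nothing about how zarr v3 would or should implement them.

```python
import numpy as np

# Fixed-width unicode: every element holds at most 3 code points.
fixed = np.array(["a", "bb", "ccc"], dtype="U3")

print(fixed.dtype.kind)      # 'U' (fixed-width unicode)
print(fixed.dtype.itemsize)  # 12: 3 code points * 4 bytes each (UCS-4)

# Assigning a longer string silently truncates it to the fixed width.
fixed[0] = "dddd"
print(fixed[0])              # ddd
```

The silent truncation is one behavior a fixed-width zarr dtype would presumably need to either match or document.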

The NumPy dtype of that array should be U3, a fixed-width unicode string dtype, rather than the variable-width StringDType returned today. Fixed-width strings should be supported in addition to the variable-width strings currently in use, not instead of them. Some initial questions I don't know the answer to:

  1. What data_type shows up in the metadata?
  2. What codecs are needed?
  3. How are the actual bytes stored? In parquet, fixed_len_byte_array is one of the primitive types.
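To make the third question concrete, here is a minimal sketch of one possible fixed-length layout, using only the Python standard library. It pads each string with NUL code points and encodes it as UTF-32-LE, which mirrors how NumPy lays out a `U3` array in memory on a little-endian machine. The helper names `encode_fixed_width` and `decode_fixed_width` are hypothetical, and nothing here is an existing zarr codec or a decided storage format.

```python
def encode_fixed_width(strings, width):
    """Pack strings into one buffer of `width` UTF-32-LE code points each.

    Hypothetical helper for illustration; not part of zarr or numcodecs.
    """
    chunks = []
    for s in strings:
        if len(s) > width:
            raise ValueError(f"{s!r} is longer than {width} code points")
        # Pad with NUL code points so every item occupies exactly width * 4 bytes.
        chunks.append(s.ljust(width, "\x00").encode("utf-32-le"))
    return b"".join(chunks)


def decode_fixed_width(buf, width):
    """Inverse of encode_fixed_width: split into items and strip NUL padding."""
    item_size = width * 4
    if len(buf) % item_size:
        raise ValueError("buffer length is not a multiple of the item size")
    return [
        buf[i:i + item_size].decode("utf-32-le").rstrip("\x00")
        for i in range(0, len(buf), item_size)
    ]


raw = encode_fixed_width(["a", "bb", "ccc"], width=3)
print(len(raw))                          # 36 = 3 items * 3 code points * 4 bytes
print(decode_fixed_width(raw, width=3))  # ['a', 'bb', 'ccc']
```

Because every item has the same size, a buffer like this could in principle be treated like any other fixed-size element type, similar to parquet's fixed_len_byte_array. The cost is 4 bytes per code point even for ASCII data, and trailing NUL characters in a string cannot be round-tripped.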

Steps to reproduce

.

Additional output

No response

Labels: bug
