fill out numpy U dtype #1

Merged
merged 3 commits into from
May 30, 2025

33 changes: 0 additions & 33 deletions data-types/datetime64/README.md

This file was deleted.

26 changes: 0 additions & 26 deletions data-types/datetime64/schema.json

This file was deleted.

33 changes: 0 additions & 33 deletions data-types/fixed-length-ucs4/README.md

This file was deleted.

52 changes: 52 additions & 0 deletions data-types/fixed-length-utf32/README.md
@@ -0,0 +1,52 @@
# `fixed_length_utf32` data type

This document defines a data type for fixed-length Unicode strings encoded using [UTF-32](https://www.unicode.org/versions/Unicode5.0.0/appC.pdf#M9.19040.HeadingAppendix.C2.Encoding.Forms.in.ISOIEC.10646). UTF-32, also known as UCS-4, is a fixed-width encoding of Unicode strings that uses 4 bytes for each Unicode code point.

"Fixed length" as used here means that the `fixed_length_utf32` data type is parametrized by an integer length, which sets a fixed length for every scalar belonging to that data type.

### Name

The name of this data type is the string `"fixed_length_utf32"`.

### Configuration

This data type requires a configuration. The configuration for this data type is a JSON object with the following fields:

| field name | type | required | notes |
|------------|----------|---|---|
| `"length_bytes"` | integer | yes | The value MUST be an integer divisible by 4 in the inclusive range `[0, 2147483644]`. |

> Note: the maximum length of 2147483644 was chosen to match the semantics of the [NumPy `"U"` data type](https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.str_), which as of this writing has a maximum length in code points of 536870911, i.e. 2147483644 / 4.

> Note: given a particular `fixed_length_utf32` data type, the length of an array element in Unicode code points is the value of the `length_bytes` field divided by 4.
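
The constraints on `length_bytes` can be checked mechanically. The following is a minimal sketch, not part of this specification; the function name `code_points_from_length_bytes` and the constant names are hypothetical and chosen only for illustration.

```python
# Hypothetical helper, not defined by this document.
MAX_LENGTH_BYTES = 2_147_483_644  # upper bound given in the configuration table
BYTES_PER_CODE_POINT = 4          # UTF-32 uses 4 bytes per code point


def code_points_from_length_bytes(length_bytes: int) -> int:
    """Validate ``length_bytes`` and return the element length in code points."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(length_bytes, bool) or not isinstance(length_bytes, int):
        raise TypeError("length_bytes must be an integer")
    if not 0 <= length_bytes <= MAX_LENGTH_BYTES:
        raise ValueError(f"length_bytes must be in [0, {MAX_LENGTH_BYTES}]")
    if length_bytes % BYTES_PER_CODE_POINT != 0:
        raise ValueError("length_bytes must be divisible by 4")
    return length_bytes // BYTES_PER_CODE_POINT


print(code_points_from_length_bytes(4))              # 1
print(code_points_from_length_bytes(2_147_483_644))  # 536870911
```

The second call reproduces the arithmetic in the first note: the maximum byte length divided by 4 gives the maximum length in code points.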

### Examples

```json
{
    "name": "fixed_length_utf32",
    "configuration": {
        "length_bytes": 4
    }
}
```
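
As an illustration of how a reader might consume this metadata, the sketch below parses the example above and derives a NumPy-style `"U"` type string from it. The function `numpy_dtype_string` is hypothetical; it only builds a string and does not import NumPy. Byte order is left out because this document does not specify it.

```python
import json

BYTES_PER_CODE_POINT = 4  # UTF-32 uses 4 bytes per code point


def numpy_dtype_string(data_type: dict) -> str:
    """Return a NumPy-style type string such as "U1" (hypothetical helper)."""
    if data_type.get("name") != "fixed_length_utf32":
        raise ValueError("not a fixed_length_utf32 data type")
    length_bytes = data_type["configuration"]["length_bytes"]
    if length_bytes % BYTES_PER_CODE_POINT != 0:
        raise ValueError("length_bytes must be divisible by 4")
    # The "U" type string is parametrized by code points, not bytes.
    return f"U{length_bytes // BYTES_PER_CODE_POINT}"


data_type = json.loads(
    '{"name": "fixed_length_utf32", "configuration": {"length_bytes": 4}}'
)
print(numpy_dtype_string(data_type))  # U1
```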

## Fill value representation

The value of the `fill_value` metadata key MUST be a string. When encoded in UTF-32, the fill value MUST have a length in bytes equal to the `length_bytes` value specified in the `configuration` of this data type.
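
The length rule can be checked with Python's built-in codecs. This is a minimal sketch under one assumption: it encodes with `"utf-32-le"`, because Python's plain `"utf-32"` codec prepends a 4-byte byte order mark that would inflate the count. The encoded length is the same for either byte order. The function name `fill_value_is_valid` is hypothetical.

```python
def fill_value_is_valid(fill_value: object, length_bytes: int) -> bool:
    """Check that a fill value is a string of exactly ``length_bytes`` bytes in UTF-32."""
    if not isinstance(fill_value, str):
        return False
    # "utf-32-le" emits no byte order mark, so every code point costs exactly 4 bytes.
    return len(fill_value.encode("utf-32-le")) == length_bytes


print(fill_value_is_valid("a", 4))   # True
print(fill_value_is_valid("ab", 4))  # False
print(fill_value_is_valid("", 0))    # True
```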

## Codec compatibility

This data type is compatible with any codec that supports arrays with fixed-sized data types.

## Notes

This data type is designed for NumPy compatibility. A fixed-length UTF-32 encoding is not a good fit for many applications that model arrays of strings, because real string datasets often contain strings of varying lengths. A variable-length string data type should be preferred in these cases.

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)

@@ -5,7 +5,7 @@
"type": "object",
"properties": {
"name": {
- "const": "fixed-length-ucs4"
+ "const": "fixed_length_utf32"
},
"configuration": {
"type": "object",
@@ -14,13 +14,13 @@
"type": "integer"
}
},
- "required": ["length_bits"],
+ "required": ["length_bytes"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
- { "const": "fixed-length-ucs4" }
+ { "const": "fixed_length_utf32" }
]
}
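
The visible hunks suggest the schema accepts two forms: an object with `name` and `configuration`, or the bare name string. The sketch below mirrors that shape by hand, without a JSON Schema library. It makes two assumptions: that the integer property in `configuration` is named `length_bytes` (the property key falls outside the hunks shown), and that the enclosing keyword lets either alternative match. The function `matches_visible_schema` is hypothetical.

```python
NAME = "fixed_length_utf32"


def matches_visible_schema(value: object) -> bool:
    """Hand-written approximation of the schema fragments shown in the diff."""
    # Second alternative: the bare name string.
    if isinstance(value, str):
        return value == NAME
    # First alternative: an object with exactly "name" and "configuration".
    if not isinstance(value, dict) or set(value) != {"name", "configuration"}:
        return False
    if value["name"] != NAME:
        return False
    config = value["configuration"]
    # Assumption: "length_bytes" is the only permitted configuration key.
    if not isinstance(config, dict) or set(config) != {"length_bytes"}:
        return False
    length = config["length_bytes"]
    # JSON booleans are not integers, but Python's bool is an int subclass.
    return isinstance(length, int) and not isinstance(length, bool)


print(matches_visible_schema({"name": NAME, "configuration": {"length_bytes": 4}}))  # True
print(matches_visible_schema({"name": NAME, "configuration": {"length_bits": 32}}))  # False
```

Note that this check covers structure only. The range and divisibility rules from the README are not part of the schema fragments shown here.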