feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5
base: main
Changes from all commits
89ea29d
ba09d7e
1be0c9e
aa21bfa
46ac0b8
13f8a4a
bbc69d1
7894f9b
@@ -0,0 +1,33 @@
# Fixed-length ASCII data type

Defines a data type for fixed-length ASCII strings.

## Permitted fill values

The value of the `fill_value` metadata key must be a string.

Review comment: we will need more constraints here, such as a length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length.

Review comment: I agree that the length constraint makes sense. But not critical for merging this, imo.

Review comment: Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8?

Review comment: I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here.

Review comment: I also think that a "native"

## Example

For example, the array metadata below specifies that the array contains fixed-length ASCII strings:

```json
{
    "data_type": "fixed-length-ascii",
    "fill_value": "",
    "configuration": {
        "length_bits": 24
    }
}
```
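
The review thread asks what a `length_bits` value that is not a multiple of 8 would mean, and whether the fill value should be constrained by the length. The Python sketch below shows one possible answer. It assumes, which the draft does not state, that the length must be a whole number of bytes and that the fill value must be ASCII and fit within that length. The helper name `check_fixed_length_ascii` is hypothetical.

```python
def check_fixed_length_ascii(length_bits: int, fill_value: str) -> int:
    """Return the element length in bytes, or raise ValueError.

    Assumptions (not part of the draft): the length must be a
    non-negative multiple of 8, and the fill value must be ASCII and
    no longer than the element length.
    """
    if isinstance(length_bits, bool) or not isinstance(length_bits, int):
        raise ValueError("length_bits must be an integer")
    if length_bits < 0 or length_bits % 8 != 0:
        raise ValueError("length_bits must be a non-negative multiple of 8")
    length_bytes = length_bits // 8
    try:
        encoded = fill_value.encode("ascii")
    except UnicodeEncodeError as exc:
        raise ValueError("fill_value must be ASCII") from exc
    if len(encoded) > length_bytes:
        raise ValueError("fill_value is longer than the element length")
    return length_bytes


# The metadata from the example: 24 bits is 3 bytes, and "" fits.
print(check_fixed_length_ascii(24, ""))  # 3
```

Under these assumptions a value such as `length_bits: 20` would be rejected outright, which is one reason the reviewers suggest expressing the length in bytes instead.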

## Notes

TBD

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
@@ -0,0 +1,26 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "fixed-length-ascii"
        },
        "configuration": {
          "type": "object",
          "properties": {
            "length_bits": {
              "type": "integer"
            }
          },
          "required": ["length_bits"],
          "additionalProperties": false
        }
      },
      "required": ["name", "configuration"],
      "additionalProperties": false
    },
    { "const": "fixed-length-ascii" }
  ]
}
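
The schema above accepts two shapes: the bare name string, or an object with exactly the keys `name` and `configuration`. As an illustration only, the stdlib-only Python sketch below mirrors that `oneOf` by hand. A real validator would normally use a JSON Schema library; `matches_schema` and `is_json_integer` are hypothetical helpers, not part of any Zarr API.

```python
import json

NAME = "fixed-length-ascii"


def is_json_integer(value) -> bool:
    # JSON Schema treats 24.0 as an integer; Python's bool is excluded
    # because JSON true/false are not numbers.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) or (
        isinstance(value, float) and value.is_integer()
    )


def matches_schema(value) -> bool:
    """Hand-written mirror of the `oneOf` above; not a general validator."""
    if isinstance(value, str):
        return value == NAME  # the bare-string branch
    if not isinstance(value, dict):
        return False
    # Object branch: both keys required, no additional properties.
    if set(value) != {"name", "configuration"}:
        return False
    config = value["configuration"]
    return (
        value["name"] == NAME
        and isinstance(config, dict)
        and set(config) == {"length_bits"}
        and is_json_integer(config["length_bits"])
    )


doc = json.loads(
    '{"name": "fixed-length-ascii", "configuration": {"length_bits": 24}}'
)
print(matches_schema(doc))                   # True
print(matches_schema("fixed-length-ascii"))  # True
print(matches_schema({"name": NAME}))        # False: configuration is required
```

Note that the schema as written only requires `length_bits` to be an integer, so negative values and values that are not multiples of 8 pass this check as well.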
@@ -0,0 +1,33 @@
# Fixed-length bytes data type

Defines a data type for fixed-length byte strings.

## Permitted fill values

The value of the `fill_value` metadata key must be a string.

## Example

For example, the array metadata below specifies that the array contains fixed-length byte strings:

```json
{
    "data_type": "fixed-length-bytes",
    "fill_value": "",
    "configuration": {
        "length_bits": 24
    }
}
```

## Notes

TBD

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
@@ -0,0 +1,26 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "fixed-length-bytes"
        },
        "configuration": {
          "type": "object",
          "properties": {
            "length_bits": {
              "type": "integer"
            }
          },
          "required": ["length_bits"],
          "additionalProperties": false
        }
      },
      "required": ["name", "configuration"],
      "additionalProperties": false
    },
    { "const": "fixed-length-bytes" }
  ]
}
@@ -0,0 +1,54 @@
# `fixed_length_utf32` data type

This document defines a data type for fixed-length, null-terminated Unicode strings encoded using [UTF-32](https://www.unicode.org/versions/Unicode5.0.0/appC.pdf#M9.19040.HeadingAppendix.C2.Encoding.Forms.in.ISOIEC.10646). UTF-32, also known as UCS-4, is an encoding of Unicode strings that allocates 4 bytes to each Unicode code point.

"Fixed length" as used here means that the `fixed_length_utf32` data type is parameterized by an integral length, which sets a fixed length for every scalar belonging to that data type.

"Null-terminated" as used here means that, for an integral length `L`, a `fixed_length_utf32` data type parameterized with `L` can represent a string shorter than `L` by adding null bytes to the end of that string until it has length `L`.

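
As a rough illustration of the padding rule, the Python sketch below encodes a string as UTF-32 and appends null bytes until it fills a fixed length given in bytes. Little-endian byte order without a byte order mark (`utf-32-le`) is an assumption made for the example, since the text here does not fix a byte order. `pad_utf32` and `unpad_utf32` are hypothetical helper names.

```python
def pad_utf32(text: str, length_bytes: int) -> bytes:
    """Encode `text` as UTF-32 and pad with null bytes to `length_bytes`."""
    encoded = text.encode("utf-32-le")  # 4 bytes per code point, no BOM
    if len(encoded) > length_bytes:
        raise ValueError("string does not fit in the fixed length")
    return encoded.ljust(length_bytes, b"\x00")


def unpad_utf32(raw: bytes) -> str:
    """Decode and drop the trailing null code points added as padding."""
    return raw.decode("utf-32-le").rstrip("\x00")


raw = pad_utf32("hi", 16)
print(len(raw))          # 16
print(unpad_utf32(raw))  # hi
```

One consequence of this scheme is that a string which genuinely ends in a null code point cannot be told apart from a shorter, padded string.
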
### Name

The name of this data type is the string `"fixed_length_utf32"`.

### Configuration

This data type requires a configuration. The configuration for this data type is a JSON object with the following fields:

| field name | type | required | notes |
|------------|----------|---|---|
| `"length_bytes"` | integer | yes | The number MUST represent an integer divisible by 4 in the inclusive range `[0, 2147483644]`. |

> Note: the maximum length of 2147483644 was chosen to match the semantics of the [NumPy `"U"` data type](https://numpy.org/devdocs/reference/arrays.scalars.html#numpy.str_), which as of this writing has a maximum length in code points of 536870911, i.e. 2147483644 / 4.

> Note: given a particular `fixed_length_utf32` data type, the length of an array element in Unicode code points is the value of the `length_bytes` field divided by 4.
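
The constraints on `length_bytes` and the conversion to code points can be sketched in a few lines of Python. This is only an illustration of the rules in the table and notes above; the function name `code_point_length` and the constant `MAX_LENGTH_BYTES` are hypothetical.

```python
MAX_LENGTH_BYTES = 2147483644  # 4 * 536870911, the limit quoted above


def code_point_length(length_bytes: int) -> int:
    """Validate `length_bytes` and return the length in code points."""
    if isinstance(length_bytes, bool) or not isinstance(length_bytes, int):
        raise ValueError("length_bytes must be an integer")
    if not 0 <= length_bytes <= MAX_LENGTH_BYTES:
        raise ValueError("length_bytes is outside [0, 2147483644]")
    if length_bytes % 4 != 0:
        raise ValueError("length_bytes must be divisible by 4")
    return length_bytes // 4


print(code_point_length(4))                 # 1
print(code_point_length(MAX_LENGTH_BYTES))  # 536870911
```
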

### Examples

```json
{
    "name": "fixed_length_utf32",
    "configuration": {
        "length_bytes": 4
    }
}
```

## Fill value representation

The value of the `fill_value` metadata key must be a string. When encoded in UTF-32, the fill value MUST have a length in bytes less than or equal to the value of the `length_bytes` specified in the `configuration` of this data type.
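
A minimal Python sketch of this length check follows. The helper `fill_value_fits` is hypothetical, and the use of the `utf-32-le` codec is an assumption made only to avoid the byte order mark that Python's plain `utf-32` codec prepends.

```python
def fill_value_fits(fill_value: str, length_bytes: int) -> bool:
    """True if the UTF-32 encoding of `fill_value` fits in `length_bytes`."""
    # Every code point takes exactly 4 bytes in UTF-32, so the encoded
    # size does not depend on byte order; "utf-32-le" just avoids the
    # 4-byte byte order mark of Python's plain "utf-32" codec.
    return len(fill_value.encode("utf-32-le")) <= length_bytes


print(fill_value_fits("", 4))    # True
print(fill_value_fits("a", 4))   # True
print(fill_value_fits("ab", 4))  # False
```

With `length_bytes` set to 4, as in the example configuration, only the empty string and single-code-point strings are valid fill values under this rule.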

## Codec compatibility

This data type is compatible with any codec that supports arrays with fixed-sized data types.

## Notes

This data type is designed for NumPy compatibility. UTF-32 is not a good fit for many applications that need to model arrays of strings, as real string datasets are often composed of variable-length strings. A variable-length string data type should be preferred in these cases.

## Change log

No changes yet.

## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)
@@ -0,0 +1,26 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "oneOf": [
    {
      "type": "object",
      "properties": {
        "name": {
          "const": "fixed_length_utf32"
        },
        "configuration": {
          "type": "object",
          "properties": {
            "length_bytes": {
              "type": "integer"
            }
          },
          "required": ["length_bytes"],
          "additionalProperties": false
        }
      },
      "required": ["name", "configuration"],
      "additionalProperties": false
    },
    { "const": "fixed_length_utf32" }
  ]
}

Review comment: I think my initial name for this dtype (`fixed-length-ascii`) was misleading -- it models the numpy `S` dtype, which represents any fixed-length byte string, which includes non-ascii characters. So we should probably rename this to `fixed-length-bytes` or equivalent.

Review comment: i think the key distinction between this dtype (modelling numpy `S*`) and the fixed-length bytes dtype defined elsewhere in this PR (modelling numpy `V*`) is that scalars with this dtype are intended to be interpreted as strings, not opaque sequences of bytes. We should probably decide if we really want two dtypes for this.

Review comment: Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored.
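
The NUL-padding behaviour mentioned in the last comment can be imitated with plain Python `bytes`. The sketch below only imitates the semantics described in the comment and does not call NumPy; `to_fixed` and `from_fixed` are hypothetical helper names.

```python
def to_fixed(value: bytes, length: int) -> bytes:
    """Pad `value` with NUL bytes to exactly `length` bytes."""
    if len(value) > length:
        raise ValueError("value does not fit in the fixed length")
    return value.ljust(length, b"\x00")


def from_fixed(stored: bytes) -> bytes:
    """Ignore trailing NUL bytes, as the comment describes."""
    return stored.rstrip(b"\x00")


stored = to_fixed(b"ab", 3)
print(stored)              # b'ab\x00'
print(from_fixed(stored))  # b'ab'
# A value that really ends in NUL does not round-trip:
print(from_fixed(to_fixed(b"a\x00", 3)))  # b'a'
```

The last line shows why the distinction raised in the thread matters: with NUL-padding, trailing NUL bytes are not preserved, so this interpretation suits strings better than opaque byte sequences.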