Skip to content

feature(data-types): add data-types from https://github.com/zarr-developers/zarr-python/pull/2874 #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Apr 15, 2025

This PR adds the definition of the 4 new data types from zarr-developers/zarr-python#2874. They include:

  • datetime64
  • fixed-length-ascii
  • fixed-length-bytes
  • fixed-length-ucs4

Closes #4

xref: zarr-developers/zarr-python#2874

…-bytes, and fixed-length-ucs4 data-types from zarr-python

## Permitted fill values

The value of the `fill_value` metadata key must be a signed 64-bit integer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to stick with the integer representation, we should clarify that this is a JSON representation of an integer between -2^63 and 2^63 - 1, inclusive. But as this data type is totally new for zarr v3 and we don't have to worry about backwards compatibility in the json format w.r.t. v2, we could consider a more date-like human-readable alternative to an int. I'd love to get feedback from heavy datetime users on what they would expect.

}
```

## Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If numpy compatibility is still the design goal for this dtype, I think it would make sense to say something about compatibility with the numpy datetime dtype, e.g. with a link to the numpy docs .

@@ -0,0 +1,33 @@
# Fixed-length ASCII data type

Defines a data type for fixed-length ASCII strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my initial name for this dtype (fixed-length-ascii) was misleading -- it models the numpy S dtype, which represents any fixed-length byte string, which includes non-ascii characters. So we should probably rename this to fixed-length-bytes or equivalent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the key distinction between this dtype (modelling numpy S*) and the fixed-length bytes dtype defined elsewhere in this PR (modelling numpy V*) is that scalars with this dtype are intended to be interpreted as strings, not opaque sequences of bytes. We should probably decide if we really want two dtypes for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numpy also assumes NUL-padding, i.e. that any NUL characters at the end can be ignored.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need more constraints here, such as as length constraint. for reference, in zarr v2 the fill value was a base64-encoded ascii string but it didn't constrain the length.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the length constraint makes sense. But not critical for merging this, imo.


## Notes

Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say something about the epoch (presumably unix epoch) and time standard (presumably UTC) it uses.

@d-v-b
Copy link
Contributor

d-v-b commented Apr 15, 2025

instead of two fixed-length string data types (ascii and ucs4), we could consider a single fixed-length string dtype that's parametrized by the string encoding.


## Notes

Valid values for the `unit` configuration option include: `["Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as"]`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the units listed after nanoseconds cannot represent dates beyond 1970 within 64-bits and therefore are presumably not very useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of this sentence is correct.

The second is not. There is a lot of use cases in the (paleo-)climate space for dates that go much further back in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the range of 64-bit picoseconds is just a few months forward or backward from the epoch.

Presumably picoseconds would only be useful either with more than 64 bits or for a time duration rather than time point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it helpful if we define the goal of this datetime type as "give numpy datetime64 users a zarr data type they can use without thinking"? In this case, we optimize for compatibility and accept whatever decisions the numpy folks have made. We can always define a separate datetime data type that avoids perceived defects in the numpy datetime dtype, and we can point users who don't need maximal numpy compatibility to the better data type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm asking because, whether it's sensible or not, ps is a unit numpy datetime64 supports. Instead of litigating that decision, or any other numpy decision, we could optimize for compatibility here, which would simplify the task of this PR considerably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment regarding datetime: numpy supports a special NaT (not a time) value. Is that intended to be supported here?

yes i believe that should be supported, and so it should be defined in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those units don't hurt anything they just are very unlikely to be used intentionally. Numpy supports the same units for both datetime and timedelta --- are you intending to also add timedelta as well?

yes, timedelta is planned.

"type": "object",
"properties": {
"unit": {
"type": "string"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be listed as enum rather than string

@jbms
Copy link
Contributor

jbms commented Apr 15, 2025

Regarding the datetime type generally, I think I'd be happier if that were handled by some sort of unit attribute that was independent of the data type.

For how this is handled by the udunits2 package, see: https://docs.unidata.ucar.edu/udunits/current/udunits2lib.html#Time

In udunits the unit would be specified as "seconds since 1970-01-01".

The complication with making it a data type is that you have to specify how it is handled by every single codec.

@jbms
Copy link
Contributor

jbms commented Apr 15, 2025

All of these need to specify which array -> bytes codecs support them (presumably just bytes), and how they are encoded.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the length specified in bits? What does it mean if you specify a length that is not a multiple of 8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think specifying the length in bits is a zarr-python implementation detail that we shouldn't have to deal with in the spec. i think a length in bytes is better here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that a "native" length key would be more intuitive for all of these dtypes. With "native", I mean 1-byte-lengths for ascii and bytes and 4-byte-lengths for utf-32.

@@ -0,0 +1,33 @@
# Fixed-length Unicode string data type

Defines a data type for fixed-length Unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should also include a recommendation NOT to use this type except for compatibility with existing datasets, because UTF-32 encoding is basically never a good choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should include a note that it is intended to only contain valid unicode strings, and that implementations may return an error if given invalid data (e.g. utf-16 surrogate pairs, out-of-range characters) if that is what is intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should include a note that NUL characters at the end are to be ignored, if that is what is intended.

"data_type": "fixed-length-ucs4",
"fill_value": "",
"configuration": {
"length_bits": 24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 24, which is not a multiple of 32, is confusing.

"configuration": {
"type": "object",
"properties": {
"length_bits": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the length in bits, and what happens if it is not a multiple of 32?

@@ -0,0 +1,33 @@
# Fixed-length Unicode string data type

Defines a data type for fixed-length Unicode strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think utf-32 is now more commonly used and may be a better choice than ucs-4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I mean the name "utf-32".

@jhamman
Copy link
Member Author

jhamman commented Apr 16, 2025

Thanks @d-v-b and @jbms for the initial reviews. I'll take a second pass tomorrow.

Comment on lines 15 to 19
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"data_type": "datetime64",
"fill_value": -9223372036854775808,
"configuration": {
"unit": "s"
},
"data_type": { "name": "datetime64", "configuration": {"unit":"s"} },
"fill_value": -9223372036854775808,

"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "datetime64" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the configuration (with its unit field) is mandatory, the short-hand should not be permissible.

@normanrz
Copy link
Member

normanrz commented May 6, 2025

Since you mentioned that you are still working on this, I turned this PR into a draft PR, in accordance with #13. Feel free to switch to a regular PR as soon as you think it is ready.

@jhamman jhamman marked this pull request as ready for review May 29, 2025 16:19
Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a few open discussion items that I think would be nice to resolve.
However, the PR already meets the formal criteria for merging. Please let me know.

Comment on lines 3 to 25
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "fixed_length_utf32"
},
"configuration": {
"type": "object",
"properties": {
"length_bits": {
"type": "integer"
}
},
"required": ["length_bytes"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "fixed_length_utf32" }
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since a configuration with length_bytes is required, the short form should not be permissible. The same applies to the other dtypes.

Suggested change
"oneOf": [
{
"type": "object",
"properties": {
"name": {
"const": "fixed_length_utf32"
},
"configuration": {
"type": "object",
"properties": {
"length_bits": {
"type": "integer"
}
},
"required": ["length_bytes"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
},
{ "const": "fixed_length_utf32" }
]
{
"type": "object",
"properties": {
"name": {
"const": "fixed_length_utf32"
},
"configuration": {
"type": "object",
"properties": {
"length_bytes": {
"type": "integer"
}
},
"required": ["length_bytes"],
"additionalProperties": false
}
},
"required": ["name", "configuration"],
"additionalProperties": false
}


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the length constraint makes sense. But not critical for merging this, imo.


## Permitted fill values

The value of the `fill_value` metadata key must be a string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that a "native" length key would be more intuitive for all of these dtypes. With "native", I mean 1-byte-lengths for ascii and bytes and 4-byte-lengths for utf-32.

d-v-b and others added 2 commits June 2, 2025 19:36
@d-v-b
Copy link
Contributor

d-v-b commented Jun 3, 2025

A question about fill values: the v2 spec required that string fill values be encoded via bas64. Do we want to use the same encoding here?

@normanrz
Copy link
Member

normanrz commented Jun 3, 2025

A question about fill values: the v2 spec required that string fill values be encoded via bas64. Do we want to use the same encoding here?

I was wondering the same thing. For ascii it seems a bit pointless. For bytes and utf-32 it might be more intuitive/safer to encode, though.

@d-v-b
Copy link
Contributor

d-v-b commented Jun 3, 2025

I imagine base64 encoding the ascii was necessary for unprintable characters, and the same constraint might lead us to base64 encode unicode, too, but the json spec defines a scheme for escaping any unicode character, so maybe we should just use that, for ascii and unicode data types alike? This has the advantage of working automatically with JSON libraries:

json.loads(b'{"foo": "\u0000"}')
{'foo': '\x00'}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Naming the dtypes from zarr-python#2874
4 participants