
refactor v3 data types #2874


Open · wants to merge 154 commits into base: main

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different classes also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.
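To make the idea concrete, here is a minimal sketch of what one such class could look like. All names here are illustrative, not the PR's actual API: a parametric dtype is a class whose instances carry the parameter, and each class knows its default value and how its scalars round-trip through JSON.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedLengthUnicode:
    """Hypothetical parametric dtype: instances carry a length parameter."""

    length: int

    def default_value(self) -> str:
        # Each dtype class defines its own fill/default value.
        return ""

    def to_json_value(self, value: str) -> str:
        # Unicode scalars serialize directly as JSON strings.
        return str(value)

    def from_json_value(self, data: str) -> str:
        # Clamp to the declared length when reading back.
        return str(data)[: self.length]


u3 = FixedLengthUnicode(length=3)
print(u3.from_json_value("hello"))  # hel
```

The point is that `FixedLengthUnicode(2)` and `FixedLengthUnicode(3)` are two parametrized instances of one class, rather than two enum variants.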

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 28, 2025
@d-v-b
Contributor Author

d-v-b commented Feb 28, 2025

copying a comment @nenb made in this zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc. as instances of the same class. If instead we bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
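A rough sketch of the distinction being described (class names are illustrative, not the PR's code): binding the character code at class level lets one class produce every parametrized instance, whereas binding a concrete `np.dtype` would force one class per length.

```python
import numpy as np


class UnicodeType:
    """Hypothetical wrapper: the class represents the parametric type U*."""

    char_code = "U"  # parametric: length deliberately unspecified here

    def __init__(self, length: int):
        # Instances are the parametrized types U2, U3, ...
        self.length = length

    def to_numpy(self) -> np.dtype:
        # Build the concrete numpy dtype only at instance level.
        return np.dtype(f"{self.char_code}{self.length}")


print(UnicodeType(2).to_numpy())  # <U2 on a little-endian machine
print(UnicodeType(3).to_numpy())  # <U3 on a little-endian machine
```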

@nenb

nenb commented Mar 3, 2025

Summarising from a zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

@d-v-b
Contributor Author

d-v-b commented Mar 4, 2025

> Summarising from a zulip discussion:
>
> @nenb: How is the endianness of a dtype handled?
>
> @d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.
>
> Proposed solution: Make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

Thanks for the summary! I have implemented the proposed solution.
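The shape of the proposed solution could look something like this (a hypothetical sketch, not the PR's actual classes): endianness lives on the dtype instance for zarr-python's internal use, while the v3 on-disk name omits it, since the v3 spec puts byte order in the codecs.

```python
from dataclasses import dataclass
from typing import Literal

import numpy as np


@dataclass(frozen=True)
class Int32:
    """Hypothetical zarr dtype with endianness as an instance attribute."""

    endianness: Literal["little", "big"] = "little"

    def to_numpy(self) -> np.dtype:
        # Internal use: zarr-python needs a concrete byte order in memory.
        order = "<" if self.endianness == "little" else ">"
        return np.dtype(f"{order}i4")

    def to_v3_json(self) -> str:
        # On-disk v3 metadata: the name carries no byte order, per the spec.
        return "int32"


dt = Int32(endianness="big")
print(dt.to_numpy())    # >i4
print(dt.to_v3_json())  # int32
```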

@d-v-b d-v-b mentioned this pull request Mar 5, 2025
@nenb

nenb commented May 29, 2025

Expressing my personal opinion here: I would be very much in favour of getting this into a pre-release so that I can test it with a variety of new data types and see what happens. Like you have pointed out, testing in the pre-release will likely be the best way to motivate what parts of the documentation and the (internal) API need to be improved before an actual release.

Great work!

@d-v-b
Contributor Author

d-v-b commented Jun 3, 2025

@ianhi @dstansby if you have the time could you check and see if the changes you requested have all been addressed?

@dstansby
Contributor

dstansby commented Jun 3, 2025

Sorry, I'm not going to have time to re-review this in the near future. If you've looked at all my previous comments, feel free to dismiss my review.

@d-v-b
Contributor Author

d-v-b commented Jun 3, 2025

> Sorry, I'm not going to have time to re-review this in the near future. If you've looked at all my previous comments, feel free to dismiss my review.

No worries, and thanks for your patience with this big PR. I'll review your feedback; some things might get spun out into issues (I'm thinking about changes in the config).

Contributor

@ianhi ianhi left a comment


I agree with @nenb on getting this in and into a pre-release

**docs**

I'm 👍 on the understanding that a minimal implementation of #3090 is necessary for a non-pre-release.

I think this combines with my comments about doc strings for the dtype registry, but that is also impacted by how those are eventually dealt with.

**dtype resolution**

> But this is complicated by the numpy void dtype, which supports two separate zarr data types: fixed-length raw bytes and structured dtypes. One solution would be to add numpy void to the list of data types we don't dynamically resolve, and then I think the static {numpy dtype: zarr dtype} mapping can work. But I consider this a refinement of the content in this PR, and suitable for a separate effort.

I think this is an important refinement, as it will make creating a new dtype much simpler. But I agree that if we try to add it here, this will never finish.
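The wrinkle being discussed can be sketched as follows (names and mapping are illustrative only): a static lookup works for most numpy dtypes, but `np.void` is ambiguous, because a void dtype with fields is a structured dtype while one without fields is fixed-length raw bytes, so it has to be excluded from the static mapping and inspected dynamically.

```python
import numpy as np

# Hypothetical static {numpy dtype kind: zarr dtype name} mapping.
STATIC_MAPPING = {"i": "int", "u": "uint", "f": "float"}


def resolve(dt: np.dtype) -> str:
    """Resolve a numpy dtype to an (illustrative) zarr dtype name."""
    if dt.kind == "V":
        # numpy void covers two zarr data types; inspect the fields
        # attribute to tell them apart instead of using the static map.
        return "structured" if dt.fields is not None else "raw_bytes"
    if dt.kind == "b":
        return "bool"
    return f"{STATIC_MAPPING[dt.kind]}{dt.itemsize * 8}"


print(resolve(np.dtype("int32")))                     # int32
print(resolve(np.dtype("V4")))                        # raw_bytes
print(resolve(np.dtype([("x", "i4"), ("y", "f8")])))  # structured
```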

@d-v-b
Contributor Author

d-v-b commented Jun 6, 2025

@zarr-developers/core-devs sorry for the ping, but this PR is a) basically ready to go and b) necessary to solve a basic problem in this library. It would be great to get your eyes on it, because we have zarr users who need these features, which means we will be merging it soon.

In particular, I am curious to hear any objections or concerns regarding the core strategy here: anything that would suggest that this PR should be put on hold, or even scrapped. Now is a great time to voice those concerns.

@d-v-b
Contributor Author

d-v-b commented Jun 6, 2025

oops, I pinged a group with no members. trying again with @zarr-developers/python-core-devs

@d-v-b
Contributor Author

d-v-b commented Jun 6, 2025

a quick last-minute fix: I was modelling the numpy S* dtypes as ascii-encoded strings, but they are actually null-terminated byte strings, and the only ascii-like thing about them is the use of ascii decoding for the repr of those bytes. So I have refactored the fixed-length bytes types into RawBytes (numpy V*, with no fields attribute), NullTerminatedBytes (numpy S*), and ~~FixedLengthBytes~~ VariableLengthBytes (numpy |O with the object_codec set to vlen-bytes). I also added proper tests for this latter dtype.
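The S*/V* distinction is easy to demonstrate in numpy itself: the S dtype silently strips trailing null bytes (hence "null-terminated bytes" rather than an ascii string type), while the void dtype V* preserves raw bytes exactly.

```python
import numpy as np

# S3: trailing null bytes are dropped on read, so round-tripping
# arbitrary byte payloads through S* is lossy.
s = np.array([b"ab\x00"], dtype="S3")
print(s[0])  # b'ab' -- the trailing null is gone

# V3: the void dtype keeps the buffer byte-for-byte.
v = np.frombuffer(b"ab\x00", dtype="V3")
print(v[0].tobytes())  # b'ab\x00' -- raw bytes preserved
```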

@d-v-b d-v-b dismissed dstansby’s stale review June 6, 2025 16:36

he gave me permission to dismiss: #2874 (comment)

Comment on lines -81 to -83
"numeric": {"id": "zstd", "level": 0, "checksum": False},
"string": {"id": "zstd", "level": 0, "checksum": False},
"bytes": {"id": "zstd", "level": 0, "checksum": False},
Member

Should these and the other removals below get added to a deprecations config? See https://donfig.readthedocs.io/en/latest/configuration.html#deprecations

Contributor Author

yes, I'm wondering if that should happen in this PR or in a follow-up?

Member

IMO a follow-up PR is best since the most helpful/important next step is enabling community testing via a pre-release. So maybe just a todo or issue to resolve this comment?

Contributor Author

I have an issue tracking config-related changes for 3.1; I will paraphrase and link to your suggestion there.

@LDeakin
Member

LDeakin commented Jun 7, 2025

> FixedLengthBytes (numpy |O with the object_codec set to vlen-bytes)

Do you mean VariableLengthBytes here? Also FixedLengthBytesJSONV3 appears to be unused.

@d-v-b
Contributor Author

d-v-b commented Jun 7, 2025

> > FixedLengthBytes (numpy |O with the object_codec set to vlen-bytes)
>
> Do you mean VariableLengthBytes here? Also FixedLengthBytesJSONV3 appears to be unused.

Yes, my mistake. Indeed that should be VariableLengthBytes. And I will explore why that typeddict is not being used!

edit: it was unused because it had been replaced by a newer typeddict.
