
refactor v3 data types #2874


Open: d-v-b wants to merge 87 commits into main from feat/fixed-length-strings

Conversation

@d-v-b (Contributor) commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors data types on the zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, confining the v2 / v3 differences to the places where they actually vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are the way to go here. Separating the different data types into different classes also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, as well as methods for converting its scalars to and from JSON.
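To make that concrete, here is a minimal sketch of the kind of class-based model described above; the names and method signatures are illustrative, not the PR's actual code:

from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class DateTime64:
    # parametrized by a time unit, e.g. "s", "ms", "ns"
    unit: str = "ns"

    def to_dtype(self) -> np.dtype:
        return np.dtype(f"datetime64[{self.unit}]")

    def default_value(self) -> np.datetime64:
        # per-class default, instead of one global dtype -> default lookup
        return np.datetime64("NaT")

    def to_json_value(self, data: np.datetime64) -> int:
        # scalars serialize as integers in the given unit
        return int(data.astype(np.int64))

    def from_json_value(self, data: int) -> np.datetime64:
        return np.datetime64(data, self.unit)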

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions bot added the needs release notes label Feb 28, 2025
@d-v-b (Contributor, Author) commented Feb 28, 2025

Copying a comment @nenb made in this Zulip discussion:

The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like: jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of that class can be initialized with a "length" parameter, and then U2, U3, etc. are just instances of the same class. If we instead bind a concrete numpy dtype as a class attribute, we need a separate class for U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
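To illustrate the distinction (a sketch, not the PR's code): binding the character code rather than a concrete dtype lets a single class cover every length:

import numpy as np

class UnicodeWrapper:
    char_code = "U"  # the parametric type: covers U1, U2, U3, ...

    def __init__(self, length: int) -> None:
        self.length = length  # the parameter

    def to_dtype(self) -> np.dtype:
        # character code + length -> a parametrized instance like np.dtype("U3")
        return np.dtype(f"{self.char_code}{self.length}")

u2 = UnicodeWrapper(2)  # one class ...
u3 = UnicodeWrapper(3)  # ... many parametrizations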

name: ClassVar[str]
dtype_cls: ClassVar[type[TDType]] # this class will create a numpy dtype
kind: ClassVar[DataTypeFlavor]
default_value: TScalar
@d-v-b (Contributor, Author):

Child classes define a string name (which feeds into the zarr metadata), a dtype class dtype_cls (which gets assigned automatically from the generic type parameter), a string kind (used internally for classifying scalars), and a default value (putting this here seems simpler than maintaining a function that maps each dtype to a default value, but we could potentially do that).
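For illustration, a concrete child class filling in those attributes might look roughly like this (the attribute values are assumptions, and DTypeWrapper is the base class from the snippet above):

class Bool(DTypeWrapper[np.dtype[np.bool_], np.bool_]):
    name = "bool"                    # feeds into the zarr metadata
    dtype_cls = np.dtypes.BoolDType  # assigned automatically from the type parameter
    kind = "boolean"                 # internal scalar classification
    default_value = np.False_        # per-dtype default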

Comment on lines 268 to 283
class IntWrapperBase(DTypeWrapper[TDType, TScalar]):
    kind = "numeric"

    @classmethod
    def from_dtype(cls, dtype: TDType) -> Self:
        return cls()

    def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int:
        return int(data)

    def from_json_value(
        self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None
    ) -> TScalar:
        if check_json_int(data):
            return self.to_dtype(endianness=endianness).type(data)
        raise TypeError(f"Invalid type: {data}. Expected an integer.")
@d-v-b (Contributor, Author):

I use inheritance for dtypes like the integers, which really only differ in their concrete dtype and scalar types.
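The concrete integer wrappers then reduce to little more than a name, for example (illustrative, not the PR's exact code):

class Int8(IntWrapperBase[np.dtype[np.int8], np.int8]):
    name = "int8"
    default_value = np.int8(0)

class Int16(IntWrapperBase[np.dtype[np.int16], np.int16]):
    name = "int16"
    default_value = np.int16(0)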

object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))
# TODO: fix these when we don't get hashability errors for certain numpy dtypes
@d-v-b (Contributor, Author):

Note to self: fix this. I think the LRU cache was attempting to hash a non-hashable numpy dtype, and this caused very hard-to-debug errors.
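For context, functools.lru_cache hashes every argument of the wrapped callable, so an unhashable argument fails at call time, often far from the code that introduced it. A generic reproduction (the function name is hypothetical, not the exact failure in this PR):

from functools import lru_cache

@lru_cache
def chunk_spec_for(dtype_like):
    return dtype_like

chunk_spec_for("int8")       # fine: strings are hashable
chunk_spec_for(["int8", 4])  # raises TypeError: unhashable type: 'list'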

A contributor replied:

Would be good to open an issue and link to it in this comment.

@nenb commented Mar 3, 2025

Summarising from a Zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

@d-v-b (Contributor, Author) commented Mar 4, 2025


Thanks for the summary! I have implemented the proposed solution.
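For readers following along, the shape of that solution is roughly the following (field and method names are illustrative, not necessarily what the PR uses):

from dataclasses import dataclass
from typing import Literal

import numpy as np

Endianness = Literal["little", "big"]

@dataclass(frozen=True)
class Int16Wrapper:
    # an implementation detail: used to build the in-memory numpy dtype,
    # never serialized into v3 metadata, where byte order belongs to the codecs
    endianness: Endianness = "little"

    def to_dtype(self) -> np.dtype:
        return np.dtype(">i2" if self.endianness == "big" else "<i2")

    def to_json(self) -> str:
        return "int16"  # no byte-order information on disk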

@d-v-b mentioned this pull request Mar 5, 2025
@github-actions bot added the needs release notes label Apr 16, 2025
@christine-e-smit (Contributor) commented:
@d-v-b - I have a bunch of zarr stores in our production system with datetime variables, so we are currently blocked from upgrading to v3. @jhamman indicated in #2616 that this pull request will add support for datetime variables. It looks like there's a lot going on here, but do you have a sense for when this will be merged and there will be a release we can use? Thank you!

@github-actions bot removed the needs release notes label Apr 28, 2025
@d-v-b closed this Apr 29, 2025
@d-v-b deleted the feat/fixed-length-strings branch Apr 29, 2025 13:32
@github-project-automation bot moved this from In progress to Done in zarr-python release 3.1.0 Apr 29, 2025
@d-v-b (Contributor, Author) commented Apr 29, 2025

ok, so today I learned that renaming a branch automatically closes the associated PR. good to know!

@d-v-b restored the feat/fixed-length-strings branch Apr 29, 2025 13:40
@d-v-b reopened this Apr 29, 2025
@TomNicholas (Member) commented:
I think it's that if branch A was split off from branch B, and you delete branch B, any PRs to merge A into B/main will be automatically closed.

@d-v-b (Contributor, Author) commented Apr 29, 2025

Thoughts on this strategy?

I'm a fan! Are you thinking of keeping this PR as the "new dtype abstraction" PR, and then new PRs for the dtypes?

I realized that the strategy of splitting this PR is not very practical. The main reason is that on the v2 codepath, we already support (and test) structured dtypes and datetimes. So if the new dtype abstraction is sufficient to pass the tests we already have, then it has to support those dtypes. I could do timedelta64 in a separate PR, but I'm not sure it's worth the work (I suspect that dtype wrapper will be very similar to datetime64). So I'm going to continue with this PR, unless anyone stridently objects.

In the interest of making things easier to digest, I created a dtype.npy module with all the numpy-specific dtypes. Each data type family (ints, floats, complex, times, etc.) is defined in its own submodule. Previously, all the dtypes were defined in a single very long file. I think this is more readable, but I'm open to feedback.
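Roughly, that gives a layout like the following (paths and file names are indicative, not exact):

zarr/core/dtype/npy/
    int.py      # integer dtypes
    float.py    # floating-point dtypes
    complex.py  # complex dtypes
    string.py   # fixed-length string dtypes
    time.py     # datetime64 and timedelta64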

@d-v-b (Contributor, Author) commented Apr 30, 2025

Adding timedelta64 revealed some issues with array indexing, which are fixed in #3027.
