-
-
Notifications
You must be signed in to change notification settings - Fork 327
refactor v3 data types #2874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
refactor v3 data types #2874
Conversation
…into feat/fixed-length-strings
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like |
src/zarr/core/metadata/dtype.py
Outdated
name: ClassVar[str] | ||
dtype_cls: ClassVar[type[TDType]] # this class will create a numpy dtype | ||
kind: ClassVar[DataTypeFlavor] | ||
default_value: TScalar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child classes define a string name
(which feeds into the zarr metadata), a dtype class dtype_cls
(which gets assigned automatically from the generic type parameter) , a string kind
(we use this for classifying scalars internally), and a default value (putting this here seems simpler than maintaining a function that matches dtype to default value, but we could potentially do that)
src/zarr/core/metadata/dtype.py
Outdated
class IntWrapperBase(DTypeWrapper[TDType, TScalar]): | ||
kind = "numeric" | ||
|
||
@classmethod | ||
def from_dtype(cls, dtype: TDType) -> Self: | ||
return cls() | ||
|
||
def to_json_value(self, data: np.generic, zarr_format: ZarrFormat) -> int: | ||
return int(data) | ||
|
||
def from_json_value( | ||
self, data: JSON, *, zarr_format: ZarrFormat, endianness: Endianness | None = None | ||
) -> TScalar: | ||
if check_json_int(data): | ||
return self.to_dtype(endianness=endianness).type(data) | ||
raise TypeError(f"Invalid type: {data}. Expected an integer.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use inheritance for dtypes like the integers, which really only differ in their concrete dtype + scalar types.
object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec)) | ||
object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec)) | ||
object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard)) | ||
# TODO: fix these when we don't get hashability errors for certain numpy dtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to fix this. I think the LRU store cache was attempting to hash a non-hashable numpy dtype, and this caused very hard to debug errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to open an issue and link to it in this comment.
Summarising from a zulip discussion: @nenb: How is the endianness of a dtype handled? @d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype. Proposed solution: Make |
Thanks for the summary! I have implemented the proposed solution. |
@d-v-b - I have a bunch of zarr stores in our production system with datetime variables, so we are currently blocked from upgrading to v3. @jhamman indicated in #2616 that this pull request will add support for datetime variables. It looks like there's a lot going on here, but do you have a sense for when this will be merged and there will be a release we can use? Thank you! |
…at/fixed-length-strings
ok, so today I learned that renaming a branch automatically closes the associated PR. good to know! |
I think it's that if branch A was split off from branch B, and you delete branch B, any PRs to merge A into B/main will be automatically closed. |
I realized that the strategy of splitting this PR is not very practical. The main reason is that on the v2 codepath, we already support (and test) structured dtypes and datetimes. So if the new dtype abstraction is sufficient to pass the tests we already have, then it has to support those dtypes. I could do In the interest of making things easier to digest, I created a |
adding timedelta64 revealed some issues with array indexing, which are fixed in #3027 |
…into feat/fixed-length-strings
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In
main
,all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, that are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes is probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.