Revise how the domain of an array is specified #144
Conversation
This adds support for dimension names (zarr-developers#73) and non-zero origins (zarr-developers#122).
I think the main objection was from Zarr.jl; in particular, I think there is some disagreement as to how interoperability between languages should work as far as index vectors are concerned. The Zarr.jl model is that (i, j, k) in zarr-python should correspond to (k+1, j+1, i+1) in Zarr.jl, while I would prefer that Zarr.jl users just define an array with a Fortran-order layout and inclusive_min of (1, 1, 1) if that is what is desired, so that there is no confusion or inconsistency with indexing. Still, I think there are some options for how this feature could be supported in Zarr.jl while retaining the option to do that coordinate transform.
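To make the correspondence concrete, here is a minimal sketch in Python of the Zarr.jl-style index mapping described above (the helper names are hypothetical, purely for illustration):

```python
def python_to_julia_index(idx):
    """Map a 0-based, C-order index tuple (zarr-python convention) to the
    corresponding 1-based, reversed index tuple (Zarr.jl convention), i.e.
    (i, j, k) in zarr-python corresponds to (k+1, j+1, i+1) in Zarr.jl."""
    return tuple(i + 1 for i in reversed(idx))


def julia_to_python_index(idx):
    """Inverse mapping: 1-based reversed tuple back to 0-based C-order."""
    return tuple(i - 1 for i in reversed(idx))


assert python_to_julia_index((0, 5, 9)) == (10, 6, 1)
assert julia_to_python_index((10, 6, 1)) == (0, 5, 9)
```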
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Perhaps, since no objections have been raised in the last few days, it makes sense to merge this and address any objections as part of the general ZEP 1 discussion? I have some additional changes I'd like to propose that build on this.
I have just read the proposed changes and, as you already mentioned, I would be quite unhappy if we force offset indexing into the core data model. I tried to read #122 carefully and still do not understand why these types of transformations have to be part of the core data model and cannot be defined in an extension. If I implement the defaults you suggest in Julia, imagine I save and publish a dataset with Fortran order and define a domain using 1-based indexing. Now everyone loading my dataset from Python/JavaScript/C++ etc. will have to access my data using 1-based indexing, just because this is what I used in my Julia code? And in addition they have to remember to reverse their loop dimensions for cache-friendly iteration through the dataset. The same applies the other way around: all Julia users will have to deal with 0-based OffsetArrays and PermutedDimsArrays just to make sure there is a 1-1 mapping between array domains across programming languages. I am in full support of the idea of allowing negative chunk indices and custom offsets; I even use an …
I am also opposed to adding offset indexing into the core Zarr data model, for the reasons I outlined in #122 (comment). I recognize that this feature is very important to some user communities, and we should definitely find a way to address it. But I don't think the core specification is the right place. The core specification describes the minimum feature set that all implementations must support. Under the core specification, I believe that each implementation should expose indexing on Zarr arrays using its own natural language-specific indexing convention, i.e. 0-based C-order for Python / C, 1-based F-order for Julia / Fortran, etc. Instead, an extension could be defined which specifies a transformation between the raw index space and the offset index space. Implementations could then choose to expose the offset indexes or the raw indexes. @jbms - can you explain why you think this feature needs to be in the core spec as opposed to an extension?
Thanks for your feedback @rabernat and @meggart. To be clear, this PR only adds support for non-zero origins; it does not add support for arbitrary dimension orders. I was planning to create a PR for arbitrary dimension order support, but I think it would be simpler if for now we just discuss non-zero origin support. As I see it, being able to consistently and unambiguously refer to positions or regions of an array is a core component of an array storage standard. While I can understand that this is not important for some applications, I think it tends to become more important as arrays grow too large to fit in memory, which is precisely the class of applications for which a chunked array format like zarr is most useful. @rabernat asks why non-zero origin support isn't better handled as an extension. I see several possibilities: …
While this feature certainly has implications for the zarr data model, I don't think it imposes a significant implementation burden --- it is just two additional optional parameters that must be managed and taken into account in indexing operations. @meggart raises the concern that this feature might require support for in-memory arrays with non-zero origins in every zarr implementation, e.g. …
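To illustrate the claim about implementation burden, here is a minimal sketch (a hypothetical helper, not spec text) of taking a non-zero origin into account when locating chunks:

```python
def chunk_coords(index, inclusive_min, chunk_shape):
    """Locate the chunk containing `index` in an array whose domain starts
    at `inclusive_min` instead of the zero origin. Returns the per-dimension
    chunk index and the offset within that chunk."""
    chunk_idx = tuple(
        (i - lo) // c for i, lo, c in zip(index, inclusive_min, chunk_shape)
    )
    within = tuple(
        (i - lo) % c for i, lo, c in zip(index, inclusive_min, chunk_shape)
    )
    return chunk_idx, within


# With origin (100, 200) and 10x10 chunks, element (105, 223) lives in
# chunk (0, 2) at in-chunk offset (5, 3).
assert chunk_coords((105, 223), (100, 200), (10, 10)) == ((0, 2), (5, 3))
```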
I'm curious to hear the Unidata perspective. @WardF & @DennisHeimbigner: what would be the implication of putting this feature in the core specification for NetCDF-C? How would you deal with the fact that Zarr arrays could potentially have a non-zero origin, as in this proposal?
We might be able to at least pass info to the user by exposing an attribute that …
I suppose that we can handle this similarly to the existing Fortran solution.
That is correct, but note that the relevant offset …
Summarizing some of the discussion that occurred about this on yesterday's call:

I argued that the Zarr specification should essentially remain agnostic about indexing. Different languages have different index idioms for addressing the elements in ND-arrays. The Zarr implementation in a specific language should use that language's conventions to access these elements. So the element I address as …

If we accept that the spec should remain agnostic about indexing conventions, it makes it very hard to write the spec, because at the end of the day the spec must refer to specific elements in its examples. To mitigate this, we simply state that the spec adopts zero-based C-style indexing when describing specific array elements, and that other implementations may use other indexing conventions.

I argued that anything other than this raw, idiomatic indexing must be considered a coordinate transformation from some coordinate space to raw index space. We can denote this N-dimensional raw index space as [0, n_1) × … × [0, n_N) ⊂ ℤ^N (using, of course, the 0-origin convention). All other coordinate systems are ultimately mappings f: X → ℤ^N from some other space X, where the function f defines the transformation into raw index space. The affine transforms proposed above could obviously be framed in the same way. CF Conventions on coordinates describe a whole system for mapping discrete coordinate values (stored in other arrays within a group) to raw index space. Xarray (and many other similar software applications) leverage this convention to support label-based indexing.

It is obviously my opinion that we should not bring any coordinate transformations into the core Zarr spec. In fact, I don't even think we need extensions for them. As the CF conventions show, such coordinate transforms can be described completely by appropriate attributes, without any spec changes needed. For non-zero origin, it could be as simple as an array attribute like:
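For example (a sketch using zarr-python's attribute API; the attribute name is hypothetical, not an established convention):

```python
import zarr

# Record the origin as an ordinary user attribute, leaving the core data
# model untouched; consumers that understand the convention can apply the
# offset, while everyone else sees a normal zero-origin array.
arr = zarr.open("example.zarr", mode="w", shape=(250, 300), chunks=(10, 10))
arr.attrs["origin"] = [100, 200]  # hypothetical attribute name
```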
If we accept this, the challenges (for Jeremy and others who want this feature) are: …
One place where we have a vaguely similar situation is with …
I can see that there is strong opposition to non-zero origins in zarr, so I won't continue pushing it. As suggested, I can indeed implement support outside of the spec in Neuroglancer and TensorStore using user-defined attributes, and will try to coordinate with other implementations that are interested, such as WebKnossos.

The one aspect of this proposal that is not as conveniently implemented outside of the core spec is support for growing a dimension in the negative direction, or similarly, allowing dimensions to be unbounded in the negative direction. (To use the mathematical notation above, the most significant change in this PR is that Zarr arrays would be indexed by an arbitrary box [min_1, max_1) × … × [min_N, max_N) ⊂ ℤ^N rather than one anchored at the zero origin.) However, that can still be accomplished by choosing an initial origin that is some large number, e.g. …

There is one remaining portion of this PR that may be less controversial, and which hasn't been discussed yet: dimension names. This is a feature that is very easy to implement outside of the core spec, e.g. a simple "dimension_names" user-defined attribute, but there are some advantages to making it part of the core spec: …
I agree--this is a tough one. My main objection was that we don't want to mix the notion of coordinates into the core spec. But I do agree that it would be great to allow extending arrays in both directions, and I think that can be done without introducing coordinates. Maybe we could support this via an optional extension that would operate at the chunk level, i.e. creating chunks like …

I am 100% in favor of dimension labels.
Doesn't this kind of imply it should be an extension?
In fact, this has already happened with xarray and NCZarr. But I'm convinced that could also have been avoided with a clearly documented extension. I think the only question for me is--core spec vs. extension? You seem to be suggesting that things that are not in the core spec will just be ignored. I was hoping that extensions would be a little more binding and widely adopted. Named dimensions have always been held up as an example of what an extension would do. Very curious to get thoughts from others here.
I agree that these augmentations to array semantics are both extremely useful and should be part of an extension. If everyone deems these augmentations essential and we observe that essentially every zarr array in the wild uses the extension, then it can later be added to the core spec.
I think there are a few different options:

a. Don't support the "negative index means counting from the end" convention at all.
…

The TensorStore Python API uses option (a); a sketch of what that looks like at the API level follows below. For zarr-python I think option (b) might be the most practical. Presumably with such an extension, the …

Do I understand correctly that you are supportive of adding support for such an extension to zarr-python?
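A minimal sketch of option (a) — a hypothetical wrapper, loosely modeled on the TensorStore behaviour described above, not actual zarr-python code:

```python
class DomainIndexedArray:
    """Hypothetical wrapper illustrating option (a): indices are interpreted
    as absolute domain coordinates, never as counting from the end, so
    negative values are valid whenever the domain extends below zero."""

    def __init__(self, array, inclusive_min):
        self.array = array  # underlying zero-origin array
        self.inclusive_min = inclusive_min

    def __getitem__(self, index):
        # Translate domain coordinates to underlying 0-based coordinates;
        # no from-the-end reinterpretation of negative values.
        raw = tuple(i - lo for i, lo in zip(index, self.inclusive_min))
        if any(r < 0 for r in raw):
            raise IndexError(f"{index} is outside the array domain")
        return self.array[raw]
```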
I think it would be fine as an extension as well --- I don't think there would be much practical difference. I can create a PR that adds it as an extension if that is preferable. The main thing would be to standardize it early, so that multiple incompatible ways to specify dimension labels aren't developed. I imagined that the core spec could also list some features as optional, but if the idea is that nothing in the core spec is optional, then I can see an extension might be a better fit. Though it also seems plausible that every implementation would support at least the ability to query the list of dimension labels of an open array, and to set the dimension labels when creating an array.
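For instance, the user-defined-attribute variant could be as simple as the following sketch (the attribute name is illustrative, pending standardization):

```python
import zarr

# Hypothetical convention: store dimension names as a plain user attribute.
arr = zarr.open("example.zarr", mode="w", shape=(5, 6, 7), chunks=(5, 6, 7))
arr.attrs["dimension_names"] = ["z", "y", "x"]  # name is illustrative

# Implementations aware of the convention can surface these when querying
# or creating arrays; others simply ignore the attribute.
print(arr.attrs.get("dimension_names"))
```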
I think that after the initial version of the spec is released, extensions become an important technical measure for adding functionality in a way that explicitly indicates whether old implementations unaware of the extension should either ignore it or fail immediately. Prior to the release of the initial version of the spec, for optional features that don't otherwise require e.g. a data type/codec/chunk layout/storage transformer identifier, it is somewhat arbitrary whether it is an extension.
I am perfectly happy for this (negative indices) functionality to be an extension rather than in the core spec; but I think it would be significantly more useful if support for it were also accepted for inclusion in zarr-python.
I don't understand what the trouble is with growing the array in the "negative" direction. Why can't zarr just copy the numpy pad API? I see no need for breaking the conventions for negative indexing here (just as np.pad doesn't break any array indexing conventions).
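For reference, the numpy behaviour being pointed to (note this pads the in-memory array; the open question below is what the analogous operation would mean for stored chunks):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
# Grow the array by one row "in the negative direction" of axis 0.
b = np.pad(a, pad_width=((1, 0), (0, 0)), constant_values=0)

assert b.shape == (3, 3)
assert (b[1:] == a).all()  # the original data now starts at row 1
assert b[0, 0] == 0        # indexing conventions are unchanged
```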
This would require moving every chunk file, no? And on S3 that would be a copy+delete.
@d-v-b raises an interesting point --- if …
Is it baked into the spec that the names of chunks in storage must be positive integers? The name of a chunk in storage is different from how it is addressed from … However, as noted earlier, this would require explicitly tracking the name of the first chunk in some metadata.
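A minimal sketch of that decoupling, assuming the name of the lowest chunk of the domain were tracked in a (hypothetical) metadata field:

```python
def chunk_storage_key(chunk_offset, first_chunk_name):
    """Storage key for the chunk at `chunk_offset` (per-dimension offset from
    the lowest chunk of the current domain), given the stored name of that
    lowest chunk (`first_chunk_name`, a hypothetical metadata field)."""
    return "c/" + "/".join(
        str(n + d) for n, d in zip(first_chunk_name, chunk_offset)
    )


# Initially the lowest chunk along each axis is named 0:
assert chunk_storage_key((0, 0), (0, 0)) == "c/0/0"

# Grow one chunk in the negative direction along axis 0: only the metadata
# field changes, existing chunks keep their storage keys, and the new
# leftmost chunk gets a fresh (here negative) name. Starting the naming at a
# large number instead would keep all names non-negative.
assert chunk_storage_key((1, 0), (-1, 0)) == "c/0/0"   # same chunk, same key
assert chunk_storage_key((0, 0), (-1, 0)) == "c/-1/0"  # newly added chunk
```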
☝️ This strikes me as a really core issue. Curious to get @alimanfoo's take on it.
Since this discussion stalled, I've tried to summarize the different points here, added my and @normanrz's comments, and added some proposals for how to proceed, to keep progressing with this PR. I hope I didn't overlook anything in the discussion above, and am sorry for the long text 🙈

Summary

IMO there are six different points addressed in the discussion (and PR): …
Comments

On min/max (2.)

I'll try to summarize @normanrz's and my thoughts on the proposal to add a minimum: in general, it seems that saving the minimum as well is similar to saving the shape. Regarding negative indices: since negative indices would create problems in APIs, I'd rather avoid them. Left-padding beyond 0 can still be achieved by an indexing offset (4.). In short: I'd just change the allowed ranges for min/max to be positive; otherwise the changes LGTM.

On domain names (1.), domain transformations (3.) and metadata/attribute conventions (6.)

The only question left for this PR IMO is whether this should be part of the attributes or the general metadata. I'd tend to move everything that is not used for indexing/reading/writing into attributes, but domain names might be important enough to be an exception. Domain transformations don't seem to be well-defined atm and don't change anything in the core zarr mechanics, so I'd rather defer this until multiple domains converge on similar terms (e.g. compare ome/ngff#94). I think such conventions can easily be added later.

Next steps

I think we should separate the different points a bit and concentrate on the actual PR proposal at hand. Going through the different points: …
I've pulled out just the addition of dimension names into a separate PR: |