Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specifying how fill_value is handled (if unspecified) #133

Closed
jakirkham opened this issue Feb 16, 2022 · 11 comments · Fixed by #145
Closed

Specifying how fill_value is handled (if unspecified) #133

jakirkham opened this issue Feb 16, 2022 · 11 comments · Fixed by #145

Comments

@jakirkham
Copy link
Member

The Zarr v2 spec leaves undefined fill_values as ambiguous:

If the “fill_value” field is null then the contents of the chunk are undefined.

However if a user writes to a portion of a chunk with different implementations, each implementation will now potentially have different chunks depending on how they handle the fill_value.

This can also cause confusion in other contexts ( for example zarr-developers/zarr-python#966 ).

Also this gets mentioned in James' overview in issue ( #53 ).

Of course there are some advantages of leaving this unspecified. Namely one can use uninitialized memory to allocate each chunk for writing into, which would be a bit faster. Also this can more easily handle new or complicated types where an appropriate default fill_value is not entirely obvious.

That said, given some of the issues above, wonder if we should take a different tack and specify fill_value for types in v3. Thoughts? 🙂

@d-v-b
Copy link
Contributor

d-v-b commented Feb 16, 2022

What about defining the default fill value for a datatype to be the instance of that type with a binary representation of all 0s? Are there any types where this would be counter-intuitive or bad somehow?

@jakirkham
Copy link
Member Author

What does that mean for bytes, str, object, etc.?

@d-v-b
Copy link
Contributor

d-v-b commented Feb 16, 2022

I don't know, I never work with those types :) But this "all 0s in its binary representation" idea doesn't seem to make any sense for variable-length types, so it's probably a non-starter. Another option would be to just require a fill_value. Explicit is better than implicit and all that.

@jbms
Copy link
Contributor

jbms commented Feb 22, 2022

I agree --- fill_value should be required.

There are a few meanings you could potentially assign to an unspecified fill_value:

  1. Same as specifying zero or other default value for the type, like empty string.
  2. Read returns uninitialized memory --- this is a potential security vulnerability because it may leak information like encryption keys, credentials, etc.
  3. Reading a missing chunk returns an error. However this leads to inconsistent behavior for partially-written chunks, since there is no way to indicate that it is partially written.

Since 1 is the only reasonable option I don't see any advantage in allowing an unspecified fill value.

@jakirkham
Copy link
Member Author

@WardF how does NetCDF handle this?

@dopplershift
Copy link

NetCDF has a default fill value for each of the types, at least for the classic (netCDF3) format. See the end of the grammar in the file format spec. It's unclear what the defaults are for the rest of the types added for the netCDF4 format, so I'll leave that to @WardF .

@jakirkham
Copy link
Member Author

Related is how we handle encoding of some more atypical fill values. We discussed this briefly during the community meeting. Tried to summarize below (though please feel free to correct me).

Currently the fill value has a default value concept, which applies to both during construction (it can be None, but can be deduced to mean something) and in the metadata itself (where it can be null, but is determined at runtime). The questions are

  1. Should users be required to define a fill value at construction time?
  2. Should the metadata allow fill values to be null?

For 1, this can either be required or we add some kind of lookup table for determining this. Sounds like NetCDF has the latter. That all being said, as this is a question of the API and not the storage format itself. So perhaps this can be left up to the implementations at present.

For 2, it sounds like other storage implementations (like NetCDF) don't do this and instead always store the fill value. Perhaps a good first step to resolving this issue would be to do the same in v3.

Separately there was some discussion around how to handle encoding fill values of more unusual types. Issue ( zarr-developers/zarr-python#216 ) came up. In particular @d-v-b brought up base64 encoding ( zarr-developers/zarr-python#216 (comment) ). Also @manzt mentioned HTTP handles it this way, which we could borrow from.

@jakirkham
Copy link
Member Author

Trying to address in PR ( #145 )

@manzt
Copy link
Member

manzt commented Jun 1, 2022

Also worth nothing that https://github.com/fsspec/kerchunk just uses a base64: prefix for encoding binary data in a JSON. somewhat buried in the spec...

the str format of a reference value may be:

    a string starting "base64:", which will be decoded to binary
    any other string, interpreted as ascii data

Not sure where this convention comes, but it's a more simple alternative to a full data-uri (data:application/octet-stream;base64,<data>).

@joshmoore
Copy link
Member

With an eye on #145 and the upcoming review of ZEP0001, is there a path forward here?

@joshmoore
Copy link
Member

Just a heads up that there may be some interesting dangling conversations here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants