Specifying how `fill_value` is handled (if unspecified) #133

jakirkham · 2022-02-16T18:50:33Z

The Zarr v2 spec leaves the behavior of an unspecified fill_value ambiguous:

If the “fill_value” field is null then the contents of the chunk are undefined.

However, if a user writes to a portion of a chunk with different implementations, each implementation may end up with a different chunk, depending on how it handles the fill_value.
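As a rough illustration of that divergence, here is a minimal sketch. The two "implementations" and their implicit defaults (0 and NaN) are hypothetical, as is the helper name `write_partial_chunk`; nothing here is taken from an actual Zarr library.

```python
import struct


def write_partial_chunk(chunk_len, implicit_fill, start, values):
    """Create a new chunk pre-filled with the implementation's own default,
    then overwrite only the elements the user actually wrote."""
    chunk = [implicit_fill] * chunk_len
    chunk[start:start + len(values)] = values
    # Encode as little-endian float64, roughly how a raw chunk would be stored.
    return struct.pack(f"<{chunk_len}d", *chunk)


user_values = [1.0, 2.0]

# Hypothetical implementation A treats a null fill_value as 0.
chunk_a = write_partial_chunk(4, 0.0, 0, user_values)
# Hypothetical implementation B treats a null fill_value as NaN.
chunk_b = write_partial_chunk(4, float("nan"), 0, user_values)

# The region the user wrote (2 elements * 8 bytes) is identical...
same_written_region = chunk_a[:16] == chunk_b[:16]
# ...but the stored chunks as a whole are not.
same_whole_chunk = chunk_a == chunk_b
```

The user wrote the same two values in both cases, yet the stored chunks differ in the elements nobody wrote.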

This can also cause confusion in other contexts ( for example zarr-developers/zarr-python#966 ).

Also this gets mentioned in James' overview in issue ( #53 ).

Of course there are some advantages of leaving this unspecified. Namely one can use uninitialized memory to allocate each chunk for writing into, which would be a bit faster. Also this can more easily handle new or complicated types where an appropriate default fill_value is not entirely obvious.

That said, given some of the issues above, wonder if we should take a different tack and specify fill_value for types in v3. Thoughts? 🙂


d-v-b · 2022-02-16T23:12:17Z

What about defining the default fill value for a datatype to be the instance of that type with a binary representation of all 0s? Are there any types where this would be counter-intuitive or bad somehow?

jakirkham · 2022-02-16T23:16:10Z

What does that mean for bytes, str, object, etc.?

d-v-b · 2022-02-16T23:23:49Z

I don't know, I never work with those types :) But this "all 0s in its binary representation" idea doesn't seem to make any sense for variable-length types, so it's probably a non-starter. Another option would be to just require a fill_value. Explicit is better than implicit and all that.
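To make the "all 0s in its binary representation" idea concrete, here is a small sketch using Python's `struct` module. The helper name `zero_bits_default` is made up for illustration, and `struct` format codes stand in for Zarr dtypes. It only works because each format has a fixed size, which is exactly what variable-length types lack.

```python
import struct


def zero_bits_default(fmt):
    """Return the value whose binary representation is all zero bytes
    for a fixed-size struct format."""
    size = struct.calcsize(fmt)  # only defined for fixed-size formats
    return struct.unpack(fmt, bytes(size))[0]


int32_default = zero_bits_default("<i")    # 32-bit signed integer
float64_default = zero_bits_default("<d")  # 64-bit float
bool_default = zero_bits_default("?")      # boolean
bytes4_default = zero_bits_default("<4s")  # fixed-length 4-byte string
```

For a variable-length str or object there is no `size` to zero out, so the rule gives no answer without an extra convention such as "empty string".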

jbms · 2022-02-22T07:16:22Z

I agree --- fill_value should be required.

There are a few meanings you could potentially assign to an unspecified fill_value:

1. Same as specifying zero or other default value for the type, like empty string.
2. Read returns uninitialized memory --- this is a potential security vulnerability because it may leak information like encryption keys, credentials, etc.
3. Reading a missing chunk returns an error. However this leads to inconsistent behavior for partially-written chunks, since there is no way to indicate that it is partially written.

Since 1 is the only reasonable option I don't see any advantage in allowing an unspecified fill value.
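A minimal sketch of what the first option looks like in practice, assuming a plain dict as the chunk store and Python lists as chunks. The function names and the `"0.0"` key are illustrative only and do not come from any Zarr implementation.

```python
def read_chunk(store, key, chunk_len, fill_value):
    """Return the stored chunk, or a chunk of fill_value if it is missing."""
    stored = store.get(key)
    if stored is None:
        return [fill_value] * chunk_len
    return list(stored)


def write_partial(store, key, chunk_len, fill_value, start, values):
    """Write into part of a chunk; unwritten elements keep the fill_value."""
    chunk = read_chunk(store, key, chunk_len, fill_value)
    chunk[start:start + len(values)] = values
    store[key] = chunk


store = {}
fill_value = -1

# A chunk that was never written reads back as all fill_value.
missing = read_chunk(store, "0.0", 4, fill_value)

# A partially written chunk has a well-defined value everywhere.
write_partial(store, "0.0", 4, fill_value, 1, [7, 8])
partial = read_chunk(store, "0.0", 4, fill_value)
```

With an explicit fill_value, a missing chunk and the unwritten part of a partial chunk are treated the same way, so no extra "partially written" marker is needed.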

jakirkham · 2022-04-06T18:45:31Z

@WardF how does NetCDF handle this?

dopplershift · 2022-04-07T17:20:54Z

NetCDF has a default fill value for each of the types, at least for the classic (netCDF3) format. See the end of the grammar in the file format spec. It's unclear what the defaults are for the rest of the types added for the netCDF4 format, so I'll leave that to @WardF .

jakirkham · 2022-06-01T19:00:09Z

Related is how we handle encoding of some more atypical fill values. We discussed this briefly during the community meeting. Tried to summarize below (though please feel free to correct me).

Currently the fill value has a default value concept, which applies both during construction (where it can be None, but can be deduced to mean something) and in the metadata itself (where it can be null, but is determined at runtime). The questions are:

1. Should users be required to define a fill value at construction time?
2. Should the metadata allow fill values to be null?

For 1, this can either be required or we add some kind of lookup table for determining this. Sounds like NetCDF has the latter. That said, this is a question of the API and not the storage format itself, so perhaps this can be left up to the implementations at present.

For 2, it sounds like other storage implementations (like NetCDF) don't do this and instead always store the fill value. Perhaps a good first step to resolving this issue would be to do the same in v3.
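A sketch of what "always store the fill value" could mean for a reader: reject metadata where fill_value is absent or null. The function name, the exception class, and the bare-bones metadata documents are assumptions for illustration; this is not the v3 spec's validation logic.

```python
import json


class MissingFillValueError(ValueError):
    """Raised when array metadata does not define a fill_value (hypothetical)."""


def load_array_metadata(text):
    """Parse array metadata JSON and require a non-null fill_value."""
    meta = json.loads(text)
    # dict.get returns None both when the key is absent and when it is null.
    if meta.get("fill_value") is None:
        raise MissingFillValueError("fill_value must be defined and non-null")
    return meta


ok_meta = load_array_metadata('{"fill_value": 0}')

try:
    load_array_metadata('{"fill_value": null}')
    null_rejected = False
except MissingFillValueError:
    null_rejected = True

try:
    load_array_metadata('{}')
    absent_rejected = False
except MissingFillValueError:
    absent_rejected = True
```

Whether the API then fills in a default at construction time (question 1) stays an implementation choice; the stored metadata is unambiguous either way.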

Separately there was some discussion around how to handle encoding fill values of more unusual types. Issue ( zarr-developers/zarr-python#216 ) came up. In particular @d-v-b brought up base64 encoding ( zarr-developers/zarr-python#216 (comment) ). Also @manzt mentioned HTTP handles it this way, which we could borrow from.

jakirkham · 2022-06-01T19:07:49Z

Trying to address in PR ( #145 )

manzt · 2022-06-01T19:16:55Z

Also worth noting that https://github.com/fsspec/kerchunk just uses a base64: prefix for encoding binary data in JSON. It's somewhat buried in the spec...

https://github.com/fsspec/kerchunk/blob/f703e5e7af53b6eb08483c59c37efa66a1d3e8ac/docs/source/spec.rst

the str format of a reference value may be:

    a string starting "base64:", which will be decoded to binary
    any other string, interpreted as ascii data

Not sure where this convention comes from, but it's a simpler alternative to a full data-uri (data:application/octet-stream;base64,<data>).
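For reference, the convention quoted above can be sketched in a few lines of Python. The function names are made up here and this is not kerchunk's own code, just a reading of the two rules in the quoted spec text.

```python
import base64

PREFIX = "base64:"


def decode_reference(value):
    """Decode a string per the quoted rules: 'base64:' prefix means
    base64-encoded binary, anything else is ASCII data."""
    if value.startswith(PREFIX):
        return base64.b64decode(value[len(PREFIX):])
    return value.encode("ascii")


def encode_binary(data):
    """Encode arbitrary bytes as a JSON-safe string with the 'base64:' prefix."""
    return PREFIX + base64.b64encode(data).decode("ascii")


raw = b"\x00\xff\x10"
encoded = encode_binary(raw)
round_tripped = decode_reference(encoded)
plain = decode_reference("hello")
```

The same approach could carry a fill value that has no natural JSON form, though an ASCII value that itself begins with "base64:" would need to be base64-encoded to stay unambiguous.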

joshmoore · 2022-07-11T12:51:00Z

With an eye on #145 and the upcoming review of ZEP0001, is there a path forward here?

joshmoore · 2022-11-10T14:58:15Z

Just a heads up that there may be some interesting dangling conversations here.

joshmoore mentioned this issue Mar 15, 2022

Handle missing chunks freeman-lab/zarr-js#32

Closed

katamartin mentioned this issue Mar 22, 2022

Use fill_value to handle missing chunks freeman-lab/zarr-js#34

Merged

jakirkham mentioned this issue Jun 1, 2022

Require fill_value to be defined #145

Merged

joshmoore closed this as completed in #145 Nov 7, 2022

TomAugspurger mentioned this issue Sep 17, 2024

Is _FillValue really the same as zarr's fill_value? pydata/xarray#5475

Closed
