-
-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Object array fill value #216
Comments
Hard to see how to resolve this one without a spec change and/or breaking compatibility with previous data. May have to punt on this for 2.2. |
Hopefully to make this clearer to someone less familiar with Zarr (and also ensure I understand the issue). Please correct me if any of this is wrong. The A workaround would be to not use |
Yes pretty much. In fact it is currently impossible to create an object
array with a fill_value that is not JSON serializable, because an error
occurs when zarr tries to write .zarray. So the fill_value has to be JSON
serializable.
The fill_value also has to be compatible with the object_codec. I.e., the
object_codec has to be able to encode an array containing the given
fill_value.
So if creating an object array, current options are to use fill_value=None
which should be compatible with any object_codec, or fill_value=0 or
fill_value='' (or any unicode string value) may work depending on which
object_codec is used.
Using zarr.empty(...) to create object arrays is probably the most
convenient thing to do, which this is a short-hand for zarr.create(...,
fill_value=None).
…On Thu, Dec 7, 2017 at 4:56 PM, jakirkham ***@***.***> wrote:
Hopefully to make this clearer to someone less familiar with Zarr (and
also ensure I understand the issue). Please correct me if any of this is
wrong.
The fill_value is stored in .zarray, which is a JSON file. Hence the
fill_value has to be JSON serializable.
A workaround would be to not use fill_value for these cases and simply
fill all values in the array with the intended default value. This will be
less space efficient than having the intended fill_value set, but is an
option for now.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<https://github.com/alimanfoo/zarr/issues/216#issuecomment-350028861>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QvqN1y_tuEOf5pMZdoK_OZb25tU3ks5s-BjPgaJpZM4Q5PyW>
.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: alimanfoo@googlemail.com
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
Thanks for the follow-up. Do we guarantee that |
Yes currently fill_value is required in .zarray by the spec.
A better option would have been to treat the fill_value like a special
chunk with one element, then it would have passed through the same codecs
as the data. That would also allow us to get rid of a bunch of code that
currently handles mapping of fill values into JSON for different dtypes.
But that would require a spec change I think.
…On Thu, Dec 7, 2017 at 5:21 PM, jakirkham ***@***.***> wrote:
Thanks for the follow-up. Do we guarantee that fill_value is always
specified in .zarray? If not, what if we started storing the fill_value
to its own "chunk"?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<https://github.com/alimanfoo/zarr/issues/216#issuecomment-350036024>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QvJGPBvc8B3l2GQC5gQY0qKwztb8ks5s-B6ggaJpZM4Q5PyW>
.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: alimanfoo@googlemail.com
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
Yeah that sounds ideal. An alternative might be to base64 encode it at the end and shove it back in JSON. Can flesh this out a bit more if it is interesting. |
Yes I think it would be very reasonable to base64 encode the fill value and
put it in the JSON where it currently lives under the 'fill_value' key.
In fact for arrays with 'S' (fixed length byte string) dtype, I recently
made this change to the code and spec, i.e., the bytes fill value is base64
encoded. This didn't break compatibility because previously it was
impossible to store a bytes fill value because it would have raised a JSON
encoding error, and so we could be sure there would be no existing data out
there that had a fill value without base64 encoding.
For object arrays it's not quite so simple, because at least in theory it
would previously have been possible for someone to create an object array
with a unicode string fill value (e.g., '' or 'foobar' or 'whatever') or
any other valid JSON fill value (e.g., 0) and this would have been stored
in the JSON. So if we now switched behaviour and started base64 encoding
fill values for object arrays, this would break compatibility with previous
data, because on read Zarr would wrongly try to base64 decode their fill
values. Now it may be that actually there is no data out there like this,
but I feel like we need to assume there is and manage any changes that
could break compatibility with previous data. By "manage" I think I mean
incrementing the storage spec version number, and providing a migration
path and/or compatibility layer so that things don't break or behave
unexpectedly. I'm keen to ship 2.2 and so want to avoid this for the moment.
That said there may be a clever way to add support for encoding object fill
values without breaking compatibility, but I can't see it.
…On Thu, Dec 7, 2017 at 5:35 PM, jakirkham ***@***.***> wrote:
Yeah that sounds ideal.
An alternative might be to base64 encode it at the end and shove it back
in JSON. Can flesh this out a bit more if it is interesting.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<https://github.com/alimanfoo/zarr/issues/216#issuecomment-350039795>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QuCAn8fXtsEg_qAA1MCeGek39mUIks5s-CHdgaJpZM4Q5PyW>
.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: alimanfoo@googlemail.com
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
Guess I was imagining the "fill_value": {
"base64": "..."
} Also eager to see 2.2 ship reasonably soon. There's a lot of great functionality already built up in |
To propose another option here, we could use jsonpickle. This can handle a wide variety of cases for us here. Though there are the usual concerns with pickling data generally. I'm not sure how we avoid those. |
Good to know about jsonpickle, though we should probably try to find a
cross-platform solution. E.g., it might be safest in the general case to
encode the fill value using the same encoding pipeline that's used for
chunks. That would make for non-human-readable values, but maybe that is
something we have to live with.
For the particular case of string arrays, where it might be convenient to
have a human-readable value in the array metadata, maybe we could also
allow a JSON string as the fill value?
…On Tue, 4 Dec 2018 at 05:36, jakirkham ***@***.***> wrote:
To propose another option here, we could use jsonpickle
<http://jsonpickle.github.io/>. This can handle a wide variety of cases
for us here. Though there are the usual concerns with pickling data
generally. I'm not sure how we avoid those.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#216 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QrX2aecVX3X4RhGrb0wFuImmZb2bks5u1gnNgaJpZM4Q5PyW>
.
--
Please feel free to resend your email and/or contact me by other means if
you need an urgent reply.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health
Big Data Institute
Li Ka Shing Centre for Health Information and Discovery
Old Road Campus
Headington
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596 or +44 (0)7866 541624
Email: alimanfoo@googlemail.com
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: @alimanfoo <https://twitter.com/alimanfoo>
|
If a fill value is provided for an array with object dtype, and the fill value cannot be JSON encoded, an error will occur:
The text was updated successfully, but these errors were encountered: