Support for inf, nan, binary data in attributes #141

jbms · 2022-05-04T19:33:38Z

In both v2 and v3 specs, attributes are stored as JSON which means they are subject to the limitations of JSON:

Infinity and NaN cannot be represented, although currently zarr-python supports these via a non-standard extension to JSON format (directly encoding them as unquoted Infinity or NaN within the JSON output). This support may be accidental rather than intentional, though.
Byte strings (as opposed to Unicode strings) cannot be represented.

The v3 spec suggests the possibility of using a binary encoding like CBOR for metadata which would solve these issues. However my understanding is that CBOR would not be the default.

The text was updated successfully, but these errors were encountered:

jakirkham · 2022-06-15T18:53:44Z

Also related ( #81 )

Somewhat related more general Zarr Python discussion ( zarr-developers/zarr-python#216 )

jakirkham · 2022-06-15T19:04:43Z

Here's the current code in Zarr Python for handling these values. One thought might be to graduate this to inclusion in the spec as-is. Idk if there are other approaches worth considering.

For more complex objects Base64 encoding has come up as an idea.

jbms · 2022-06-15T19:17:27Z

One difficulty with the current zarr-python approach is that it means the "JSON" metadata is not actually spec-compliant JSON and cannot be parsed by the JavaScript JSON.parse function or by the popular C++ nlohmann json library. However, other libraries like RapidJSON can handle the Infinity/NaN extension.

For binary data it is certainly possible to encode it as a string containing base64 encoding, etc. However, there is a similar problem as with encoding Infinity/NaN as JSON strings --- you then lose the type information, so even if you automatically convert such values when encoding as json, there won't be a way to decode them automatically since they aren't distinguishable from ordinary strings.

In general I think it would be valuable for zarr v3 attributes to be able to "round trip" any supported zarr v3 data type, since storing a data value as an attribute is a very common use case and it would be nice for every individual use case not to have to invent its own convention.

jakirkham · 2022-06-15T19:24:49Z

Being able to parse with normal JSON libraries makes sense.

Well the type information of the fill_value should match the data_type of the array, right?

Or is the concern the fill_value wouldn't be understood to be base64 encoded? In comment ( zarr-developers/zarr-python#216 (comment) ), the syntax below was proposed. Maybe there are alternative ways to capture this?

    "fill_value": {
        "base64": "..."
    }

Are there other concerns missed here?

jbms · 2022-06-15T19:31:35Z

For fill_value the data type is already known so there isn't an issue there. zarr-python for v2 already uses a different encoding for fill_value ---- infinity is encoded as "Infinity" rather than Infinity.

I think it would be nice if fill_value can be expressed in a readable way but that isn't critical.

I intended this issue to be about user-defined attributes, though. For example, a common use case might be to store the min/max range of an array, or the bucket partitions for a histogram as attributes.

jakirkham · 2022-06-16T06:08:16Z

Ah sorry thought from our conversation earlier this was fill_value focused.

That being said, the issue is more or less the same in either case. We are storing values in JSON that may not representable using standard JSON types.

In these cases we do know how we would encode the data to write in a binary file. That same strategy could be leveraged to binary encode those values as if they were going in a chunk then convert to base64 before storing in JSON as string (possibly with an object indicating the encoding like above).

Understand the concern about human readability. With things like NaN and Infinity, we could quote them if that is sufficient. Zarr Python should be doing this currently. If it is not, that is a bug. We could also attempt JSON encoding these values in a readable way if possible. Only falling back to base64 if that is not an option.

jbms · 2022-06-16T17:37:19Z

The difference between fill_value and user-defined attributes is that for fill_value, the data type is specified elsewhere in the metadata and can be used to decode whatever representation is used.

For example, we probably want to distinguish between the user storing an attribute "a" with a value of the string "Infinity" and the user storing an attribute "a" with a value of the floating-point number Infinity. If both are encoded as {"a": "Infinity"} then that is not possible.

For user-defined attributes, if we want to support arbitrary data values, then we need to decide on the data model and there needs to be a way to unambiguously encode them.

For example, we might choose a data model of JSON augmented with support for +/-Infinity, NaN, and byte strings.

If we want to stick with encoding attributes as valid JSON, we could define an escaping mechanism so that the following Python attribute dictionary:

dict(a="Infinity", b=b"some binary representation", c=float("inf")}

is encoded as the following JSON metadata representation:

{..., "attribute_escape_key": "xxx", "attributes": {"a": "Infinity", "b": {"xxx": "binary", "value": "base64 representation..."}, "c": {"xxx": "Infinity"}}

When encoding, the implementation must choose an escape key "xxx" that does not occur as an actual key in any attribute value that is a dictionary.

joshmoore · 2022-06-22T12:46:11Z

I generally like the idea of encoding things as objects if necessary. If possible, I'd prefer to have one clear way of identifying them, e.g. {"@type": "...", ...}.

rabernat · 2022-06-22T13:44:30Z

Would be great to get the unidata perspective here. NetCDF supports typed attributes, so they have had to confront this already with nczarr. cc @WardF and @DennisHeimbigner.

DennisHeimbigner · 2022-06-22T19:15:24Z

First, note that netcdf-c does not allow multiple occurrences of an attribute
with the same name, even if the types are different.
I am not sure if relevant, but NCZarr adds an extra attribute named _nczarr_attr
whose value is a JSON dictionary that contains extra information about the other attributes defined by .zattr.
Specifically, it contains a key called "types" maps each attribute name to
a type such as "<i4".

joshmoore · 2022-09-07T19:07:29Z

Good discussion this evening around this topic during the community meeting. Present company seems to be leaning towards a JSON-object-style encoding of one form or another. See the notes for more.

jstriebel · 2022-11-23T13:55:24Z

I'm proposing to go forward with JSON encoding as the default for v3 for now, and explicitly allowing arbitrary JSON entries for a given key in the user-attributes, which itself must be a string. This is effectively the same as the spec already defines since JSON object literals must have strings as keys. See #173.

This would currently not allow good representations of inf/nan or bytestrings in user attributes. (As mentioned above, there's special handling for fill-values). If somebody would create a PR to add this to the current spec I think it would be valid to include it.

Alternative encodings to JSON may be added by extensions later, especially if user attributes may be stored in a separate document (see #72).

Just as a side-note: yaml would both support inf/nan/binary, as well as custom type definitions. Might be a useful extension, and IMO more human-readable than CBOR

jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022

This was referenced Nov 16, 2022

stdlib json not up to json spec. #81

Open

Non-JSON metadata and attributes #37

Open

jstriebel mentioned this issue Nov 23, 2022

clarify attributes field in metadata #173

Merged

jstriebel added the protocol-extension Protocol extension related issue label Nov 23, 2022

This was referenced Nov 23, 2022

Apply more ZEP 1 feedback #171

Merged

v3 spec: Consider removing metadata encoding #174

Closed

abkfenris mentioned this issue Feb 7, 2023

Allow NaN attributes in datasets xpublish-community/xpublish#152

Merged

rabernat mentioned this issue Apr 11, 2023

Extension for attribute datatypes? #229

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for inf, nan, binary data in attributes #141

Support for inf, nan, binary data in attributes #141

jbms commented May 4, 2022

jakirkham commented Jun 15, 2022 •

edited

Loading

jakirkham commented Jun 15, 2022

jbms commented Jun 15, 2022

jakirkham commented Jun 15, 2022

jbms commented Jun 15, 2022

jakirkham commented Jun 16, 2022

jbms commented Jun 16, 2022

joshmoore commented Jun 22, 2022

rabernat commented Jun 22, 2022

DennisHeimbigner commented Jun 22, 2022

joshmoore commented Sep 7, 2022

jstriebel commented Nov 23, 2022 •

edited

Loading

Support for inf, nan, binary data in attributes #141

Support for inf, nan, binary data in attributes #141

Comments

jbms commented May 4, 2022

jakirkham commented Jun 15, 2022 • edited Loading

jakirkham commented Jun 15, 2022

jbms commented Jun 15, 2022

jakirkham commented Jun 15, 2022

jbms commented Jun 15, 2022

jakirkham commented Jun 16, 2022

jbms commented Jun 16, 2022

joshmoore commented Jun 22, 2022

rabernat commented Jun 22, 2022

DennisHeimbigner commented Jun 22, 2022

joshmoore commented Sep 7, 2022

jstriebel commented Nov 23, 2022 • edited Loading

jakirkham commented Jun 15, 2022 •

edited

Loading

jstriebel commented Nov 23, 2022 •

edited

Loading