-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for inf, nan, binary data in attributes #141
Comments
Also related ( #81 ) Somewhat related more general Zarr Python discussion ( zarr-developers/zarr-python#216 ) |
Here's the current code in Zarr Python for handling these values. One thought might be to graduate this to inclusion in the spec as-is. Idk if there are other approaches worth considering. For more complex objects Base64 encoding has come up as an idea. |
One difficulty with the current zarr-python approach is that it means the "JSON" metadata is not actually spec-compliant JSON and cannot be parsed by the JavaScript For binary data it is certainly possible to encode it as a string containing base64 encoding, etc. However, there is a similar problem as with encoding Infinity/NaN as JSON strings --- you then lose the type information, so even if you automatically convert such values when encoding as json, there won't be a way to decode them automatically since they aren't distinguishable from ordinary strings. In general I think it would be valuable for zarr v3 attributes to be able to "round trip" any supported zarr v3 data type, since storing a data value as an attribute is a very common use case and it would be nice for every individual use case not to have to invent its own convention. |
Being able to parse with normal JSON libraries makes sense. Well the type information of the Or is the concern the "fill_value": {
"base64": "..."
} Are there other concerns missed here? |
For I think it would be nice if I intended this issue to be about user-defined attributes, though. For example, a common use case might be to store the min/max range of an array, or the bucket partitions for a histogram as attributes. |
Ah sorry thought from our conversation earlier this was That being said, the issue is more or less the same in either case. We are storing values in JSON that may not representable using standard JSON types. In these cases we do know how we would encode the data to write in a binary file. That same strategy could be leveraged to binary encode those values as if they were going in a chunk then convert to Understand the concern about human readability. With things like NaN and Infinity, we could quote them if that is sufficient. Zarr Python should be doing this currently. If it is not, that is a bug. We could also attempt JSON encoding these values in a readable way if possible. Only falling back to |
The difference between For example, we probably want to distinguish between the user storing an attribute For user-defined attributes, if we want to support arbitrary data values, then we need to decide on the data model and there needs to be a way to unambiguously encode them. For example, we might choose a data model of JSON augmented with support for +/-Infinity, NaN, and byte strings. If we want to stick with encoding attributes as valid JSON, we could define an escaping mechanism so that the following Python attribute dictionary: dict(a="Infinity", b=b"some binary representation", c=float("inf")} is encoded as the following JSON metadata representation: {..., "attribute_escape_key": "xxx", "attributes": {"a": "Infinity", "b": {"xxx": "binary", "value": "base64 representation..."}, "c": {"xxx": "Infinity"}} When encoding, the implementation must choose an escape key "xxx" that does not occur as an actual key in any attribute value that is a dictionary. |
I generally like the idea of encoding things as objects if necessary. If possible, I'd prefer to have one clear way of identifying them, e.g. |
Would be great to get the unidata perspective here. NetCDF supports typed attributes, so they have had to confront this already with nczarr. cc @WardF and @DennisHeimbigner. |
First, note that netcdf-c does not allow multiple occurrences of an attribute |
Good discussion this evening around this topic during the community meeting. Present company seems to be leaning towards a JSON-object-style encoding of one form or another. See the notes for more. |
I'm proposing to go forward with JSON encoding as the default for v3 for now, and explicitly allowing arbitrary JSON entries for a given key in the user-attributes, which itself must be a string. This is effectively the same as the spec already defines since JSON object literals must have strings as keys. See #173. This would currently not allow good representations of inf/nan or bytestrings in user attributes. (As mentioned above, there's special handling for fill-values). If somebody would create a PR to add this to the current spec I think it would be valid to include it. Alternative encodings to JSON may be added by extensions later, especially if user attributes may be stored in a separate document (see #72). Just as a side-note: yaml would both support inf/nan/binary, as well as custom type definitions. Might be a useful extension, and IMO more human-readable than CBOR |
In both v2 and v3 specs, attributes are stored as JSON which means they are subject to the limitations of JSON:
Infinity
orNaN
within the JSON output). This support may be accidental rather than intentional, though.The v3 spec suggests the possibility of using a binary encoding like CBOR for metadata which would solve these issues. However my understanding is that CBOR would not be the default.
The text was updated successfully, but these errors were encountered: