Order of fields in asset definition file is significant for hashing #15

Open
jeffomatic opened this issue May 9, 2015 · 4 comments

@jeffomatic

This is a subtle point, but if we're going to use hashing for verifying the asset definition file, then it's probably best that we define a strict ordering and formatting of the JSON so that idiosyncrasies in JSON serialization and/or rendering don't affect the value of the hash.

Given that the asset definition file is JSON, it's likely to be stored as such, and possibly deserialized and re-serialized before being rendered, or after being fetched for verification. JSON as a format doesn't care about indentation, newlines, or field ordering, but the hash operation does care about those things.
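
To make that concrete, here is a minimal illustration (the field names and the use of SHA-256 are invented for the example, not taken from the spec): two byte strings that parse to the same JSON object but hash differently.

```python
import hashlib
import json

# Two documents that any JSON parser treats as equal...
a = b'{"name":"Acme Share","divisibility":2}'
b = b'{\n  "divisibility": 2,\n  "name": "Acme Share"\n}'
assert json.loads(a) == json.loads(b)

# ...but whose raw bytes hash to different digests.
print(hashlib.sha256(a).hexdigest())
print(hashlib.sha256(b).hexdigest())
```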

Off the top of my head, there are a couple of options:

  1. Treat the asset definition file as a raw string. Encourage providers (issuers) to avoid deserializing the string after extracting it from the data store and encourage verifiers to read the raw HTTP body for hash generation.
  2. Extend the standard to specify pre-processing before calculating the hash. For example, to verify an asset definition file, the verifier would deserialize the contents, re-render them as a string with sorted keys (or some ordering defined in the standard), strip out unnecessary whitespace (most JSON serializers will do this by default), maybe treat blank strings or null values as if the key was removed, and then take the hash.
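
A rough sketch of what option 2 might look like, in Python. This is a hypothetical procedure, not anything from the standard: it prunes only the top level, and a real spec would also have to pin down nested objects, number formatting, and Unicode normalization.

```python
import hashlib
import json

def canonicalize_then_hash(raw_definition: bytes) -> str:
    """Hypothetical option-2 verifier: parse, drop null/blank fields,
    re-render with sorted keys and no extra whitespace, then hash."""
    data = json.loads(raw_definition)
    # Treat blank strings and nulls as if the key were removed
    # (top level only; a real spec would define this recursively).
    pruned = {k: v for k, v in data.items() if v is not None and v != ""}
    canonical = json.dumps(pruned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```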
@dthorpe

dthorpe commented May 9, 2015

Good points.

@oleganza
Contributor

@jeffomatic what is the practical use case where checking/enforcing some sort of canonical encoding is important?

If I understand correctly, it's not only the ordering of fields that matters: the Unicode encoding must also have some canonical form. As of now, OA clients just need to preserve the original binary file, so no one cares exactly how it is formatted (we don't have malleability issues like those of Bitcoin transactions). This proposal would require every client to perform additional checks, or to sort and canonically encode the data, to be sure it matches the hash. Not every language/platform has appropriate Unicode equivalence conversion functions, or a correct implementation of them. Sorting the JSON object keys isn't trivial either; I haven't yet seen a library that exposes an API for that.
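
To illustrate the Unicode point with a sketch (Python happens to ship both key sorting and normalization in its standard library, but that is not universal): two strings that render identically can still hash differently.

```python
import hashlib
import unicodedata

# "café" with a precomposed é (NFC) vs. "e" plus a combining accent (NFD).
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

print(nfc == nfd)  # False: same rendered text, different code points
print(hashlib.sha256(nfc.encode("utf-8")).hexdigest())
print(hashlib.sha256(nfd.encode("utf-8")).hexdigest())  # different digest
```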

In addition to the above, a good chunk of existing, highly important assets use pretty-printed JSON formatting, which probably won't be compatible with any kind of strict encoding rules.

@jeffomatic
Author

@oleganza Yeah, good observations!

Even if canonicalizing the JSON is impossible, I think it would be nice to get the following updates to the documentation:

  1. The docs shouldn't claim that the ordering of fields is insignificant, at least in the case where you're re-rendering the definition for hash verification. Not only is the ordering significant, but so are whitespace, floating-point serialization, Unicode, and probably some other things I'm forgetting. Those all have to be exactly the same as when you first generated that hash.
  2. The docs should probably remind both issuers and verifiers that if they care about hash verification, they need to think of the asset definition as a raw byte string, generated non-canonically at a specific moment in time, rather than as pure structured data.

The rest of this message might be purely academic. Regarding your concerns about practicality, I think the issues you point out with canonicalizing JSON are very real, and could be deal-breakers. But if we magically had some kind of canonical (if imperfect) serialization method, I'm tempted to think that a canonicalize_then_hash procedure would be easier to specify, and probably less error-prone, than asking people to treat the definition as raw bytes. You could package the procedure as a blackbox hashing function that takes the asset definition in any arbitrary format, and call it any time you wanted to know the hash.

Without such a procedure, we put a burden on every server app and every client app to persist, render, and retrieve the asset definition in a specific way. In particular, regardless of how servers are storing the asset definition (e.g. as multiple columns in a SQL table, or a MongoDB document, etc.), you'd want to make sure they're also persisting the original JSON message used to generate the hash digest at the time of issuance, AND make sure they are delivering the original JSON message instead of rendering the structured data through an arbitrary pretty-printer. Likewise, clients would need to keep track of the raw HTTP body of the asset definition before proceeding with any kind of deserialization, which means that dynamic-language applications need to be careful about using any HTTP request library that performs automatic deserialization (which is a pretty widespread practice).
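
A minimal sketch of the client-side discipline described above, assuming SHA-256 as the digest for illustration; the point is that the hash is computed over the raw response body, before any deserialization happens.

```python
import hashlib
import json
import urllib.request

def fetch_and_hash(url: str):
    """Hash the asset definition exactly as served, then parse it."""
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()                  # raw bytes of the HTTP body
    digest = hashlib.sha256(raw).hexdigest()
    data = json.loads(raw)                 # parsing happens after hashing
    return digest, data
```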

I'm not sure whether this is such a huge deal, so to reiterate: I think even some adjustments to the documentation would be beneficial.

@NicolasDorier
Contributor

The most bug-free way to make this work is to hash the bytes provided by the asset definition URL.
Canonicalization is a minefield that would eventually lead to a lot of problems for every implementer.

I think it is correct to say that the field ordering is not important, and to say explicitly that the hash refers to the raw bytes provided by the asset definition URL.
