
Introduce module in trie-db for generating/verifying trie proofs. #45

Merged: 2 commits into master, Jan 8, 2020

Conversation

jimpo (Contributor) commented Dec 10, 2019

Generation and verification of compact proofs for Merkle-Patricia tries. These have the benefit over the compact trie encoding of omitting the values from the proof data. Apart from being a more standard logical separation for a proof interface, this can result in bandwidth savings if the verifier already has the values. For example, they may request a proof that the value at a key is the same as it is under another trie root.

Using this module, it is possible to generate a logarithmic-space proof of inclusion or non-inclusion of certain key-value pairs in a trie with a known root. The proof contains enough information for the verifier to reconstruct the subset of trie nodes required to look up the keys. The trie nodes are not included in their entirety, as data that the verifier can compute for themselves is omitted. In particular, the values of included keys and the hashes of other trie nodes in the proof are omitted.

The proof is a sequence of the subset of nodes in the trie traversed while performing lookups on all keys. The trie nodes are listed in pre-order traversal order with some values and internal hashes omitted. In particular, values on leaf nodes, child references on extension nodes, values on branch nodes corresponding to a key in the statement, and child references on branch nodes corresponding to another node in the proof are all omitted. The proof is verified by iteratively reconstructing the trie nodes using the values proven as part of the statement and the hashes of other reconstructed nodes. Since the nodes in the proof are arranged in pre-order traversal order, the reconstruction can be done efficiently using a stack.
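The stack-based reconstruction described above can be illustrated with a small self-contained sketch. This is illustrative only, not the trie-db API: the node format, the stand-in "hash", and the `children_in_proof` counter are all assumptions for the example.

```rust
// Illustrative sketch only (not the trie-db API). Nodes arrive in
// pre-order; each records how many of its children are themselves in
// the proof. Omitted child hashes are recomputed from those children,
// so a node is completed (and hashed) once all its in-proof children are.

#[derive(Clone)]
struct ProofNode {
    payload: Vec<u8>,         // node data with child hashes omitted
    children_in_proof: usize, // how many children follow in the proof
}

// Stand-in "hash": payload concatenated with child hashes. A real
// implementation would hash with e.g. blake2 or keccak.
fn node_hash(payload: &[u8], child_hashes: &[Vec<u8>]) -> Vec<u8> {
    let mut out = payload.to_vec();
    for h in child_hashes {
        out.extend_from_slice(h);
    }
    out
}

// Reconstruct the root hash from a pre-order node sequence using a stack.
fn reconstruct_root(nodes: &[ProofNode]) -> Option<Vec<u8>> {
    let mut stack: Vec<(ProofNode, Vec<Vec<u8>>)> = Vec::new();
    for node in nodes {
        stack.push((node.clone(), Vec::new()));
        // Pop every node whose children are all resolved and hand its
        // hash to its parent (or return it if it is the root).
        while let Some((top, hashes)) = stack.last() {
            if hashes.len() < top.children_in_proof {
                break;
            }
            let (top, hashes) = stack.pop().unwrap();
            let hash = node_hash(&top.payload, &hashes);
            match stack.last_mut() {
                Some((_, parent_hashes)) => parent_hashes.push(hash),
                None => return Some(hash), // root reconstructed
            }
        }
    }
    None // incomplete proof: some node is still waiting for children
}
```

A verifier would compare the returned root hash against the known trie root.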

Fixes paritytech/substrate#3782.

@rphmeier rphmeier merged commit b283aad into master Jan 8, 2020
@rphmeier rphmeier deleted the jimpo/trie-proof branch January 8, 2020 15:50
hujw77 commented Nov 4, 2020

Is there a plan to expose this API in substrate-state-machine crate?

cheme (Contributor) commented Nov 4, 2020

There was paritytech/substrate#4938, but I am not sure when I will resume work on it (I recently updated the branch, though I am also considering a different approach to integrating it).

hujw77 commented Nov 4, 2020

I wrote a Solidity version of the verification, and I am working on Substrate's storage proofs. Could you give me the specification of Substrate's storage? Thanks!

cheme (Contributor) commented Nov 4, 2020

https://github.com/w3f/polkadot-spec/blob/master/host-spec/c02-state.tm (releases: https://github.com/w3f/polkadot-spec/releases) gives you the basis for the trie state encoding. (Reading the gossamer or Kagome implementations can also be interesting; a different language and code base can be easier for some.)
Then the current (non-compact) proofs are just a SCALE-encoded Vec of encoded nodes, as produced by the derived encoding of https://github.com/paritytech/substrate/blob/f7a8b1001d1819b7a887ae36d6beae84617499d8/primitives/trie/src/storage_proof.rs#L29 .
The nodes can be stored in any order.

For a compact proof, it is the same thing except that the nodes need to be ordered in a specific way, and redundant encoded hashes are replaced by the encoding of a zero-length inline node (Some(empty_child)).
Redundant encoded hashes are those that can be recalculated when rebuilding the trie (any child hash that points to a node that is already in the proof).

There is also a variant of the compact proof for checking whether a set of storage values changed, where we additionally omit writing the values in the encoded nodes (basically the code in https://github.com/paritytech/trie/tree/master/trie-db/src/proof).

But mostly I don't think there is a written spec of these compact proofs apart from the code documentation.
Also, I would probably wait a bit before re-implementing them (there is no guarantee they will make it into Substrate, even though I think it would be very good).
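The substitution described above can be shown with a toy sketch. This is illustrative only: the 0x00 marker stands in for the real zero-length inline node encoding, which is codec-specific, and the helper name is hypothetical.

```rust
// Illustrative sketch only, not the real codec. When writing a branch
// node into a compact proof, a child reference whose node is itself
// included in the proof is replaced by a stand-in "empty inline node"
// marker, because the verifier recomputes that hash while rebuilding
// the trie; other child references keep their hash verbatim.

const EMPTY_INLINE: u8 = 0x00; // stand-in for the zero-length inline node

// Each entry is (child hash, is the child node included in the proof?).
fn encode_child_refs(refs: &[(Vec<u8>, bool)]) -> Vec<Vec<u8>> {
    refs.iter()
        .map(|(hash, in_proof)| {
            if *in_proof {
                vec![EMPTY_INLINE] // hash omitted: recomputed on rebuild
            } else {
                hash.clone() // hash kept: child is not in the proof
            }
        })
        .collect()
}
```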

hujw77 commented Nov 4, 2020

Trie node encoding specification

Note that for the following definitions, | denotes concatenation

Branch encoding:
NodeHeader | Extra partial key length | Partial Key | Value
NodeHeader is a byte such that:
most significant two bits of NodeHeader: 10 if branch w/o value, 11 if branch w/ value
least significant six bits of NodeHeader: if len(key) > 62, 0x3f, otherwise len(key)
Extra partial key length is included if len(key) > 62 and consists of the remaining key length
Partial Key is the branch's key
Value is: Children Bitmap | SCALE Branch node Value | Hash(Enc(Child[i_1])) | Hash(Enc(Child[i_2])) | ... | Hash(Enc(Child[i_n]))

Leaf encoding:
NodeHeader | Extra partial key length | Partial Key | Value
NodeHeader is a byte such that:
most significant two bits of NodeHeader: 01
least significant six bits of NodeHeader: if len(key) > 62, 0x3f, otherwise len(key)
Extra partial key length is included if len(key) > 62 and consists of the remaining key length
Partial Key is the leaf's key
Value is the leaf's SCALE encoded value
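One plausible reading of the header encoding quoted above can be sketched as follows. This is a hypothetical helper, not code from gossamer or Substrate, and the multi-byte length continuation (a run of 255 bytes followed by a final byte below 255) is an assumption.

```rust
// Hypothetical sketch of the node header described above. `kind_bits`
// is the two most significant bits: 0b01 leaf, 0b10 branch without
// value, 0b11 branch with value. Returns the header byte followed by
// any extra partial-key-length bytes.
fn node_header(kind_bits: u8, partial_key_len: usize) -> Vec<u8> {
    let mut out = Vec::new();
    if partial_key_len < 63 {
        // Length fits in the six least significant bits.
        out.push((kind_bits << 6) | partial_key_len as u8);
    } else {
        // Six bits saturate at 0x3f; the remaining length follows as
        // bytes of 255 terminated by a final byte below 255 (assumed).
        out.push((kind_bits << 6) | 0x3f);
        let mut rem = partial_key_len - 63;
        while rem >= 255 {
            out.push(255);
            rem -= 255;
        }
        out.push(rem as u8);
    }
    out
}
```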

This is the trie node specification; I found it in gossamer and https://github.com/w3f/polkadot-spec/blob/master/host-spec/c02-state.tm
I have some questions:

  1. Are the current proof (non-compact) and the compact proof the same encoding?
  2. I can understand the logic of the compact proof, but I still do not understand the logic or algorithm of the current proof. Can you show me code or something I can dive into for the current proof?
    Thank you again!

cheme (Contributor) commented Nov 4, 2020

  1. Are the current proof (non-compact) and the compact proof the same encoding?

At the trie node level it is the same (the compact form uses the 'empty inline node', which cannot otherwise occur, as a way to add information without changing the encoding).

Then at the proof level:
Non-compact is a set of nodes.
Compact is an ordered set of nodes.
But in the end both are represented as a list of encoded nodes (Vec<Vec<u8>>).
In the compact case, the order of the nodes defines the structure of the trie, and an 'empty inline node' indicates that the child is in the proof and its hash should be calculated from it.

One other small difference is that a compact proof cannot contain nodes from different tries, so when child tries are used by the proof, compact needs a different encoding where the proof is split by trie (in the PR I did, this was named Full, while Flat was a single-trie proof).

  2. I can understand the logic of the compact proof, but I still do not understand the logic or algorithm of the current proof. Can you show me code or something I can dive into for the current proof?

The set of nodes is put in a hash map keyed by the hash of each encoded node.
Then to verify the proof, we just run the process being checked (it can be any runtime call or key-value access) over a trie that uses this hash map as its encoded-node backend.

So if we check a single key access, we start from the root: fetch the encoded root node from the hash map, decode it, get the child hash for this key, fetch the child's encoded node, decode it, and repeat
until either
we get our value,
or we fail to fetch a child (incomplete proof), or we reach a branch where the child hash is not defined (missing value).
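The walk just described can be sketched with a toy node type. This is illustrative only: real Substrate nodes are SCALE-encoded and hashed with a real hash function, and keys are nibble paths with partial keys; here a "hash" is just an id and the types are made up for the example.

```rust
use std::collections::HashMap;

type Hash = u64; // stand-in for a real node hash

// Toy node: an optional value plus (nibble, child hash) edges.
struct Node {
    value: Option<Vec<u8>>,
    children: Vec<(u8, Hash)>,
}

#[derive(Debug, PartialEq)]
enum LookupError {
    IncompleteProof, // a needed node is missing from the proof
    MissingKey,      // the trie provably has no entry for this key
}

// Walk from the root, fetching each node from the proof map by hash.
fn lookup(
    proof: &HashMap<Hash, Node>,
    root: Hash,
    key: &[u8],
) -> Result<Option<Vec<u8>>, LookupError> {
    let mut hash = root;
    for nibble in key {
        let node = proof.get(&hash).ok_or(LookupError::IncompleteProof)?;
        match node.children.iter().find(|(n, _)| n == nibble) {
            Some((_, child)) => hash = *child,
            None => return Err(LookupError::MissingKey),
        }
    }
    let node = proof.get(&hash).ok_or(LookupError::IncompleteProof)?;
    Ok(node.value.clone())
}
```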

In https://github.com/paritytech/substrate/blob/f7a8b1001d1819b7a887ae36d6beae84617499d8/primitives/state-machine/src/lib.rs#L816
'create_proof_check_backend' just instantiates a trie backend that, instead of using RocksDB for node storage, uses a hash map built from the encoded nodes in the proof (https://github.com/paritytech/substrate/blob/833fe6259115625f61347c8413bab29fded31210/primitives/state-machine/src/proving_backend.rs#L291).
Then 'read_child_proof_check_on_proving_backend' just runs the 'get' ('storage') operation for every key.

hujw77 commented Nov 5, 2020

I implemented a Solidity version of both verifications. Advice welcome!
https://github.com/HuJingwei/merkle-proof
cheme (Contributor) commented Nov 5, 2020

Amazing 👍
It has been a long time since I wrote or read Solidity; I am wondering what the threshold is (in number of node accesses) at which using a mapping, instead of iterating over https://github.com/HuJingwei/merkle-proof/blob/15d12e9b708f087fccecd1efbf47d07f9fcea7a9/src/SimpleMerkleProof.sol#L54 in getNodeData, becomes cheaper (maybe it never is; I do not really remember the costs).

hujw77 commented Nov 5, 2020

That is good advice, but mappings can only be stored in the storage data location. I will test whether a mapping in storage may be cheaper.

cheme (Contributor) commented Nov 5, 2020

mappings can only be stored in the storage

Oh, that sounds bad; we would probably need to keep the array of hashes for the gas refund, but I still remember storage ops being much more expensive. Not sure testing is worth it :)
Sorting the array for a faster search would probably be another way to optimize it, if needed at some point.

Linked issue: More efficient storage proofs when verifier already knows values