MerkleDB -- document codec #1670

Merged
merged 14 commits into from
Jul 7, 2023
175 changes: 175 additions & 0 deletions x/merkledb/README.md
@@ -33,6 +33,181 @@ To reduce the depth of nodes in the trie, a `Merkle Node` utilizes path compress
+-----------------------------------+
```

## Serialization

### Node

Nodes are persisted in an underlying database. In order to persist nodes, we must first serialize them.
Serialization is done by the `encoder` interface defined in `codec.go`.

The node serialization format is as follows:

```
+----------------------------------------------------+
| Value existence flag (1 byte) |
+----------------------------------------------------+
| Value length (varint) (optional) |
+----------------------------------------------------+
| Value (variable length bytes) (optional) |
+----------------------------------------------------+
| Number of children (varint) |
+----------------------------------------------------+
| Child index (varint) |
+----------------------------------------------------+
| Child compressed path length (varint) |
+----------------------------------------------------+
| Child compressed path (variable length bytes) |
+----------------------------------------------------+
| Child ID (32 bytes) |
+----------------------------------------------------+
| Child index (varint) |
+----------------------------------------------------+
| Child compressed path length (varint) |
+----------------------------------------------------+
| Child compressed path (variable length bytes) |
+----------------------------------------------------+
| Child ID (32 bytes) |
+----------------------------------------------------+
|... |
+----------------------------------------------------+
```

Where:
* `Value existence flag` is `1` if this node has a value, otherwise `0`.
* `Value length` is the length of the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise it is not serialized.
* `Value` is the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise it is not serialized.
* `Number of children` is the number of children this node has.
* `Child index` is the index of a child node within the list of the node's children.
* `Child compressed path length` is the length of the child node's compressed path, in nibbles.
* `Child compressed path` is the child node's compressed path.
* `Child ID` is the child node's ID.

Each child of the node contributes one block of the following form:

```
+----------------------------------------------------+
| Child index (varint) |
+----------------------------------------------------+
| Child compressed path length (varint) |
+----------------------------------------------------+
| Child compressed path (variable length bytes) |
+----------------------------------------------------+
| Child ID (32 bytes) |
+----------------------------------------------------+
```

Note that the `Child index` values are not necessarily sequential. For example, if a node has 3 children, the `Child index` values could be `0`, `2`, and `15`.
However, the `Child index` values must be strictly increasing. For example, they cannot be `0`, `0`, `1` (a repeated index) or `1`, `0` (a decreasing index).

Since a node can have up to 16 children, there can be up to 16 such blocks of children data.

#### Example

Let's take a look at an example node.

Its byte representation (in hex) is: `0x01020204000210579EB3718A7E437D2DDCE931AC7CC05A0BC695A9C2084F5DF12FB96AD0FA32660E06FFF09845893C4F9D92C4E097FCF2589BC9D6882B1F18D1C2FC91D7DF1D3FCBDB4238`

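The node's key is empty (it's the root), and the node has value `0x02`.
It has two children.
The first is at child index `0`, has compressed path `0x01` and ID (in hex) `0x579eb3718a7e437d2ddce931ac7cc05a0bc695a9c2084f5df12fb96ad0fa3266`.
The second is at child index `14`, has compressed path `0x0F0F0F` and ID (in hex) `0x9845893c4f9d92c4e097fcf2589bc9d6882b1f18d1c2fc91d7df1d3fcbdb4238`.
The compressed paths above are written with one nibble per byte. In the serialized form they are packed two nibbles per byte and padded with a zero nibble, which is why they appear as `0x10` and `0xFFF0` below.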

```
+--------------------------------------------------------------------+
| Value existence flag (1 byte) |
| 0x01 |
+--------------------------------------------------------------------+
| Value length (varint) (optional) |
| 0x02 |
+--------------------------------------------------------------------+
| Value (variable length bytes) (optional) |
| 0x02 |
+--------------------------------------------------------------------+
| Number of children (varint) |
| 0x04 |
+--------------------------------------------------------------------+
| Child index (varint) |
| 0x00 |
+--------------------------------------------------------------------+
| Child compressed path length (varint) |
| 0x02 |
+--------------------------------------------------------------------+
| Child compressed path (variable length bytes) |
| 0x10 |
+--------------------------------------------------------------------+
| Child ID (32 bytes) |
| 0x579EB3718A7E437D2DDCE931AC7CC05A0BC695A9C2084F5DF12FB96AD0FA3266 |
+--------------------------------------------------------------------+
| Child index (varint) |
| 0x0E |
+--------------------------------------------------------------------+
| Child compressed path length (varint) |
| 0x06 |
+--------------------------------------------------------------------+
| Child compressed path (variable length bytes) |
| 0xFFF0 |
+--------------------------------------------------------------------+
| Child ID (32 bytes) |
| 0x9845893C4F9D92C4E097FCF2589BC9D6882B1F18D1C2FC91D7DF1D3FCBDB4238 |
+--------------------------------------------------------------------+
```

### Node Hashing

Each node must have a unique ID that identifies it. This ID is calculated by hashing the following values:
* The node's children
* The node's value digest
* The node's key

Specifically, we encode these values in the following way:

```
+----------------------------------------------------+
| Number of children (varint) |
+----------------------------------------------------+
| Child index (varint) |
+----------------------------------------------------+
| Child ID (32 bytes) |
+----------------------------------------------------+
| Child index (varint) |
+----------------------------------------------------+
| Child ID (32 bytes) |
+----------------------------------------------------+
|... |
+----------------------------------------------------+
| Value existence flag (1 byte) |
+----------------------------------------------------+
| Value length (varint) (optional) |
+----------------------------------------------------+
| Value (variable length bytes) (optional) |
+----------------------------------------------------+
| Key length (varint) |
+----------------------------------------------------+
| Key (variable length bytes) |
+----------------------------------------------------+
```

Where:
* `Number of children` is the number of children this node has.
* `Child index` is the index of a child node within the list of the node's children.
* `Child ID` is the child node's ID.
* `Value existence flag` is `1` if this node has a value, otherwise `0`.
* `Value length` is the length of the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise it is not serialized.
* `Value` is the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise it is not serialized.
* `Key length` is the number of nibbles in this node's key.
* `Key` is the node's key.

Note that, as with the node serialization format, the `Child index` values aren't necessarily sequential, but they are unique and strictly increasing.
Also like the node serialization format, there can be up to 16 blocks of children data.
However, note that child compressed paths are not included in the node ID calculation.

Once this is encoded, we hash the resulting bytes with SHA-256 to get the node's ID.

### Encoding Varints and Bytes

Varints are encoded with `binary.PutVarint` from the standard library's `encoding/binary` package.
Bytes are encoded by simply copying them onto the buffer.

## Design choices

### []byte copying