[feature]: backup and recovery

# Background

As a prerequisite for officially supporting Bitcoin `mainnet` for taproot assets, we need to think carefully about how we approach backup and recovery of `tapd` data. Not only the assets are at stake but also the BTC of the anchoring transaction output: you can't spend the BTC that carries assets without being able to reconstruct the full asset tree.

This issue serves as a collection/brainstorm issue around everything related to data safety, backups and recovery procedures.

## Documentation

Similar to the [`lnd` Operational Safety Guidelines](https://github.com/lightningnetwork/lnd/blob/master/docs/safety.md) document, we'll want a doc that describes the different data sources, what they are used for and how to best prevent loss thereof.
The document should (at least) describe the following key items:
 - What is the relationship between asset public/private keys (e.g. `script_key`s) and `lnd`'s wallet/seed?
 - What data is required in order to recover both the assets and the BTC of a taproot asset output?
 - What data is stored where (`tapd`'s database, `lnd`'s wallet database, `lnd`'s channel database)?
 - Where does `tapd` store its files and which files need to be backed up regularly?
 - How can the `tapd` database be set up in a production ready manner?
 - How does recoverability differ between using a public universe and using a private one? (See further below).

## How to prevent database loss

As long as the `tapd` database is fully intact and the seed for the `lnd` wallet is known, all funds are SAFU.
Having a replicated (or at least regularly backed up) state of the DB should therefore be the highest priority.
We should test and then document the following ways of setting up a database cluster or streaming replication:
 - Using a [Postgres database cluster](https://www.postgresql.org/docs/current/creating-cluster.html) as a database backend: This is already possible and is the recommended way of running `tapd` in a production environment. We'll want to document some setup recommendations and best practices around this though.
 - Add support for low-level SQLite replication, perhaps using something like https://github.com/benbjohnson/litestream.

## How to recover from full database loss

Even though keeping the `tapd` database intact should always be the highest priority, the reality is that users often don't realize that uninstalling and re-installing an app on platforms like Umbrel causes all data to be deleted. So because we want to ship `tapd` as part of Lightning Terminal, which is available on such platforms, we need to have a strategy for basic recovery of assets and BTC for the case when the full `tapd` database is lost.

Possible approaches:
 - Keep a single file (similar to the [SCB](https://github.com/lightningnetwork/lnd/blob/master/docs/safety.md#static-channel-backups-scbs) file used in `lnd` for static channel information) around that is updated on every mint, send and receive and keeps track of the latest on-chain output and proof chain, as well as the universe information. The file would basically contain all the information to be able to recover the asset and BTC funds, but not the transaction history. Then mobile and other platform apps would only need to make sure to create an off-device backup of that file whenever it changes.
 - When public multiverses are used, the information available in `lnd` could be enough to query those multiverses for the data required to recover access to asset and BTC funds. This requires the `lnd` wallet database to be fully intact though, as some of this information is added to the wallet DB by `tapd` and is not recoverable through a simple `lnd` wallet restore from seed.
    - Query the `lnd` wallet for unspent p2tr outputs that aren't BIP-0086, then look up the multiverse for assets related to those outpoints (this will only work if the asset anchoring transaction has a change output that goes back to the `lnd` wallet, because the actual asset anchoring output will not be recognized as "belonging" to the `lnd` wallet). This will work for asset mints and asset change outputs.
    - Query the `lnd` wallet for any specifically registered tapscript addresses, then look those outpoints up in the multiverse to recover asset proofs. This will work for assets received through taproot asset addresses (non-interactive receives). However, the tapscript addresses aren't directly derived from the seed, so this won't be possible if the `lnd` wallet was recovered from seed.
 - With v2 addresses a user can fetch encrypted messages from the universe/authmailbox server if the wallet has the script key of the address. The authmailbox server shouldn't delete messages for unspent outputs, so a recovery should always be possible.

### New universe RPCs required for multiverse proof lookup

To allow some of the multiverse lookups described above, we might need additional indexes into the universe/multiverse tree structure:
 - Today we have `assetID => outpoint || scriptKey`
 - Might also need `outpoint => assetID || scriptKey` and `scriptKey => assetID || outpoint`
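
To make the three lookup directions concrete, here is a minimal in-memory sketch. It is not the universe/multiverse tree implementation: `LeafKey`, `Index` and their methods are hypothetical names, and all identifiers are simplified to strings where real code would use proper outpoint, asset ID and public key types.

```go
package main

import "fmt"

// LeafKey identifies one universe leaf. The fields are plain strings for
// this sketch only.
type LeafKey struct {
	AssetID   string
	Outpoint  string
	ScriptKey string
}

// Index keeps the existing lookup direction (by asset ID) next to the two
// additional directions discussed above (by outpoint and by script key).
type Index struct {
	byAssetID   map[string][]LeafKey
	byOutpoint  map[string][]LeafKey
	byScriptKey map[string][]LeafKey
}

// NewIndex returns an empty index with all maps initialized.
func NewIndex() *Index {
	return &Index{
		byAssetID:   make(map[string][]LeafKey),
		byOutpoint:  make(map[string][]LeafKey),
		byScriptKey: make(map[string][]LeafKey),
	}
}

// Add registers the leaf under all three keys.
func (i *Index) Add(k LeafKey) {
	i.byAssetID[k.AssetID] = append(i.byAssetID[k.AssetID], k)
	i.byOutpoint[k.Outpoint] = append(i.byOutpoint[k.Outpoint], k)
	i.byScriptKey[k.ScriptKey] = append(i.byScriptKey[k.ScriptKey], k)
}

// ByOutpoint answers `outpoint => assetID || scriptKey`.
func (i *Index) ByOutpoint(op string) []LeafKey { return i.byOutpoint[op] }

// ByScriptKey answers `scriptKey => assetID || outpoint`.
func (i *Index) ByScriptKey(sk string) []LeafKey { return i.byScriptKey[sk] }

func main() {
	idx := NewIndex()
	idx.Add(LeafKey{AssetID: "asset-a", Outpoint: "txid1:0", ScriptKey: "key-1"})
	idx.Add(LeafKey{AssetID: "asset-b", Outpoint: "txid1:0", ScriptKey: "key-2"})

	// One anchor outpoint can carry several assets.
	fmt.Println(len(idx.ByOutpoint("txid1:0")))
	fmt.Println(idx.ByScriptKey("key-2")[0].AssetID)
}
```

The sketch also shows why the extra directions are sensitive: anyone who can call `ByOutpoint` learns every asset anchored in an output they observe on chain.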

These new lookup methods might make it easy to enumerate assets in transfers observed by third parties and might therefore not be optimal for privacy. We should attempt to implement all recovery procedures without relying on those new lookup methods.


## Structure of on-chain asset recovery file (Chain Asset Recovery File, CARF?)

We periodically (see triggers below) create a flat file containing:
   - All TAP addresses known to the daemon, including full internal key and script key information (descriptor+locator) and proof courier address.
   - All asset outpoints (on-chain outpoint + asset ID + group key + script key) of all currently unspent asset outputs.
      - We should also attempt to store what universe URL we used to sync each asset, which we currently don't store.
      - Alternatively we can just save all currently configured federation servers to the recovery file.
   - All genesis, meta and group key reveal information for our _owned_ assets.
      - We might need to exclude the actual meta _data_ as that can be up to 1 MiB and make the file very big very quickly. We should be able to sync that data again when recovering, so it shouldn't be lost for good if we exclude it from the recovery file.
   - All internal and script key information (descriptor+locator) to make sure we cover any keys derived for asset channel operations.

The file should be encoded as TLV and encrypted the same way the `lnd` SCB file is (using the special `lnd` key path used specifically for the SCB encryption).
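
As an illustration of the type-length-value idea only, the following sketch encodes and decodes a flat stream of records. It is deliberately simplified and is an assumption throughout: it uses a fixed 2-byte type and 4-byte length, whereas the TLV format used in `lnd` encodes both as BigSize varints, and the record type numbers are made up for this example. Encryption is left out.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Record is one type-length-value entry of the recovery file.
type Record struct {
	Type  uint16
	Value []byte
}

// Hypothetical record types, chosen for this sketch only.
const (
	TypeOutpoint uint16 = 1
	TypeAssetID  uint16 = 2
)

// EncodeRecords writes each record as: 2-byte type, 4-byte length, value.
func EncodeRecords(records []Record) []byte {
	var buf bytes.Buffer
	for _, r := range records {
		var hdr [6]byte
		binary.BigEndian.PutUint16(hdr[0:2], r.Type)
		binary.BigEndian.PutUint32(hdr[2:6], uint32(len(r.Value)))
		buf.Write(hdr[:])
		buf.Write(r.Value)
	}
	return buf.Bytes()
}

// DecodeRecords parses a stream produced by EncodeRecords and rejects
// truncated input.
func DecodeRecords(data []byte) ([]Record, error) {
	var out []Record
	r := bytes.NewReader(data)
	for r.Len() > 0 {
		var hdr [6]byte
		if _, err := io.ReadFull(r, hdr[:]); err != nil {
			return nil, errors.New("truncated record header")
		}
		length := binary.BigEndian.Uint32(hdr[2:6])
		if int64(length) > int64(r.Len()) {
			return nil, errors.New("truncated record value")
		}
		value := make([]byte, length)
		if _, err := io.ReadFull(r, value); err != nil {
			return nil, err
		}
		out = append(out, Record{
			Type:  binary.BigEndian.Uint16(hdr[0:2]),
			Value: value,
		})
	}
	return out, nil
}

func main() {
	enc := EncodeRecords([]Record{
		{Type: TypeOutpoint, Value: []byte("txid1:0")},
		{Type: TypeAssetID, Value: []byte{0xaa, 0xbb}},
	})
	dec, err := DecodeRecords(enc)
	fmt.Println(len(dec), err)
}
```

A decoder of this shape can skip record types it doesn't know, which would let newer versions add fields to the file without breaking older readers.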

The documentation should be updated to describe how to create a filesystem-based trigger (`inotifywait`) that backs up the file every time it changes. This gist can serve as an example: https://gist.github.com/alexbosworth/2c5e185aedbdac45a03655b709e255a3

## Triggers for updating the on-chain asset recovery file

We update (using the same atomic create-new-file-then-swap-in-place mechanism used by `lnd` for the SCB file) the recovery file whenever:
 - We start up the daemon
 - We create a new on-chain address
 - We import a new asset from a received proof
    - Whenever `ImportProof` of the asset store is called, which has the following origin paths (examples, potentially incomplete):
       - `ImportProof` RPC (deprecated and dev-only)
       - `RegisterTransfer` RPC for external (vPSBT) flows
       - `ChainPorter.storeProofs` for change outputs of asset sends or any action performed by `AuxSweeper`
       - `BatchCaretaker.storeMintingProofs` for newly minted assets
       - `Custodian.receiveProof` for new incoming transfers
       - `AuxSweeper.importCommitTx` for force closed asset channels
       - `AuxCloser.FinalizeClose` for coop closed asset channels

Those should be the main events at which a new _owned_ asset enters our database.
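
The create-new-file-then-swap-in-place pattern mentioned above can be sketched with the Go standard library as follows. This is a generic version of the pattern, not the code `lnd` uses for the SCB file; `writeFileAtomic`, `demoAtomicSwap` and the `assets.backup` file name are hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory as
// path, flushes it to disk and then renames it over path. Readers therefore
// see either the old or the new content, never a partially written file.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Best-effort cleanup. After a successful rename the temp file no
	// longer exists and the error from Remove is ignored.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// demoAtomicSwap writes two versions of a file into a scratch directory and
// returns the content that is visible afterwards.
func demoAtomicSwap() (string, error) {
	dir, err := os.MkdirTemp("", "recovery-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "assets.backup")
	for _, state := range []string{"state-1", "state-2"} {
		if err := writeFileAtomic(path, []byte(state)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	content, err := demoAtomicSwap()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(content)
}
```

The temporary file is created next to the target on purpose: a rename is only atomic within one file system, which matters if the recovery file lives on a separately mounted volume.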

Steps to complete:
- [ ] Create a backup subsystem that subscribes to the above notifications and updates a file on disk whenever a new event comes in
- [ ] Allow the full file system path of the above mentioned file to be configured as a config/CLI flag (so it could be on a different file system, like a mounted network file system)
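
A minimal sketch of how such a backup subsystem could be wired up is shown below. All names (`Event`, `EventKind`, `runBackupLoop`) are hypothetical and do not refer to existing `tapd` code; the loop simply rewrites the recovery file once for every event it receives and stops when the event channel is closed.

```go
package main

import "fmt"

// EventKind mirrors the triggers listed above.
type EventKind int

const (
	EventStartup EventKind = iota
	EventNewAddress
	EventProofImported
)

// Event is a notification that the set of owned assets or addresses may
// have changed.
type Event struct {
	Kind EventKind
}

// runBackupLoop calls writeFile once per incoming event and returns the
// number of successful writes once events is closed. It stops at the first
// write error so the caller can decide how to react.
func runBackupLoop(events <-chan Event, writeFile func() error) (int, error) {
	writes := 0
	for range events {
		if err := writeFile(); err != nil {
			return writes, fmt.Errorf("updating recovery file: %w", err)
		}
		writes++
	}
	return writes, nil
}

func main() {
	events := make(chan Event, 3)
	events <- Event{Kind: EventStartup}
	events <- Event{Kind: EventNewAddress}
	events <- Event{Kind: EventProofImported}
	close(events)

	// A real implementation would serialize the current state and write it
	// to the configured path here.
	n, err := runBackupLoop(events, func() error { return nil })
	fmt.Println(n, err)
}
```

In a daemon the loop would run in its own goroutine, with the subsystems that create addresses or import proofs sending into the channel.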

## Draft of recovery procedure

From the most recent available recovery file the user can attempt to restore the assets available at that point.
This _should_ be used on a fresh/empty `tapd` node. But because everything in the database should be implemented using _upserts_, it should theoretically also work on an existing node.
During implementation we should probably consider users testing this on their existing node with data already present in their database. Nothing catastrophic should happen in that case.

Potentially we could even use the recovery file as a lightweight tool to move/migrate assets from one `tapd` instance to another (ONLY if they were connected to the same `lnd` instance of course, since the file WILL NOT CONTAIN ANY KEY MATERIAL, only key descriptor+locator information).

Sketch of recovery procedure:
 - User calls new `RecoverChainAssets` RPC
 - Encrypted recovery file is decrypted using the `lnd` wallet's special SCB key. If decryption fails, the file was created using another `lnd` backing node and the process must be aborted.
 - Upsert all genesis/meta/group key information from the file into the database.
 - Upsert all internal and script key information from the file into the database.
 - Upsert all TAP addresses (including script and internal keys) into the database.
    - For all V0/V1 TAP addresses:
       - Derive on-chain Taproot output key, add to `lnd` wallet, request chain rescan (the rescan is the main part that doesn't happen after importing an address normally, because when we create an address for the first time, we don't expect there to already be outputs for it)
       - For all unspent outputs found by `lnd` after the rescan, query all available universe servers for available proofs (skip hashmail couriers, as those likely won't be available anymore), import found proofs (this is mostly to cover any assets sent to existing addresses in the time between the moment the backup was created and the moment the recovery was issued, all other assets should be covered in the list of asset outpoints processed in the step below).
    - For all V2 TAP addresses (this process should already be kicked off by the `Custodian` after importing a new address):
       - Connect/subscribe to the authmailbox using the script key of the TAP address and the authmailbox address specified by the TAP address' proof courier address.
       - For each message received, import all proofs.
 - For each asset outpoint (on-chain outpoint + asset ID + group key + script key) in the recovery file, attempt to fetch and import the full proof provenance using all available universe servers.
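
The ordering of the steps above, and the reason upserts make a repeated run harmless, can be sketched like this. Everything here is hypothetical: `RecoveryFile`, `memStore` and `Recover` are stand-ins rather than `tapd` types, the database is a map, decryption is omitted, and the network-facing steps are only returned as a list of follow-up actions instead of being executed.

```go
package main

import "fmt"

// Address is a TAP address from the recovery file.
type Address struct {
	Encoded string
	Version int // 0, 1 or 2
}

// RecoveryFile is the already decrypted and decoded file content.
type RecoveryFile struct {
	Genesis   []string
	Keys      []string
	Addresses []Address
	Outpoints []string
}

// memStore stands in for the database. Writing to a map key twice leaves a
// single row, which is the upsert behavior the procedure relies on.
type memStore struct {
	rows map[string]struct{}
}

func (s *memStore) upsert(table, key string) {
	s.rows[table+"/"+key] = struct{}{}
}

// Recover applies the file to the store in the order sketched above and
// returns the follow-up actions that need chain or network access.
func Recover(s *memStore, f RecoveryFile) []string {
	var actions []string
	for _, g := range f.Genesis {
		s.upsert("genesis", g)
	}
	for _, k := range f.Keys {
		s.upsert("keys", k)
	}
	for _, a := range f.Addresses {
		s.upsert("addrs", a.Encoded)
		if a.Version >= 2 {
			// V2: proofs arrive through the authmailbox.
			actions = append(actions, "subscribe-authmailbox:"+a.Encoded)
		} else {
			// V0/V1: import the output key and rescan the chain.
			actions = append(actions, "import-and-rescan:"+a.Encoded)
		}
	}
	for _, op := range f.Outpoints {
		actions = append(actions, "fetch-proof:"+op)
	}
	return actions
}

func main() {
	store := &memStore{rows: make(map[string]struct{})}
	file := RecoveryFile{
		Genesis:   []string{"asset-a"},
		Keys:      []string{"key-1"},
		Addresses: []Address{{Encoded: "addr-v1", Version: 1}, {Encoded: "addr-v2", Version: 2}},
		Outpoints: []string{"txid1:0"},
	}
	fmt.Println(Recover(store, file))

	// Running the recovery again must not create additional rows.
	Recover(store, file)
	fmt.Println(len(store.rows))
}
```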

Steps to complete:

- [ ] Create an RPC that takes a backup file and inserts all the information in it, so that the addresses/assets/transfers are fully restored in an empty (optionally also non-empty) node according to the procedure sketched above
- [ ] (optional) Create a new RPC that on demand returns the current content of the backup file as a binary blob
- [ ] (optional) Create a new streaming RPC that emits an event whenever the backup notification service signals that the backup file was updated

## Recovery of assets in asset Lightning Channels

As with normal (BTC-only) Lightning Channels, emergency recovery in case of data loss depends on the peer force closing a channel (and watchtowers providing the incentive to not publish an old state).
The Static Channel Backup (SCB) file created by `lnd` contains the information required to contact a peer through the LN p2p network and request a force close of a channel.
The SCB file currently contains all the necessary information (combined with tracking progress on-chain) to then sweep the BTC funds from a remotely force closed channel.

To be able to sweep the assets contained in an asset channel, a recovering node also needs to be able to find the exact asset outpoints (on-chain outpoint + asset ID + group key + script key) of their balance in order to query a universe for the full proof.

There are two ways (maybe more?) in which this can be achieved:
 1. Whenever the p2p connection to a peer is established (`channel_reestablish` message), we expect the peer to include the current asset output distribution in the custom TLV part of that message. See https://github.com/lightninglabs/taproot-assets/issues/426#issuecomment-2407084151.
 2. Upgrade the asset channel code to use the `authmailbox` feature to send the list of outputs to the mailbox server (similar to how v2 addresses do that for grouped address receives), using the channel's `to_remote` internal key (`toRemoteTree.InternalKey`) as the encryption/receiver key. Then the `lnd` SCB file simply needs to contain a custom blob with the authmailbox proof courier address and the recovering node can pull the information from there, authenticating itself with the receiver key.

