20190909-storage-node-automatic-updater.md

Storage Node Automatic Updater

Abstract

The Automatic Updater is a process that downloads the latest Storage Node binary and replaces the currently running one.

Background

As more Storage Node Operators join the network, we need to keep their nodes up to date. If a Storage Node doesn't meet the minimum version required by the satellites, it will no longer be able to offer services to the network. Currently we use Docker for updates, but due to its limitations on certain operating systems we need a better solution.

The Updater has several responsibilities:

  1. Determine whether something needs to be updated, taking the gradual rollout into account.
  2. Safely download the binaries.
  3. Safely update the binaries.
  4. Safely restart the binaries.

Design

The Updater takes several steps: contact the Version Server, download, update, and restart.

Checking for updates

The update check will regularly, with jitter, contact the Version Server, which responds with a message:

{
  "processes": {
    "storagenode": {
      "allowed_version_minimum": "0.3.4",
      "suggested_version": "0.5.1",
      "rollout": {
        "active": true,
        "target_version": "0.5.2",
        "rollout_seed": "04123bacde",
        "rollout_cursor": "40"
      }
    }
  }
}
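
The regular, jittered check described before the message could be scheduled along these lines. This is a minimal sketch: the function name `next_check_delay`, the base interval, and the jitter fraction are assumptions made for illustration and are not specified in this document.

```python
import random


def next_check_delay(base_seconds: float, jitter_fraction: float, rng: random.Random) -> float:
    """Return a delay spread uniformly around base_seconds.

    With jitter_fraction=0.25 the result lies in [0.75 * base, 1.25 * base],
    so nodes that started at the same moment do not all contact the
    Version Server at the same time.
    """
    if base_seconds <= 0 or not 0 <= jitter_fraction < 1:
        raise ValueError("base_seconds must be positive and jitter_fraction in [0, 1)")
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)
```

A caller would sleep for `next_check_delay(...)` seconds between checks, drawing a fresh delay each time.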

Regardless of whether a rollout is active, a process will confirm that it meets at least allowed_version_minimum. If it does not, and it is not part of a rollout, it will upgrade to at least suggested_version.
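
The minimum-version check can be sketched as follows. The field names come from the message above; the function names `parse_version` and `forced_target` are hypothetical, versions are assumed to be dot-separated integers, and the rollout case is deliberately left out.

```python
import json
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "0.5.1" into (0, 5, 1) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def forced_target(response_body: str, process: str, current: str) -> Optional[str]:
    """Return the version to upgrade to if current is below the allowed minimum.

    Returns None when the running version already meets the minimum.
    """
    info = json.loads(response_body)["processes"][process]
    if parse_version(current) >= parse_version(info["allowed_version_minimum"]):
        return None
    return info["suggested_version"]
```

Comparing integer tuples rather than strings avoids ordering "0.10.0" before "0.9.9".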

When a newer version is available, the node needs to calculate whether it should update. To check whether the rollout has reached this node, it calculates hash(rollout_seed, node_id) < rollout_cursor. The exact behavior may differ for canary nodes, which always get the latest version.
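
One way to read hash(rollout_seed, node_id) < rollout_cursor is sketched below. The document does not fix the hash function or the encoding of the fields, so this sketch assumes SHA-256, hex-encoded seed and cursor strings, and a byte-wise comparison over the length of the cursor; `rollout_reached` is a hypothetical name.

```python
import hashlib


def rollout_reached(rollout_seed_hex: str, node_id: bytes, rollout_cursor_hex: str) -> bool:
    """Return True if hash(rollout_seed, node_id) < rollout_cursor.

    Assumption: the hash is SHA-256 over the seed bytes followed by the
    node ID bytes, and only as many leading digest bytes as the cursor
    has are compared.
    """
    seed = bytes.fromhex(rollout_seed_hex)
    cursor = bytes.fromhex(rollout_cursor_hex)
    digest = hashlib.sha256(seed + node_id).digest()
    return digest[: len(cursor)] < cursor
```

Under these assumptions a cursor of "40" selects roughly a quarter of all nodes, and raising the cursor only ever adds nodes to the rollout. Changing the seed for each release would select a different subset of nodes each time.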

Canary releases

Canary nodes are the first storage nodes that will receive a new release. They will receive it even before the first nodes in the gradual rollout, so they bear the highest risk of receiving a bad release. The gradual rollout will begin only after confirming that canary nodes behave correctly after the update.

It shall be possible for a Storage Node Operator to opt in to the canary release channel. We expect that those will be mostly Storj employees and enthusiastic community members.

Canary nodes will query the Version Server on a different designated URL to find the latest canary version.

Downloading the binaries

Once we have decided on a new version, we need to download it. We will download the appropriate release from a trusted server and place it beside the current binary (instead of in a temporary directory).

Once the download has succeeded, we must verify that the binary signature is valid.

Possible problems:

  • Downloading could fail.
  • Out-of-space for downloading.
  • Filesystem is read-only.
  • Corrupted binary.
  • Man in the Middle attacks.
    • Verify binary hashes and the binary signature.
  • Downloaded file may be quarantined by the anti-virus or blocked by the firewall.
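
The placement beside the current binary and the hash check could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, SHA-256 is assumed as the hash, and signature verification is not shown because the document does not specify the signing scheme or key.

```python
import hashlib
import hmac
from pathlib import Path


def download_target(current_binary: Path, release: str) -> Path:
    """Path beside the current binary, e.g. storagenode.0.5.2.exe."""
    name = f"{current_binary.stem}.{release}{current_binary.suffix}"
    return current_binary.with_name(name)


def sha256_matches(path: Path, expected_hex: str) -> bool:
    """Hash the file in chunks and compare it with the expected digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 1 MiB at a time so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())
```

Downloading into the same directory keeps the later rename on one filesystem, and a mismatching hash means the file should be deleted and the download retried.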

Updating the binaries

To update the binaries we can take one of two approaches. The first is:

  1. Rename storagenode.exe into storagenode.old.<release>.exe.
  2. Rename storagenode.<release>.exe into storagenode.exe.
  3. Restart the service using Windows API.
  4. Delete storagenode.old.<release>.exe.

Alternatively this could be:

  1. Stop the service using Windows API.
  2. Rename storagenode.exe into storagenode.old.<release>.exe.
  3. Rename storagenode.<release>.exe into storagenode.exe.
  4. Start the service using Windows API.
  5. Delete storagenode.old.<release>.exe.

Usually automatic updaters prefer the first approach because it allows in-place updating of the same binary that is doing the updating.
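
The first approach can be sketched as below. The `restart` callback stands in for the Windows API call, which is not specified here, and restoring the old binary when the second rename fails is an added assumption rather than part of the listed steps.

```python
import os
from pathlib import Path
from typing import Callable


def swap_and_restart(directory: Path, release: str, restart: Callable[[], None]) -> None:
    """Rename the old binary away, move the new one into place, restart, clean up."""
    current = directory / "storagenode.exe"
    downloaded = directory / f"storagenode.{release}.exe"
    old = directory / f"storagenode.old.{release}.exe"

    os.replace(current, old)  # step 1: storagenode.exe -> storagenode.old.<release>.exe
    try:
        os.replace(downloaded, current)  # step 2: storagenode.<release>.exe -> storagenode.exe
    except OSError:
        os.replace(old, current)  # assumption: put the original back if the swap fails
        raise
    restart()  # step 3: restart the service (hypothetical callback)
    old.unlink()  # step 4: delete the old binary
```

On Windows, deleting the old binary can fail while the previous process is still shutting down, so a real updater would likely retry step 4.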

Possible problems:

  • Computer crashes during swapping.
    • Automatic updater checks binary version and reruns download/swap steps.
  • Deletion/stopping of the old binary fails.
  • Out-of-space during migrations.
  • Failure to start.
  • Not yet configured.
    • Storage node will run a setup or describe how to fix the problem...?
  • Anti-virus or other protection prevents new binary from starting.
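
For the crash-during-swap case, the updater can work out where it stopped from the files on disk and the version the installed binary reports. The sketch below is one possible reading: the action names and `resume_step` are hypothetical, and how the binary version is obtained is left to the caller.

```python
from pathlib import Path


def resume_step(directory: Path, release: str, binary_version: str) -> str:
    """Decide what to redo after an interrupted update, based on what is on disk."""
    current = directory / "storagenode.exe"
    downloaded = directory / f"storagenode.{release}.exe"
    old = directory / f"storagenode.old.{release}.exe"

    if not current.exists():
        # The crash happened between the two renames.
        if downloaded.exists():
            return "swap"
        return "restore" if old.exists() else "download"
    if binary_version == release:
        # The swap finished; only the old binary may be left over.
        return "cleanup" if old.exists() else "done"
    # Still on the previous version: reuse a finished download or start over.
    return "swap" if downloaded.exists() else "download"
```

A leftover download may be incomplete, so it should be verified again before the "swap" action runs.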

If the service fails to start then we should try to report and/or correct the issue.

Rollbacks

There will be cases when, despite our best efforts, we will release a bad version. In such a case, storage nodes which got the update will malfunction.

We won't support rolling back to the previous version.

To mitigate the risk, we will have canary releases and gradual release rollout. If a problem with the new version is identified, we will:

  1. Stop the canary releases - set the canary channel to the previous last known good version.
  2. Stop the gradual release rollout.
  3. Prepare a new patch version with the fix.
  4. Update the canary channel to the new patch version.
  5. If canary nodes report successful fix, restart the gradual release rollout with the new patch version.

Implementation

Initially, we can implement a basic version of the auto-updater that matches the docker watchtower in features, so that we can start supporting Windows Home sooner instead of waiting until all features described in this document are implemented.

Basic auto-updater:

  • Update the version server to return the suggested version for storage nodes.
  • Create an automatic updater that just checks the suggested version and updates the binary.
  • Make the automatic updater part of the installer.
  • Test the system with anti-viruses and firewalls.

Later we can add:

  • Canary releases
    • Update the version server with a canary release channel.
    • Update the auto-updater with a feature to opt-in to the canary release channel.
  • Rollout updates
    • Update the version server to include rollout information.
    • Update the auto-updater to check the rollout information.
    • Write a document describing how to start and stop rollouts.

Open issues

  • Should we try to update new and small nodes first to further mitigate the impact of bad releases?
    • Storage Node Operators of new nodes are expected to check their logs more frequently.
    • Is this possible at all using the jitter?
  • We need to define the Web API of the Version Server.