RFD 0169: Automatic Updates for Agents #40190

sclevine · 2024-04-03T22:38:01Z

This RFD proposes a new mechanism for Teleport agents to automatically update to a version set by an operator via tctl.

github-actions · 2024-04-03T22:38:35Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-04-03T22:39:21Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-04-03T22:40:32Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-04-03T22:41:05Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

rfd/0169-auto-updates-linux-agents.md

github-actions · 2024-04-04T20:32:47Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-04-04T20:33:45Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

rfd/0169-auto-updates-linux-agents.md

russjones · 2024-04-05T00:18:51Z

rfd/0169-auto-updates-linux-agents.md

+active_version: 15.1.1
+```
+
+### Runtime


Can you create another section called ### Installer?

One of the biggest problems I saw during during our push to get customers to switch to automatic updates was the number of different ways we have to install Teleport.

I'd like us to consolidate on a single way to install agents that 90% of our customers use. Ideally anything we build should work for self-hosted and Cloud similar to client tools updates.

The good news I think you have the answer.

$ apt-get install teleport-updater $ teleport-updater enable --proxy proxy.example.com

This works for Cloud, but would also work for self-hosted customers. We just have to work through the edge cases.

Otherwise, what's left?

We need to make sure all of our install scripts are updated to use the new installation method above, for example the discover script and curl | bash on our downloads page. Please enumerate all the scripts/locations that will need to be updated.

We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPs installations.

We need to make sure the teleport-updater is available for all the different architectures we support (I think this is already the case).

We update our documentation to reference the new installation instructions. 1 2 3 4

Dashboard tenants downloads tab is updated to reference new instructions.

Let's make sure that's in-scope for this RFD.

We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPs installations.

I think we can put this information into the ping endpoint. Then you no longer have to worry about which binary you need to install (OSS, Enterprise, FIPS Enterprise), you just install the teleport-updater and it downloads and runs this right version.

Something we should consider: the apt-get install && teleport-updater enable approach works fine if you running the commands on the host (or running something like Ansible), what would we recommend to someone that builds an AMI using this new install method?

I think I've covered everything above:

Added new "Installers" section with scripts changes and VM/container image instructions

Added new "Documentation" section with links and notes

New server_edition field is added to ping, along with associated changes to "Runtime" section logic

Let me know if this looks good.

If we make this new installer the default for everything (which I think is a great approach, we must reduce complexity) I would suggest going further by making the new binary the only recommended way of installing teleport, even without Automatic Updates.

We could name it teleport-version-manager or whatever name that doesn't imply it requires AUs and expose the following commands:

tvm update as already described in the RFD

tvm follow/tvm unfollow to follow updates from a proxy, this is the current RFD enable/disable.

tvm install <X.Y.Z|teleport.example.com> would be an additional command not enabling AUs, but installing a static version (given, or from a proxy).

I think this would answer

what would we recommend to someone that builds an AMI using this new install method

We would recommend "install tvm, and run tvm install teleport.example.com (for non-AU setups) or tvm follow teleport.example.com (for AU) on startup".

Would tvm use-version vX.Y.Z or tvm use vX.Y.Z be better?

tvm use sounds better to me, but I don't have a strong opinion.

Happy to rename. I'll wait for a few others to chime in, and change it when we have some consensus -- names can be contentious 🙂

Even though I really want to refactor our installation process, looking at our capacity for Q2, I don't think we can.

So I am in-favor of keeping it simple and solving the issue for Cloud customers first then coming back in Q3 and fixing our installation problems as a whole.

Here is what I am thinking.

We keep the name teleport-updater to not introduce too many changes at once.

Everywhere we are currently installing our auto updater, we only update it to install the new one.

Scope out self-hosted installation.

What do you guys think?

👍 Works for me. Maybe there are two phases here:

Ship this RFD, which minimizes user-facing changes

Re-design UX, simplify scripts, etc.

If we postpone the simplification we must absolutely ensure we do it the next quarter.

The installation methods are already a mess and we're adding new methods, instructions and scripts. This is not maintainable and self-hosted customers are already complaining about how hard it is to just install teleport. This is not an issue we can solve with docs. We've tried for the last year and the installation docs only became worse and more obscure since I joined Teleport 2 years ago.

I'm against adding new ways without removing the old ones but I fear that we'll do it anyway.

Why did we scope out self-hosted installations?
Is there extra work for those scenarios? Except for docs around the autoupdate_version resource?

Adding more complexity to distinguish between cloud/self-hosted in the installer scripts seems we are shifting the complexity to something which is way hard to test/maintain.

rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

github-actions · 2024-04-05T02:17:00Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

github-actions · 2024-04-05T02:20:26Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

github-actions · 2024-04-05T02:24:27Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

zmb3

First pass. I think I need to read it a few more times to get up to speed, but sometimes feedback from someone seeing something for the first time is helpful, so here you go.

zmb3 · 2024-09-09T19:26:00Z

rfd/0169-auto-updates-linux-agents.md

+Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`.
+The version and edition served from that endpoint will be configured using new `autoupdate_version` resource.
+
+Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on:


What exactly is Teleport updater? (I assume this is a new tool we're building, but this is my first read through of the RFD and the term was introduced without context)

If it is a new program, do we need a separate program? Managing the version of some updater and the version of Teleport simultaneously has proven difficult. Have we considered exposing the updater as a subcommand of Teleport?

If the updater itself was part of Teleport, then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.

What exactly is Teleport updater?

Sorry, this was more clear when the Runtime section was earlier in the document. This is a separate binary, in tools/teleport-update. The binary is versioned with Teleport, but may be distributed separately from the other Teleport binaries (similar to tbot or tsh).

do we need a separate program?

The updater is intentionally not part of the teleport binary for a few reasons:

teleport-update is distributed separately from teleport, and without any of the other binaries, to bootstrap an initial agent installation. Including all of Teleport would mean that security scanners would match all Teleport dependencies in the bootstrap/update artifact that could be vulnerable, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often.

teleport-update will also handle long-running tbot updates in the future, and I hesitate to distribute teleport with tbot.

teleport-update's API contract with the cluster must be extremely stable. The first GA version of teleport-update will need to work with Teleport v14+, and should ideally work with all future versions of Teleport. This seems (a bit) easier to manage with a separate binary.

then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.

We will not always have an authenticated connection to auth to receive these instructions. E.g., the agent could be a failed state due to a botched upgrade, temporarily stopped, or newly installed. In the future, tbot-only installations may have expired certificates. Keeping this unauthenticated reduces complexity. The information exposed is minimal, and the API has been reviewed by Doyensec.

(Will add this context to the RFD.)

zmb3 · 2024-09-09T19:49:08Z

rfd/0169-auto-updates-linux-agents.md

+
+The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events.
+
+The following data related to the rollout are stored in each instance heartbeat:


Do all agents already perform instance heartbeats?

rfd/0169-auto-updates-linux-agents.md

zmb3 · 2024-09-09T19:59:47Z

rfd/0169-auto-updates-linux-agents.md

+
+```shell
+$ apt-get install teleport-ent-updater
+$ teleport-update enable --proxy example.teleport.sh


Just so I'm clear, teleport-update enable would actually download teleport if it's not already installed?

By default, yes. This ensures the configuration is valid, and that further updates can be applied. We may want to provide a --no-update flag in the future, so e.g., AMIs won't need changes to cloud-init to do the initial installation.

zmb3 · 2024-09-09T20:02:27Z

rfd/0169-auto-updates-linux-agents.md

+updates.yaml:
+```
+version: v1
+kind: agent_versions


This looks an awful lot like a teleport resource, but it seems this is just a configuration file for the updater? If so, I might not use the same YAML structure with a version and kind to avoid people thinking that this is a Teleport resource that they need to tctl create. (Similar to how our access plugins or event handler are configured)

@russjones requested this format explicitly

Original thread: #40190 (comment)

(I feel like it's confusing as well)

zmb3 · 2024-09-09T20:03:33Z

rfd/0169-auto-updates-linux-agents.md

+  active_version: 15.1.1
+```
+
+backup.yaml:


What is this supposed to do? I'm not sure if I missed something above or just haven't got to it yet, but I'm finding it hard to comprehend with just a snippet of YAML and no other commentary.

Additionally, when I see kind: db_backup nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?

Thanks, I'll make this more explicit in the section below.

It's metadata about the sqlite DB backup, referred to here: https://github.com/gravitational/teleport/pull/40190/files#diff-4fb277bd3d884c7b045e7efad61111ac4e7957c063e0c0c9281f4e74fd72997fR571

(Users don't need to worry about this file -- it's an implementation detail.)

nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?

This file is not a Teleport resource (see #40190 (comment)). It's machine-local configuration storage for individual Teleport backups. That said, the metadata is descriptive of any backup of Teleport's sqlite DB, so it could describe non-agent data.

sclevine · 2024-09-30T16:27:31Z

rfd/0169-auto-updates-linux-agents.md

+The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator.
+In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local
+admin socket (tctl on local machine).
+
+> [!NOTE]
+> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport.
+> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be
+> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations.
+> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl.
+> 
+> Solving this problem is out of the scope of this RFD.


Let's defer this note to the Cloud RFD

sclevine · 2024-09-30T16:29:10Z

rfd/0169-auto-updates-linux-agents.md

@@ -479,9 +536,12 @@ However, any changes to `agent_schedules` that occur while a group is active wil
 Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates.

 Note that the `default` schedule applies to agents that do not specify a group name.
+[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?]


We got feedback that the boolean was confusing during the feedback sessions. No strong opinion. How about last-as-default, but only if no default group is explicitly defined?

sclevine · 2024-09-30T16:34:56Z

rfd/0169-auto-updates-linux-agents.md

        days: [ “Sun”, “Mon”, ... | "*" ]
        # start_hour specifies the hour when the group may start upgrading.
        #  default: 0
        start_hour: 0-23
        # wait_days specifies how many days to wait after the previous group finished before starting.
        #  default: 0
        wait_days: 0-1
+        # TODO: is this needed? In which case a customer would need to set a custom jitter?


Okay to drop the configuration option for now. Thought it might be important for self-hosted customers with many agents. I would like the jitter to continue to be served from the cluster in case we need to adjust it.

sclevine · 2024-09-30T16:35:51Z

rfd/0169-auto-updates-linux-agents.md

@@ -454,23 +512,22 @@ spec:
  agent_schedules:
    regular:
    - name: default
-      days: ["*"]
+      days: ["*"] # TODO: restrict to work week? Minus Friday?


Was assuming the restriction would be a Cloud-only setting, and self-hosted customers would have full control.

sclevine · 2024-09-30T16:36:58Z

rfd/0169-auto-updates-linux-agents.md

+- the installation process is complex and users end up installing the wrong version of Teleport
+- the current update process does not provide safeties to protect against broken updates
+- many customers are not adopting the existing updater because they want to control when updates happen
+- we don't offer a ni


(missing words)

rosstimothy · 2024-10-01T20:40:36Z

rfd/0169-auto-updates-linux-agents.md

+I create a new deployment, with a broken version. The version is deployed to the canaries.
+The canaries crash, the updater reverts the update, the agents connect back online and
+advertise they rolled-back. The maintenance is stuck until the canaries are running the target version.


There is zero mention of a canary prior to this and until I scrolled down and realized that canaries were now part of the upgrade plan I had no idea what this text was talking about. Without having a high level understanding for how this system is supposed to work some of this UX section is a bit hard to grok.

This RFD is currently being used for Hugo and I to collaborate. We'll have a new RFD open before Thursday.

sclevine · 2024-10-03T19:24:19Z

Continued in #47126

Create 0169-auto-updates-linux-agents.md

33031d8

github-actions bot requested review from fspmarshall and lxea April 3, 2024 22:38

github-actions bot added rfd Request for Discussion size/md labels Apr 3, 2024

sclevine requested review from russjones and bernardjkim April 3, 2024 22:39

Fix github handle

c453124

Fix Github handle

796fa9e

michellescripts self-requested a review April 4, 2024 18:15

bernardjkim reviewed Apr 4, 2024

View reviewed changes

rfd/0169-auto-updates-linux-agents.md Show resolved Hide resolved

rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved

rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved

Clarify jitter flag

1b75941

Remove time question

e2811de

russjones reviewed Apr 5, 2024

View reviewed changes

Update rfd/0169-auto-updates-linux-agents.md

a119c60

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

Update rfd/0169-auto-updates-linux-agents.md

2a8cdc7

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

Update rfd/0169-auto-updates-linux-agents.md

1f3278d

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

russjones added the no-changelog Indicates that a PR does not require a changelog entry label Apr 5, 2024

sclevine added 3 commits April 5, 2024 12:00

Update 0169-auto-updates-linux-agents.md

05aad92

Update 0169-auto-updates-linux-agents.md

ed4780d

Update 0169-auto-updates-linux-agents.md

5bb6056

sclevine added 3 commits September 4, 2024 11:25

security feedback + naming adjustments

9bcd324

tweak rollout paging

e748820

tweak rollout paging again

4f93a7f

sclevine mentioned this pull request Sep 6, 2024

Proof-of-concept draft for teleport-update binary implementation #46357

Closed

zmb3 reviewed Sep 9, 2024

View reviewed changes

sclevine mentioned this pull request Sep 10, 2024

Add teleport-update binary scaffolding and disable command #46418

Merged

sclevine and others added 11 commits September 9, 2024 22:19

feedback

aff1df3

adjust update.yaml to match implementation feedback

c91977f

wip - new model

ec8d675

canaries

7c89fb6

canary 2

2b95f8e

describe state, transitions, and proxy response

ce6de47

rpcs

26c43b0

finish rpcs

7d0f618

minor tweaks

69d758c

Add user stories

430b7a4

Put new requirements at the top + edit UX + add TODOs

4ac0e9c

sclevine commented Sep 30, 2024

View reviewed changes

rosstimothy reviewed Oct 1, 2024

View reviewed changes

sclevine marked this pull request as draft October 1, 2024 20:56

hugoShaka and others added 6 commits October 2, 2024 12:34

Edition work

e87e3dc

cleanup + swap phases 1 and 2

fecefc7

Move protobuf

e7b1c10

Add installation scenarios

2a5515e

cleanup + move backpressure formulas

4b33f2a

more cleanup

0a0d658

sclevine mentioned this pull request Oct 3, 2024

RFD 0184: Automatic Updates for Teleport Agents #47126

Merged

sclevine closed this Oct 3, 2024

sclevine mentioned this pull request Oct 15, 2024

[teleport-update] Add enable command #47565

Merged


		The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events.

		The following data related to the rollout are stored in each instance heartbeat:

RFD 0169: Automatic Updates for Agents #40190

RFD 0169: Automatic Updates for Agents #40190

Conversation

sclevine commented Apr 3, 2024 • edited Loading

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 4, 2024

github-actions bot commented Apr 4, 2024

russjones Apr 5, 2024 • edited Loading

Choose a reason for hiding this comment

russjones Apr 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugoShaka Apr 18, 2024 • edited Loading

Choose a reason for hiding this comment

sclevine Apr 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugoShaka May 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Apr 5, 2024

github-actions bot commented Apr 5, 2024

github-actions bot commented Apr 5, 2024

zmb3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sclevine Sep 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sclevine Sep 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sclevine commented Oct 3, 2024

sclevine commented Apr 3, 2024 •

edited

Loading

russjones Apr 5, 2024 •

edited

Loading

russjones Apr 5, 2024 •

edited

Loading

hugoShaka Apr 18, 2024 •

edited

Loading

sclevine Apr 19, 2024 •

edited

Loading

hugoShaka May 17, 2024 •

edited

Loading

sclevine Sep 9, 2024 •

edited

Loading

sclevine Sep 10, 2024 •

edited

Loading