RFD 0169: Automatic Updates for Agents #40190

Closed
84 commits
33031d8
Create 0169-auto-updates-linux-agents.md
sclevine Apr 3, 2024
c453124
Fix github handle
sclevine Apr 3, 2024
796fa9e
Fix Github handle
sclevine Apr 3, 2024
1b75941
Clarify jitter flag
sclevine Apr 4, 2024
e2811de
Remove time question
sclevine Apr 4, 2024
a119c60
Update rfd/0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
2a8cdc7
Update rfd/0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
1f3278d
Update rfd/0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
05aad92
Update 0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
ed4780d
Update 0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
5bb6056
Update 0169-auto-updates-linux-agents.md
sclevine Apr 5, 2024
74a452e
add editions
sclevine Apr 5, 2024
63c9a35
Installers and docs
sclevine Apr 8, 2024
a0a912f
Update 0169-auto-updates-linux-agents.md
sclevine Apr 8, 2024
6371c82
Update 0169-auto-updates-linux-agents.md
sclevine Apr 8, 2024
1022633
Update 0169-auto-updates-linux-agents.md
sclevine Apr 8, 2024
af20fe2
Update 0169-auto-updates-linux-agents.md
sclevine Apr 8, 2024
7fd207d
Update 0169-auto-updates-linux-agents.md
sclevine Apr 8, 2024
27774cb
Downgrades
sclevine Apr 15, 2024
57fc557
Feedback
sclevine May 13, 2024
bc28150
Update 0169-auto-updates-linux-agents.md
sclevine May 13, 2024
3da6525
Remove last working copy of teleport
sclevine May 13, 2024
4a81d9d
add step to ensure free disk space
sclevine May 13, 2024
da27831
Typos
sclevine May 13, 2024
994865d
Update 0169-auto-updates-linux-agents.md
sclevine May 23, 2024
052c490
Update 0169-auto-updates-linux-agents.md
sclevine May 23, 2024
be4956b
feedback
sclevine May 28, 2024
c1784a7
Update 0169-auto-updates-linux-agents.md
sclevine May 28, 2024
511bf59
Update 0169-auto-updates-linux-agents.md
sclevine May 29, 2024
a1316cd
apt purge
sclevine May 29, 2024
f6bab8b
Only enable auto-upgrades if successful
sclevine May 29, 2024
6f55658
reentrant lock
sclevine May 29, 2024
d3e5b09
reset
sclevine May 29, 2024
3555212
Update 0169-auto-updates-linux-agents.md
sclevine May 31, 2024
f820b52
add note on backups
sclevine Jun 4, 2024
88bdda4
Update 0169-auto-updates-linux-agents.md
sclevine Jun 6, 2024
f98258c
Update 0169-auto-updates-linux-agents.md
sclevine Jun 6, 2024
00a1ea0
Clarify restore/rollback process and validations
sclevine Jun 10, 2024
7dd1144
Added section on logging
sclevine Jun 10, 2024
345d103
Add schedules
sclevine Jul 9, 2024
a022fd5
immediate schedule + note on cycles and chains
sclevine Jul 9, 2024
9e6090f
more details, more tctl commands
sclevine Jul 10, 2024
3f5721c
Update 0169-auto-updates-linux-agents.md
sclevine Jul 11, 2024
46a7a2a
scalability
sclevine Jul 29, 2024
6f62e3d
df
sclevine Jul 29, 2024
b86a1ce
content-length
sclevine Jul 29, 2024
7587fa5
cache init
sclevine Jul 29, 2024
0e90455
binary
sclevine Jul 29, 2024
3cabeb8
more rollout mechanism changes
sclevine Aug 2, 2024
0d492f8
scalability
sclevine Aug 7, 2024
de53461
more scalability
sclevine Aug 7, 2024
34a82cd
use 100kib pages for plan
sclevine Aug 8, 2024
5a62d6b
Add RPCs, tweak API design
sclevine Aug 13, 2024
0362cd1
clarify wording
sclevine Aug 13, 2024
1070927
wording
sclevine Aug 13, 2024
7b384ff
Update rfd/0169-auto-updates-linux-agents.md
sclevine Aug 13, 2024
4b03c02
Update rfd/0169-auto-updates-linux-agents.md
sclevine Aug 13, 2024
beb7c97
linting
sclevine Aug 13, 2024
a6403ee
Move all RPCs into autoupdate/v1
sclevine Aug 22, 2024
568e0fe
Move groups to MVP
sclevine Aug 26, 2024
797b790
note about checksum
sclevine Aug 26, 2024
1b90a34
typos, consistency
sclevine Aug 27, 2024
dc20017
clarify binary is teleport-update, package is teleport-ent-updater
sclevine Aug 27, 2024
c065060
switch from df to unix.Statfs
sclevine Aug 28, 2024
9bcd324
security feedback + naming adjustments
sclevine Sep 4, 2024
e748820
tweak rollout paging
sclevine Sep 6, 2024
4f93a7f
tweak rollout paging again
sclevine Sep 6, 2024
aff1df3
feedback
sclevine Sep 10, 2024
c91977f
adjust update.yaml to match implementation feedback
sclevine Sep 10, 2024
ec8d675
wip - new model
sclevine Sep 24, 2024
7c89fb6
canaries
sclevine Sep 25, 2024
2b95f8e
canary 2
sclevine Sep 25, 2024
ce6de47
describe state, transitions, and proxy response
hugoShaka Sep 25, 2024
26c43b0
rpcs
sclevine Sep 25, 2024
7d0f618
finish rpcs
sclevine Sep 25, 2024
69d758c
minor tweaks
sclevine Sep 27, 2024
430b7a4
Add user stories
hugoShaka Sep 30, 2024
4ac0e9c
Put new requirements at the top + edit UX + add TODOs
hugoShaka Sep 30, 2024
e87e3dc
Edition work
hugoShaka Oct 2, 2024
fecefc7
cleanup + swap phases 1 and 2
sclevine Oct 2, 2024
e7b1c10
Move protobuf
hugoShaka Oct 2, 2024
2a5515e
Add installation scenarios
hugoShaka Oct 2, 2024
4b33f2a
cleanup + move backpressure formulas
sclevine Oct 3, 2024
0a0d658
more cleanup
sclevine Oct 3, 2024
267 changes: 267 additions & 0 deletions rfd/0169-auto-updates-linux-agents.md
@@ -0,0 +1,267 @@
---
authors: Stephen Levine (stephen.levine@goteleport.com)
state: draft
---

# RFD 0169 - Automatic Updates for Linux Agents

## Required Approvers

* Engineering: @russjones && @bernardjkim
* Security: @reedloden

## What

This RFD proposes a new mechanism for Teleport agents installed on Linux servers to automatically update to a version set by an operator via tctl.

The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs:
- Analogous adjustments for Teleport agents installed on Kubernetes
- Phased rollouts of new agent versions for agents connected to an existing cluster
- Signing of agent artifacts via TUF
- Teleport Cloud APIs for updating agents

This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217.

Additionally, this RFD parallels the auto-update functionality for client tools proposed in https://github.com/gravitational/teleport/pull/39805.

## Why

The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users.

1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades or confusing command output.
2. The use of system package management requires complex logic for each target distribution.
3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport.
4. The use of bash to implement the updater makes changes difficult and prone to error.
5. The existing auto-updater has limited automated testing.
6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future.
7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl).
8. The rollout plan for the new agent version is not fully configurable using tctl.
9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation.
10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows.
11. The existing auto-updater is not self-updating.
12. Automating agent upgrades with custom tooling (e.g., Jamf) is difficult and undocumented.

We must provide a seamless, hands-off experience for auto-updates that is easy to maintain.

## Details

We will ship a new auto-updater package written in Go that does not interface with the system package manager.
It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually.
Contributor

Where will the code for this tool live? I'm strongly of the opinion that if it's not versioned with Teleport then it shouldn't be in this repo or teleport.e; rather, it should be in its own repo.

Member Author

The teleport-updater binary would also ship in the Teleport tarball, so it could be versioned with Teleport.

That said, it might be nice if the teleport-upgrader package was versioned separately from Teleport, since the package version won't match the version of Teleport installed (or the version of the updater that is updated by the updater, after the initial Teleport update).

I could see this working either way. No strong opinion.

Contributor

I'm confused here - the RFD says:

> It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually.

But here you've said:

> The teleport-updater binary would also ship in the Teleport tarball

and

> it might be nice if the teleport-upgrader package was versioned separately from Teleport

Is the tool going to be versioned with Teleport or not? If not, and if it'll be used to install Teleport, then why would it ship inside a Teleport tarball?

Contributor

I think we should version the updater with Teleport. That's how we version all assets today. Let's stick to that scheme.

Contributor

@russjones is there a benefit to keeping this versioned with Teleport? This has caused problems in the past (in fact I spent hours today working on a problem caused by this), and there is precedent for not versioning customer-facing tools/products with Teleport (see shared-workflows tools, the projects we've forked, and TAG).

Here are the benefits I see of pulling this code out of teleport/teleport.e:

  • It cuts down on release time. If we pulled this out today then every Teleport release would take 15-30m less.
  • It keeps the code from being accidentally coupled with Teleport code. This has bitten me several times in the past two weeks alone.
  • It allows us to release updater changes separately, and makes it more clear to customers when something changes. Updater code has historically changed very infrequently, however customers think that it's changed on every release because we release it with Teleport.
  • Changes to the updater will not need to be backported, unless they are specifically needed for an older version of the updater. The subtle but important difference is that because the updater would likely have few major releases, this would not need to be done for every change.
  • It integrates well with the security model that our tools (i.e. GitHub) use. A compromise of this code in a separate repo has a much smaller impact on Gravitational than a compromise in teleport/teleport.e, as the GH Environments in these repos have access to do significantly more damage.
  • It would allow for faster iteration for both the tool itself, and Teleport. More code in the teleport/teleport.e code bases results in more merge contention (which also indirectly results in increased CI costs).

Contributor

Hey folks this discussion somehow disappeared, so I'm commenting here in hopes that it shows back up.

It will read the unauthenticated `/v1/webapi/ping` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan.
It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`.
Contributor

> ensure it is symlinked from /usr/local/bin

This shouldn't be under /usr/local - the FHS specifically states that the contents of /usr/local "needs to be safe from being overwritten when the system software is updated". A more appropriate directory would be /usr/bin or /usr/sbin, depending on whether or not we consider these binaries to be "used exclusively by the system administrator".

Member Author @sclevine, Apr 18, 2024

I read "system software" as software installed and upgraded via an OS-provided tool like a package manager. I view these updater-managed versions of Teleport as installed by a user who invoked teleport-updater enable to create a local installation of Teleport.

> /usr/local "needs to be safe from being overwritten when the system software is updated"

For this RFD, updating system software using the OS package manager will not touch /usr/local.

I selected /usr/local over /usr to conform with:

> Software placed in / or /usr may be overwritten by system upgrades (though we recommend that distributions do not overwrite data in /etc under these circumstances). For this reason, local software must not be placed outside of /usr/local without good reason.

Seems better not to conflict at a file-level with the non-auto-updating teleport package, which is system software that has a binary in /usr/bin.

Not strongly opinionated, but the OS won't be tracking these files via the package DB, so /usr/local/bin seems closer to the spec to me. Also, it's not unheard of for /usr to be read-only outside of system updates, and /usr/local to remain read-write.

https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s09.html

Contributor

> I read "system software" as software installed and upgraded via an OS-provided tool like a package manager

I agree. However, this whole tool/endpoints/backing infra (on our end) is a package manager system.

> I selected /usr/local over /usr to conform with:
>
> > Software placed in / or /usr may be overwritten by system upgrades

Right - which is exactly why we should place it there. This RFD is for part of a package management system, and one of the functions of this system is to provide Teleport upgrades.

> Seems better not to conflict at a file-level with the non-auto-updating teleport package, which is system software that has a binary in /usr/bin.

IMO in this case we should either:

  • Advise Teleport admins to uninstall the normal Teleport package, or
  • Forcefully uninstall Teleport installed by a system package manager, or
  • Somehow adopt the current installation

> it's not unheard of for /usr to be read-only outside of system updates, and /usr/local to remain read-write

If customers complain about this then we could provide an override for installation directory via the config file. I haven't seen a ton of cases (not that they don't exist) of /usr being read only with /usr/local still writable.

Contributor

It might make sense to poll some customers and get them to weigh in on this (and possibly some other updater implementation details).

Contributor

The prefix at /usr is traditionally under the exclusive control of the system package manager, with /usr/local being for manually installed programs and /opt/<something> for non-distro packages of various kinds. I would not expect a third party package manager to put anything in /usr/bin or /usr/lib.

Contributor

Hey folks, this customer ticket just got added to Internal Tools goals for next quarter. Given that this shows at least one customer making the same argument that I am, I'm reopening this for discussion. I'd like to apply the same decision (one way or the other) to both this RFD and the customer's ticket.

Member Author

The customer is arguing that the teleport package should install the teleport binary to /usr/bin instead of /usr/local/bin, which I agree better conforms to the LSB spec.

Analogously, the upgrader installed by the upgrader package should live in /usr/bin/teleport-updater.

However, the teleport binaries that are installed by the user invoking teleport-updater commands are not managed by the system package manager, so it would not match the spec for them to land in /usr/bin.

Contributor @fheinecke, Jun 13, 2024

It doesn't matter whether a package comes from a package manager provided by the distro, or one that we write. To quote this customer's ticket:

> The FHS specifies that /usr/local is for manually installed software, not for packages managed by a package manager

We've established that Teleport's auto updates system is a package management system. teleport-updater is the local package manager. When the command is invoked, either manually or automatically (as is the primary intention of this work), it installs, upgrades, and removes teleport packages. This is not the user building Teleport themselves and cp-ing the local build to a bin directory - this is our software managing, end to end, the lifecycle of our packages. Therefore, the binaries should be placed under /usr/bin.

On top of all of that, the linked ticket shows that customers expect teleport to be in /usr/bin. It shouldn't matter to our customers how they install Teleport on their OS. The behavior (including the location of files) should be the same, regardless of installation method (for a given OS). Customers are telling us that they don't want teleport installed to this path, so we shouldn't be installing it to this path regardless of installation method.

Member Author

> We've established that Teleport's auto updates system is a package management system.

I disagree that the FHS intends to suggest that any software that installs binaries counts as a package manager. By this definition, make would be a package manager, and make install should also install into /usr/bin.

> the linked ticket shows that customers expect teleport to be in /usr/bin

The linked ticket is referring to the Teleport system package, not a user-configured updater that installs Teleport from tarballs.

Contributor

> I disagree that the FHS intends to suggest that any software that installs binaries counts as a package manager

I'm in agreement about this. FHS makes no mention of package managers at all. However by any reasonable definition of a package manager, teleport-updater is one, as a part of Teleport's auto updates package management system.

Here are several definitions of a "package manager" from several independent sources:

By any of these definitions the auto updates system is a package management system, and the teleport-updater tool is a package manager.

> The linked ticket is referring to the Teleport system package, not a user-configured updater that installs Teleport from tarballs.

Of course you're correct here - it is in reference to the current Teleport package, and could not be in reference to a new system that has yet to be developed or deployed to customers. If you replace RPM(\/| and )DEB with teleport-updater, then it's clear that this issue will apply when the teleport-updater package manager is updating/installing teleport as well.


### Installation

```shell
$ apt-get install teleport-ent-updater
$ teleport-update enable --proxy example.teleport.sh

# if not enabled already, configure teleport and:
$ systemctl enable teleport
```

Contributor

So, we still need a way to bootstrap the first installation:
something that adds the Teleport repo public key and the repo metadata, and then installs and runs the updater.
Is that going to be a new script that will be used everywhere we install teleport?
We already have the oneoff script which could serve this purpose (with some little modifications).

### API

#### Endpoints

`/v1/webapi/ping`
Contributor

For a soft-onboarding, I would assume it would be possible for us to host our own "version" server to return this payload?

i.e:

  • we rely on the new golang updater service to ping our version server and install the tarballs from teleport
  • instead of relying on the CP endpoint

This once again would allow us to opt into updates for certain instances whilst not updating others (until we have the ability to bucket instances)

Contributor

@thameezb For your use case, I would recommend sticking with your current self-managed flow. The only change I would recommend is using tctl autoupdate watch to watch for updates and push that version out to your repositories.

You can switch over to the new system once we have bucketed rollout.

What do you think?

Contributor @thameezb, May 13, 2024

> push that version out to your repositories.

Would the new teleport packages still be published to the existing teleport stable/cloud repos? As currently we use the standard teleport-ent-updater (except that it points to our own version server and not teleport's version server or our CP)

Contributor

> Would the new teleport packages still be published to the existing teleport stable/cloud repos?

If possible it'd be great to not publish/reduce how much we publish to that component. It would reduce publishing time quite a bit.

Contributor @thameezb, May 14, 2024

Then our current flow would break (as current version of teleport-ent-updater uses stable/cloud to install packages).

There would need to be a better migration path until bucketing is supported.

Contributor

@fheinecke not sure why comments are not being added to the thread above, but we would require the packages to exist on stable/cloud as that is what teleport-ent-updater uses

```json
{
  "agent_version": "15.1.1",
  "agent_auto_update": true,
  "agent_update_after": "2024-04-23T18:00:00.000Z",
  "agent_update_jitter_seconds": 10
}
```
Notes:
- Critical updates are achieved by serving `agent_update_after` with the current time.
- The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded.
- If an agent misses an upgrade window, it will always update immediately.
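
The payload and notes above can be sketched in Go as follows. This is a minimal illustration rather than the updater's actual code: the `pingAgentFields` type and the `parsePing`, `pickJitter`, and `updateDelay` functions are hypothetical names. The sketch also assumes that an update is needed whenever the advertised version differs from the installed one, and that the jitter offset is added to `agent_update_after`.

```go
package main

import (
	"encoding/json"
	"math/rand"
	"time"
)

// pingAgentFields mirrors only the agent-related fields shown in the
// example payload. The type name is hypothetical.
type pingAgentFields struct {
	AgentVersion             string    `json:"agent_version"`
	AgentAutoUpdate          bool      `json:"agent_auto_update"`
	AgentUpdateAfter         time.Time `json:"agent_update_after"`
	AgentUpdateJitterSeconds int64     `json:"agent_update_jitter_seconds"`
}

// parsePing decodes the body of a /v1/webapi/ping response.
func parsePing(body []byte) (pingAgentFields, error) {
	var p pingAgentFields
	err := json.Unmarshal(body, &p)
	return p, err
}

// pickJitter returns a random offset in [0, maxSeconds), or 0 when
// jitter is disabled.
func pickJitter(maxSeconds int64) int64 {
	if maxSeconds <= 0 {
		return 0
	}
	return rand.Int63n(maxSeconds)
}

// updateDelay reports how long to wait before installing p.AgentVersion.
// ok is false when updates are disabled or the advertised version is
// already installed (an assumption; the RFD does not spell out the
// comparison). A target time in the past yields a zero wait, matching
// the note that a missed window results in an immediate update.
func updateDelay(p pingAgentFields, installed string, now time.Time, jitterSeconds int64) (wait time.Duration, ok bool) {
	if !p.AgentAutoUpdate || p.AgentVersion == installed {
		return 0, false
	}
	target := p.AgentUpdateAfter.Add(time.Duration(jitterSeconds) * time.Second)
	if wait = target.Sub(now); wait < 0 {
		wait = 0
	}
	return wait, true
}
```

A caller would fetch the ping body, call `parsePing`, pick a jitter once with `pickJitter(p.AgentUpdateJitterSeconds)`, and sleep for the duration returned by `updateDelay` before installing.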

#### Teleport Resources

```yaml
kind: cluster_maintenance_config
spec:
  # agent_auto_update allows turning agent updates on or off at the
  # cluster level. Only turn agent automatic updates off if self-managed
  # agent updates are in place.
  agent_auto_update: on|off
  # agent_update_hour sets the hour in UTC at which clients should update their agents.
  # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades.
  agent_update_hour: -1-23
  # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour.
  # The agent upgrader will pick a random time within this duration in which to upgrade.
  agent_update_jitter_seconds: 0-MAXINT64

[...]
```
```
$ tctl autoupdate update --set-agent-auto-update=off
Automatic updates configuration has been updated.
$ tctl autoupdate update --set-agent-update-hour=3
Automatic updates configuration has been updated.
$ tctl autoupdate update --set-agent-update-jitter-seconds=600
Automatic updates configuration has been updated.
```
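
The value ranges documented for `cluster_maintenance_config` could be enforced with a check along these lines. This is a hedged Go sketch: `maintenanceSpec` and `validateSpec` are hypothetical names chosen for illustration, and the real resource validation may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// maintenanceSpec is a hypothetical in-memory form of the three spec
// fields documented above.
type maintenanceSpec struct {
	AgentAutoUpdate          string // "on" or "off"
	AgentUpdateHour          int    // -1 (immediate) through 23, in UTC
	AgentUpdateJitterSeconds int64  // 0 or greater
}

// validateSpec returns nil when every field is within its documented
// range, and otherwise a joined error listing each violation.
func validateSpec(s maintenanceSpec) error {
	var errs []error
	if s.AgentAutoUpdate != "on" && s.AgentAutoUpdate != "off" {
		errs = append(errs, fmt.Errorf("agent_auto_update must be on or off, got %q", s.AgentAutoUpdate))
	}
	if s.AgentUpdateHour < -1 || s.AgentUpdateHour > 23 {
		errs = append(errs, fmt.Errorf("agent_update_hour must be between -1 and 23, got %d", s.AgentUpdateHour))
	}
	if s.AgentUpdateJitterSeconds < 0 {
		errs = append(errs, fmt.Errorf("agent_update_jitter_seconds must not be negative, got %d", s.AgentUpdateJitterSeconds))
	}
	return errors.Join(errs...)
}
```

Collecting all violations with `errors.Join` (Go 1.20 or later) lets an operator fix every invalid field in one pass instead of one error at a time.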

```yaml
kind: autoupdate_version
spec:
  # agent_version is the version of the agent the cluster will advertise.
  # Can be auto (match the version of the proxy) or an exact semver formatted
  # version.
  agent_version: auto|X.Y.Z

[...]
```
```
$ tctl autoupdate update --set-agent-version=15.1.1
Automatic updates configuration has been updated.
```

Contributor

Would it be possible to allow cloud customers to configure this? It would allow us to safely opt into versions at our own pace (until we can bucket instances into PROD/PREPROD etc)

Member Author

Cloud performs cluster upgrades, and we'd like to ensure that the cluster and agent versions are always extensively tested and compatible. Uncoordinated cluster and agent upgrades could lead to incompatible versions and lost access to resources.

> bucket instances into PROD/PREPROD etc

It sounds like you want to roll out new agent versions across the same cluster in buckets. Is this correct?

If so, I do have an idea for bucketed agent upgrades. It would look something like this:

  • Add bucket IDs that are settable via teleport-updater enable --bucket 2
  • Bucket time offsets would be user-configurable via tctl in cluster_maintenance_config
  • /v1/webapi/ping would return separate times for each bucket

Would this work for your use case?

I may open a separate RFD to add this in the future (depending on user feedback)

Contributor @thameezb, Apr 21, 2024

> It sounds like you want to roll out new agent versions across the same cluster in buckets. Is this correct?

That is correct yes.

> Would this work for your use case?

It can work (stable and latest buckets would be sufficient for us TBH), where all of our PREPROD agents are enrolled into latest and our PROD into stable. This is currently the system we use with our custom Teleport version server

However due to past issues with Teleport agent auto-upgrading, we would not be able to onboard onto this new approach unless we have the ability to control the roll-out of updates to our agents. Either as defined above, or something akin to #40190 (comment), while the approach defined above allows for bucketing agent updates

Notes:
- These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout.
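
How the proxy might turn the `agent_version` setting into the version it advertises can be sketched as below. The function name `resolveAgentVersion` is hypothetical, and the sketch assumes only plain `X.Y.Z` versions are accepted; whether pre-release or build suffixes are allowed is not stated in this RFD.

```go
package main

import (
	"fmt"
	"regexp"
)

// plainVersion matches versions of the form X.Y.Z. Pre-release and
// build metadata suffixes are deliberately not handled in this sketch.
var plainVersion = regexp.MustCompile(`^\d+\.\d+\.\d+$`)

// resolveAgentVersion returns the agent version to advertise: the
// proxy's own version for "auto", otherwise the configured version
// after a format check.
func resolveAgentVersion(configured, proxyVersion string) (string, error) {
	if configured == "auto" {
		return proxyVersion, nil
	}
	if !plainVersion.MatchString(configured) {
		return "", fmt.Errorf("agent_version must be auto or X.Y.Z, got %q", configured)
	}
	return configured, nil
}
```

For example, `resolveAgentVersion("auto", "15.1.1")` yields the proxy's version, while `resolveAgentVersion("15.0.0", "15.1.1")` pins agents to 15.0.0.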

### Filesystem

```
$ tree /var/lib/teleport
/var/lib/teleport
└── versions
    ├── 15.0.0
    │   ├── bin
    │   │   ├── ...
    │   │   ├── teleport-updater
    │   │   └── teleport
    │   └── etc
    │       ├── ...
    │       └── systemd
    │           └── teleport.service
    ├── 15.1.1
    │   ├── bin
    │   │   ├── ...
    │   │   ├── teleport-updater
    │   │   └── teleport
    │   └── etc
    │       ├── ...
    │       └── systemd
    │           └── teleport.service
    └── updates.yaml
$ ls -l /usr/local/bin/teleport
/usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport
$ ls -l /usr/local/bin/teleport-updater
/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater
$ ls -l /usr/local/lib/systemd/system/teleport.service
/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service
```
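
One way to repoint the symlinks shown above at a new version directory is to create a temporary link and rename it over the old one, so the path never disappears. This is a sketch under assumptions: the RFD text here does not say how links are switched, and `switchActive` and `currentTarget` are hypothetical helpers.

```go
package main

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// switchActive points link at <versionsDir>/<version>/bin/<name>.
// It creates a temporary symlink next to link and renames it into
// place; on Linux, rename replaces an existing symlink atomically.
func switchActive(versionsDir, version, name, link string) error {
	target := filepath.Join(versionsDir, version, "bin", name)
	tmp := link + ".tmp"
	// Clear any stale temporary link left by an interrupted run.
	if err := os.Remove(tmp); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, link)
}

// currentTarget returns the path that link currently points to.
func currentTarget(link string) (string, error) {
	return os.Readlink(link)
}
```

With this approach, a call such as `switchActive("/var/lib/teleport/versions", "15.1.1", "teleport", "/usr/local/bin/teleport")` would leave either the old or the new link in place at every moment, never a missing binary.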

updates.yaml:
```
version: v1
proxy: mytenant.teleport.sh
enabled: true
active_version: 15.1.1
```

### Runtime
Contributor @russjones, Apr 5, 2024

Can you create another section called ### Installer?

One of the biggest problems I saw during our push to get customers to switch to automatic updates was the number of different ways we have to install Teleport.

I'd like us to consolidate on a single way to install agents that 90% of our customers use. Ideally anything we build should work for self-hosted and Cloud similar to client tools updates.

The good news is I think you have the answer.

$ apt-get install teleport-updater
$ teleport-updater enable --proxy proxy.example.com

This works for Cloud, but would also work for self-hosted customers. We just have to work through the edge cases.

Otherwise, what's left?

  • We need to make sure all of our install scripts are updated to use the new installation method above, for example the discover script and curl | bash on our downloads page. Please enumerate all the scripts/locations that will need to be updated.
  • We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPS installations.
  • We need to make sure the teleport-updater is available for all the different architectures we support (I think this is already the case).
  • We update our documentation to reference the new installation instructions. 1 2 3 4
  • Dashboard tenants downloads tab is updated to reference new instructions.

Let's make sure that's in-scope for this RFD.

Contributor

@russjones russjones Apr 5, 2024

We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPS installations.

I think we can put this information into the ping endpoint. Then you no longer have to worry about which binary you need to install (OSS, Enterprise, FIPS Enterprise); you just install the teleport-updater and it downloads and runs the right version.

Contributor

Something we should consider: the apt-get install && teleport-updater enable approach works fine if you are running the commands on the host (or running something like Ansible), but what would we recommend to someone who builds an AMI using this new install method?

Member Author

I think I've covered everything above:

  • Added new "Installers" section with scripts changes and VM/container image instructions
  • Added new "Documentation" section with links and notes
  • New server_edition field is added to ping, along with associated changes to "Runtime" section logic

Let me know if this looks good.

Contributor

@hugoShaka hugoShaka Apr 18, 2024

If we make this new installer the default for everything (which I think is a great approach, we must reduce complexity) I would suggest going further by making the new binary the only recommended way of installing teleport, even without Automatic Updates.

We could name it teleport-version-manager or whatever name that doesn't imply it requires AUs and expose the following commands:

  • tvm update as already described in the RFD
  • tvm follow/tvm unfollow to follow updates from a proxy, this is the current RFD enable/disable.
  • tvm install <X.Y.Z|teleport.example.com> would be an additional command not enabling AUs, but installing a static version (given, or from a proxy).

I think this would answer

what would we recommend to someone that builds an AMI using this new install method

We would recommend "install tvm, and run tvm install teleport.example.com (for non-AU setups) or tvm follow teleport.example.com (for AU) on startup".

Member Author

@sclevine sclevine Apr 19, 2024

Would tvm use-version vX.Y.Z or tvm use vX.Y.Z be better?

tvm use sounds better to me, but I don't have a strong opinion.

Happy to rename. I'll wait for a few others to chime in, and change it when we have some consensus -- names can be contentious 🙂

Contributor

Even though I really want to refactor our installation process, looking at our capacity for Q2, I don't think we can.

So I am in favor of keeping it simple and solving the issue for Cloud customers first, then coming back in Q3 and fixing our installation problems as a whole.

Here is what I am thinking.

  • We keep the name teleport-updater to not introduce too many changes at once.
  • Everywhere we are currently installing our auto updater, we only update it to install the new one.
  • Scope out self-hosted installation.

What do you guys think?

Member Author

👍 Works for me. Maybe there are two phases here:

  • Ship this RFD, which minimizes user-facing changes
  • Re-design UX, simplify scripts, etc.

Contributor

@hugoShaka hugoShaka May 17, 2024

If we postpone the simplification we must absolutely ensure we do it the next quarter.

The installation methods are already a mess and we're adding new methods, instructions and scripts. This is not maintainable and self-hosted customers are already complaining about how hard it is to just install teleport. This is not an issue we can solve with docs. We've tried for the last year and the installation docs only became worse and more obscure since I joined Teleport 2 years ago.

I'm against adding new ways without removing the old ones but I fear that we'll do it anyway.

Contributor

Why did we scope out self-hosted installations?
Is there extra work for those scenarios? Except for docs around the autoupdate_version resource?

Adding more complexity to distinguish between cloud/self-hosted in the installer scripts seems like we are shifting the complexity to something that is way harder to test/maintain.


The `teleport-updater` will run as a periodically executing systemd service that runs every 10 minutes.
The systemd service will run:
```shell
$ teleport-updater update
```

After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command:
```shell
$ teleport-updater enable --proxy mytenant.teleport.sh
```

If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used.

On servers without Teleport installed already, the `enable` subcommand will change the behavior of `teleport-updater update` to update Teleport and restart the existing agent, if running.
It will also update Teleport immediately, to ensure that subsequent executions succeed.

The `enable` subcommand will:
1. Configure `updates.yaml` with the current proxy address and set `enabled` to true.
2. Query the `/v1/webapi/ping` endpoint.
3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit.
4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12).
5. Download the desired Teleport tarball specified by `agent_version`.
6. Verify the checksum.
7. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
8. Replace any existing binaries or symlinks with symlinks to the current version.
9. Restart the agent if the systemd service is already enabled.
10. Set `active_version` in `updates.yaml` if successful or not enabled.
11. Replace the old symlinks or binaries and quit (exit 1) if unsuccessful.
12. Remove any `teleport` package if installed.
13. Verify the symlinks to the active version still exist.
14. Remove all stored versions of the agent except the current version and last working version.
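
Steps 8 to 11 above (swap the symlinks, restart, roll back on failure) could be sketched as follows. This is a minimal illustration in Python, not the planned Go implementation. The helper names `link_version` and `update_with_rollback` and the `restart` callback are hypothetical, and the sketch only handles the case where the existing path is already a symlink; restoring a replaced regular binary would additionally need a backup copy.

```python
import os
from typing import Callable, Optional


def link_version(link_path: str, target: str) -> Optional[str]:
    """Atomically point link_path at target; return the previous symlink target, if any."""
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    # Create the new symlink under a temporary name, then rename it into place.
    # The rename is atomic on POSIX, so the link is never missing in between.
    tmp_path = f"{link_path}.tmp-{os.getpid()}"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)
    os.replace(tmp_path, link_path)
    return previous


def update_with_rollback(link_path: str, new_target: str,
                         restart: Callable[[], None]) -> bool:
    """Swap the symlink, restart the agent, and restore the old link on failure."""
    previous = link_version(link_path, new_target)
    try:
        restart()  # hypothetical hook, e.g. a wrapper around a systemd restart
    except Exception:
        if previous is not None:
            link_version(link_path, previous)
        return False
    return True
```

A caller would treat a `False` return as the "quit (exit 1)" case and leave `active_version` in `updates.yaml` unchanged.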

The `disable` subcommand will:
1. Configure `updates.yaml` to set `enabled` to false.

When the `update` subcommand is otherwise executed, it will:
1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set.
2. Query the `/v1/webapi/ping` endpoint.
3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true.
4. If the current version of Teleport is the latest, quit.
5. Wait `random(0, agent_update_jitter_seconds)` seconds.
6. Download the desired Teleport tarball specified by `agent_version`.
7. Verify the checksum.
8. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
9. Update symlinks to point at the new version.
10. Restart the agent if the systemd service is already enabled.
11. Set `active_version` in `updates.yaml` if successful or not enabled.
12. Replace the old symlink or binary and quit (exit 1) if unsuccessful.
13. Remove all stored versions of the agent except the current version and last working version.
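
The gating logic in steps 1 to 5 above can be sketched as three small functions. This is an illustrative Python sketch under stated assumptions, not the Go implementation: the `UpdaterConfig` and `PingResponse` types and the function names are hypothetical, the ping field names are taken from the text but the response shape is assumed, and the sketch assumes the updater quits without updating when the check in step 3 fails. "Latest" is modeled as simple inequality with `agent_version`, which would also permit downgrades.

```python
import random
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UpdaterConfig:
    # Mirrors the fields shown in updates.yaml.
    enabled: bool
    proxy: str
    active_version: str


@dataclass
class PingResponse:
    # Field names follow the RFD text; the exact response shape is assumed.
    agent_version: str
    agent_auto_updates: bool
    agent_update_after: datetime
    agent_update_jitter_seconds: int


def check_config(cfg: UpdaterConfig) -> Optional[int]:
    """Step 1: return an exit code to quit with, or None to continue."""
    if not cfg.enabled:
        return 0
    if not cfg.proxy:
        return 1
    return None


def should_update(cfg: UpdaterConfig, ping: PingResponse, now: datetime) -> bool:
    """Steps 3 and 4: update only when allowed, due, and not already current."""
    if not ping.agent_auto_updates:
        return False
    if now <= ping.agent_update_after:
        return False
    return cfg.active_version != ping.agent_version


def jitter_delay(ping: PingResponse, rng=random) -> float:
    """Step 5: pick a delay in seconds between 0 and the advertised jitter."""
    return rng.uniform(0, ping.agent_update_jitter_seconds)
```

Keeping these checks free of I/O makes them easy to unit test separately from the download, extraction, and restart steps.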

To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different.
The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios.
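
One way to decide whether a reexec is needed is to compare the resolved path of the running updater with the updater binary under the active version. The sketch below is a hypothetical Python illustration (the function name `reexec_target` is not from the RFD); it assumes the `/var/lib/teleport/versions/VERSION/bin/teleport-updater` layout shown earlier and compares paths rather than version strings.

```python
import os
from typing import Optional

VERSIONS_DIR = "/var/lib/teleport/versions"


def reexec_target(active_version: str, running_path: str,
                  versions_dir: str = VERSIONS_DIR) -> Optional[str]:
    """Return the updater path to reexec, or None if no reexec is needed."""
    if not active_version:
        return None  # nothing recorded yet, keep running the current binary
    wanted = os.path.join(versions_dir, active_version, "bin", "teleport-updater")
    # Resolve symlinks on both sides so that an invocation through the
    # /usr/local/bin symlink counts as already running the active version.
    if os.path.realpath(running_path) == os.path.realpath(wanted):
        return None
    return wanted
```

When a path is returned, the caller could hand control over with `os.execv(target, [target] + sys.argv[1:])`.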

### Manual Workflow

For use cases that fall outside of the functionality provided by `teleport-updater`, such as Jamf or Ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.

Cluster administrators who want to self-manage agent updates will be
able to get and watch for changes to agent versions, which can then be
used to trigger other integrations to update the installed version of agents.

```shell
$ tctl autoupdate watch
{"agent_version": "1.0.0"}
{"agent_version": "1.0.1"}
{"agent_version": "2.0.0"}
[...]
```

```shell
$ tctl autoupdate get
{"agent_version": "2.0.0"}
```
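
An integration consuming the `watch` output might look like the sketch below. It is a hypothetical Python example that assumes the output is one JSON object per line with an `agent_version` key, as in the sample above; `follow_versions` and the `on_change` callback are illustrative names, not part of `tctl`.

```python
import json
from typing import Callable, Iterable, Optional


def follow_versions(lines: Iterable[str],
                    on_change: Callable[[str], None]) -> Optional[str]:
    """Call on_change(version) whenever agent_version changes; return the last one seen."""
    current = None
    for raw in lines:
        raw = raw.strip()
        if not raw.startswith("{"):
            continue  # skip blank lines and anything that is not a JSON object
        version = json.loads(raw).get("agent_version")
        if version and version != current:
            current = version
            on_change(version)
    return current
```

In practice, `lines` could be the standard output of a `tctl autoupdate watch` subprocess, and `on_change` could trigger an Ansible run or similar tooling.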

### Scripts

All scripts will install the latest updater and run `teleport-updater enable` with the proxy address.

Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport.

This is out-of-scope for this proposal.

## Security

The initial version of automatic updates will rely on TLS to establish
connection authenticity to the Teleport download server. The authenticity of
assets served from the download server is out of scope for this RFD. Cluster
administrators concerned with the authenticity of assets served from the
download server can use self-managed updates with system package managers which
are signed.

The Update Framework (TUF) will be used to implement secure updates in the future.

## Execution Plan
Contributor

Could we build some kind of observability around this and use this as a success criteria?

  • for the users (so they know that teleport has been installed by the updater, and they know if the updater was used in one-shot or is actively following the teleport cluster version)
  • for cloud operations, so we know the tenant state and can send the appropriate communications/perform the right maintenance
  • for product decision so we can measure the new installer/updater adoption for customers with dial-home:
    • report the number of updater-installed teleport agents
    • main distro variant / kernel version
    • updater version?

I think we already have some kind of environment variable that teleport can pick up to report on which kind of environment it's running in. Maybe we could extend this mechanism and have the unified updater set metadata.

Contributor

Additional telemetry notes: we have an existing env var set by the updater and reported by the agent in the heartbeats. This allows detecting which agents are enrolled in AUs and raising alerts if some agents are not. The new updater must continue setting this variable, or we need to have the agent check for updates.yaml and report whether enabled is true.

Member Author

Will add a section on telemetry. What is the env var?


1. Implement new auto-updater in Go.
2. Prep documentation changes.
3. Release new updater via teleport-ent-updater package.
4. Release documentation changes.
Contributor

After implementing the updater we should spend some time extensively testing how the upgrade from the existing APT setups behaves.

This list should also describe the cloud rollout strategy for existing customers (with or without AUs) so we don't cause the same tenant fragmentation as the last time.
