-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFD 0169: Automatic Updates for Agents #40190
Conversation
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
1 similar comment
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
active_version: 15.1.1 | ||
``` | ||
|
||
### Runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create another section called ### Installer
?
One of the biggest problems I saw during during our push to get customers to switch to automatic updates was the number of different ways we have to install Teleport.
I'd like us to consolidate on a single way to install agents that 90% of our customers use. Ideally anything we build should work for self-hosted and Cloud similar to client tools updates.
The good news I think you have the answer.
$ apt-get install teleport-updater
$ teleport-updater enable --proxy proxy.example.com
This works for Cloud, but would also work for self-hosted customers. We just have to work through the edge cases.
Otherwise, what's left?
- We need to make sure all of our install scripts are updated to use the new installation method above, for example the discover script and
curl | bash
on our downloads page. Please enumerate all the scripts/locations that will need to be updated. - We need to make sure the
teleport-updater
package can support OSS, Enterprise, and Enterprise FIPs installations. - We need to make sure the
teleport-updater
is available for all the different architectures we support (I think this is already the case). - We update our documentation to reference the new installation instructions. 1 2 3 4
- Dashboard tenants downloads tab is updated to reference new instructions.
Let's make sure that's in-scope for this RFD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPs installations.
I think we can put this information into the ping
endpoint. Then you no longer have to worry about which binary you need to install (OSS, Enterprise, FIPS Enterprise), you just install the teleport-updater
and it downloads and runs this right version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we should consider: the apt-get install && teleport-updater enable
approach works fine if you running the commands on the host (or running something like Ansible), what would we recommend to someone that builds an AMI using this new install method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've covered everything above:
- Added new "Installers" section with scripts changes and VM/container image instructions
- Added new "Documentation" section with links and notes
- New
server_edition
field is added toping
, along with associated changes to "Runtime" section logic
Let me know if this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make this new installer the default for everything (which I think is a great approach, we must reduce complexity) I would suggest going further by making the new binary the only recommended way of installing teleport, even without Automatic Updates.
We could name it teleport-version-manager
or whatever name that doesn't imply it requires AUs and expose the following commands:
tvm update
as already described in the RFDtvm follow
/tvm unfollow
to follow updates from a proxy, this is the current RFDenable
/disable
.tvm install <X.Y.Z|teleport.example.com>
would be an additional command not enabling AUs, but installing a static version (given, or from a proxy).
I think this would answer
what would we recommend to someone that builds an AMI using this new install method
We would recommend "install tvm, and run tvm install teleport.example.com
(for non-AU setups) or tvm follow teleport.example.com
(for AU) on startup".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would tvm use-version vX.Y.Z or tvm use vX.Y.Z be better?
tvm use
sounds better to me, but I don't have a strong opinion.
Happy to rename. I'll wait for a few others to chime in, and change it when we have some consensus -- names can be contentious 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though I really want to refactor our installation process, looking at our capacity for Q2, I don't think we can.
So I am in-favor of keeping it simple and solving the issue for Cloud customers first then coming back in Q3 and fixing our installation problems as a whole.
Here is what I am thinking.
- We keep the name
teleport-updater
to not introduce too many changes at once. - Everywhere we are currently installing our auto updater, we only update it to install the new one.
- Scope out self-hosted installation.
What do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Works for me. Maybe there are two phases here:
- Ship this RFD, which minimizes user-facing changes
- Re-design UX, simplify scripts, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we postpone the simplification we must absolutely ensure we do it the next quarter.
The installation methods are already a mess and we're adding new methods, instructions and scripts. This is not maintainable and self-hosted customers are already complaining about how hard it is to just install teleport. This is not an issue we can solve with docs. We've tried for the last year and the installation docs only became worse and more obscure since I joined Teleport 2 years ago.
I'm against adding new ways without removing the old ones but I fear that we'll do it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we scope out self-hosted installations?
Is there extra work for those scenarios? Except for docs around the autoupdate_version
resource?
Adding more complexity to distinguish between cloud/self-hosted in the installer scripts seems we are shifting the complexity to something which is way hard to test/maintain.
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass. I think I need to read it a few more times to get up to speed, but sometimes feedback from someone seeing something for the first time is helpful, so here you go.
Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. | ||
The version and edition served from that endpoint will be configured using new `autoupdate_version` resource. | ||
|
||
Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is Teleport updater? (I assume this is a new tool we're building, but this is my first read through of the RFD and the term was introduced without context)
If it is a new program, do we need a separate program? Managing the version of some updater and the version of Teleport simultaneously has proven difficult. Have we considered exposing the updater as a subcommand of Teleport?
If the updater itself was part of Teleport, then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is Teleport updater?
Sorry, this was more clear when the Runtime section was earlier in the document. This is a separate binary, in tools/teleport-update
. The binary is versioned with Teleport, but may be distributed separately from the other Teleport binaries (similar to tbot or tsh).
do we need a separate program?
The updater is intentionally not part of the teleport binary for a few reasons:
teleport-update
is distributed separately from teleport
, and without any of the other binaries, to bootstrap an initial agent installation. Including all of Teleport would mean that security scanners would match all Teleport dependencies in the bootstrap/update artifact that could be vulnerable, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often.
teleport-update
will also handle long-running tbot updates in the future, and I hesitate to distribute teleport
with tbot
.
teleport-update
's API contract with the cluster must be extremely stable. The first GA version of teleport-update
will need to work with Teleport v14+, and should ideally work with all future versions of Teleport. This seems (a bit) easier to manage with a separate binary.
then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.
We will not always have an authenticated connection to auth to receive these instructions. E.g., the agent could be a failed state due to a botched upgrade, temporarily stopped, or newly installed. In the future, tbot-only installations may have expired certificates. Keeping this unauthenticated reduces complexity. The information exposed is minimal, and the API has been reviewed by Doyensec.
(Will add this context to the RFD.)
|
||
The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. | ||
|
||
The following data related to the rollout are stored in each instance heartbeat: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all agents already perform instance heartbeats?
|
||
```shell | ||
$ apt-get install teleport-ent-updater | ||
$ teleport-update enable --proxy example.teleport.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I'm clear, teleport-update enable
would actually download teleport if it's not already installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, yes. This ensures the configuration is valid, and that further updates can be applied. We may want to provide a --no-update
flag in the future, so e.g., AMIs won't need changes to cloud-init to do the initial installation.
updates.yaml: | ||
``` | ||
version: v1 | ||
kind: agent_versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks an awful lot like a teleport resource, but it seems this is just a configuration file for the updater? If so, I might not use the same YAML structure with a version and kind to avoid people thinking that this is a Teleport resource that they need to tctl create
. (Similar to how our access plugins or event handler are configured)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russjones requested this format explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original thread: #40190 (comment)
(I feel like it's confusing as well)
active_version: 15.1.1 | ||
``` | ||
|
||
backup.yaml: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this supposed to do? I'm not sure if I missed something above or just haven't got to it yet, but I'm finding it hard to comprehend with just a snippet of YAML and no other commentary.
Additionally, when I see kind: db_backup
nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll make this more explicit in the section below.
It's metadata about the sqlite DB backup, referred to here: https://github.com/gravitational/teleport/pull/40190/files#diff-4fb277bd3d884c7b045e7efad61111ac4e7957c063e0c0c9281f4e74fd72997fR571
(Users don't need to worry about this file -- it's an implementation detail.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?
This file is not a Teleport resource (see #40190 (comment)). It's machine-local configuration storage for individual Teleport backups. That said, the metadata is descriptive of any backup of Teleport's sqlite DB, so it could describe non-agent data.
The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. | ||
In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local | ||
admin socket (tctl on local machine). | ||
|
||
> [!NOTE] | ||
> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport. | ||
> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be | ||
> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations. | ||
> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. | ||
> | ||
> Solving this problem is out of the scope of this RFD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's defer this note to the Cloud RFD
@@ -479,9 +536,12 @@ However, any changes to `agent_schedules` that occur while a group is active wil | |||
Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. | |||
|
|||
Note that the `default` schedule applies to agents that do not specify a group name. | |||
[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We got feedback that the boolean was confusing during the feedback sessions. No strong opinion. How about last-as-default, but only if no default group is explicitly defined?
days: [ “Sun”, “Mon”, ... | "*" ] | ||
# start_hour specifies the hour when the group may start upgrading. | ||
# default: 0 | ||
start_hour: 0-23 | ||
# wait_days specifies how many days to wait after the previous group finished before starting. | ||
# default: 0 | ||
wait_days: 0-1 | ||
# TODO: is this needed? In which case a customer would need to set a custom jitter? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay to drop the configuration option for now. Thought it might be important for self-hosted customers with many agents. I would like the jitter to continue to be served from the cluster in case we need to adjust it.
@@ -454,23 +512,22 @@ spec: | |||
agent_schedules: | |||
regular: | |||
- name: default | |||
days: ["*"] | |||
days: ["*"] # TODO: restrict to work week? Minus Friday? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was assuming the restriction would be a Cloud-only setting, and self-hosted customers would have full control.
- the installation process is complex and users end up installing the wrong version of Teleport | ||
- the current update process does not provide safeties to protect against broken updates | ||
- many customers are not adopting the existing updater because they want to control when updates happen | ||
- we don't offer a ni |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(missing words)
I create a new deployment, with a broken version. The version is deployed to the canaries. | ||
The canaries crash, the updater reverts the update, the agents connect back online and | ||
advertise they rolled-back. The maintenance is stuck until the canaries are running the target version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is zero mention of a canary prior to this and until I scrolled down and realized that canaries were now part of the upgrade plan I had no idea what this text was talking about. Without having a high level understanding for how this system is supposed to work some of this UX section is a bit hard to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RFD is currently being used for Hugo and I to collaborate. We'll have a new RFD open before Thursday.
Continued in #47126 |
This RFD proposes a new mechanism for Teleport agents to automatically update to a version set by an operator via tctl.
Readable