Linkerd agent #1157
Conversation
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Needed to manually copy/paste some HTML to get some older data to show what that looks like :)

![screenshot](https://github.com/user-attachments/assets/b590058d-624a-443f-b818-14989ede9e7d)

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. Maybe from kagent-dev#1085, but surprised it's generating on my PR 3 weeks after the merge 🤔

# Changes
- Hashing secrets alongside the config-hash annotation for the agent pod, so when a referenced secret updates, the pod restarts
- Added `SecretHash` status on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliation

![diagram](https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f)

With these changes…
1. When a Secret updates, a ModelConfig updates its status to reflect the new hash.
2. The ModelConfig's resource version changes.
3. The agent watching the ModelConfig sees the resource update.
4. The Agent reconciles, updating the annotation on the pod.
5. The agent pod restarts, loading in the new secrets.

## Golden Test Changes - Notes
The outputs for golden test annotations have _not_ changed, because the annotation hash relies on the ModelConfig status, which carries the Secret updates (hash). The ModelConfig needs to reconcile to populate its status, and it does not reconcile in tests, so `byte{}` (no change) is written to the hash.

# Context
With the addition of TLS CAs to ModelConfigs, it became apparent we'll need a UX-friendly way for agents to pick up the latest Secret (e.g. cert rotation, API key change) without requiring users to manually restart the agent.

Note: We can't rely on dynamic volume mounting, as the CA cert is read on agent start to configure the cached client. The API key also needed a way for its updates to propagate to the agent.

## Demo
_steps_
[agent restart validation steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md)

_video_
https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b

## Alternative Solutions
_feedback wanted_

### Per-Secret Status
Instead of hashing all secrets into a single final hash stored in the ModelConfig's status, we could store a status per Secret. For example, the status would change from:

```yaml
status:
  […]
  SecretHash: XYZ
```

to something like

```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123
```

I avoided this in order to simplify status tracking; it is less wordy than adding a field per secret, especially if we expand on referenced secrets in the future. But the per-Secret approach gives users a better way to track exactly where changes occurred, and could avoid hashing altogether by using each Secret's resource version for updates. We would need to decide _how_ to propagate this to the agent pod annotations: an annotation per secret vs. a single hash for the pod like we do for the status now.

### Avoiding Restart Requirement
We should be able to avoid the restart needed for agents to pick up the secrets. For instance, right now we mount a Volume for the TLS CA and use its file to configure the client at startup, and that client is cached. We could remove the client caching so that updated data from volume mounts is picked up and used.

Pros:
- Avoids the restart requirement

Cons:
- Not caching the client would have some performance impact, as it would need to be recreated per call (maybe not a big deal, but noteworthy)
- We wouldn't be able to do any validation checks like we do now on startup.

---

Resolves kagent-dev#1091

---------

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
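The mechanism above boils down to computing a deterministic digest of the referenced Secrets and stamping it onto the agent pod template so a changed Secret rolls the pod. A minimal sketch of that idea in Go, assuming hypothetical names (`hashSecrets`, `stampPodTemplate`, and the `kagent.dev/secret-hash` annotation key), not the actual kagent code:

```go
package secrethash

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hashSecrets folds the data of every referenced Secret into one digest.
// Keys are sorted because Go map iteration order is randomized.
func hashSecrets(secrets []corev1.Secret) string {
	h := sha256.New()
	for _, s := range secrets {
		keys := make([]string, 0, len(s.Data))
		for k := range s.Data {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			h.Write([]byte(s.Namespace + "/" + s.Name + "/" + k))
			h.Write(s.Data[k])
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

// stampPodTemplate writes the digest into the pod template annotations, so a
// changed Secret changes the pod spec and the Deployment rolls the agent pod.
func stampPodTemplate(d *appsv1.Deployment, digest string) {
	if d.Spec.Template.ObjectMeta.Annotations == nil {
		d.Spec.Template.ObjectMeta.Annotations = map[string]string{}
	}
	// Annotation key is illustrative, not kagent's actual key.
	d.Spec.Template.ObjectMeta.Annotations["kagent.dev/secret-hash"] = digest
}
```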
…ent-dev#1137) Split this out of kagent-dev#1133 to try to reduce the size of that PR - but also because it's not strictly related to being able to scale the controller; it simply manifested when needing to switch to postgres when running multiple controller replicas.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…-dev#1140) Another artifact of kagent-dev#1133. No need for the sqlite volume+mount when database is set to postgres. Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com> Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…te database (kagent-dev#1144) Running multiple controller replicas with a local SQLite database will lead to errors, as API requests will inevitably end up being handled by a replica that does not have the local state (e.g. an A2A session). This check/error hopefully prevents users from making this mistake. Split out from kagent-dev#1133.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
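A sketch of the kind of startup guard described here, assuming a hypothetical `validateDatabaseConfig` helper and that the replica count is known at startup (not the actual kagent check):

```go
package startup

import "fmt"

// validateDatabaseConfig rejects configurations where a per-pod SQLite file
// would be asked to back more than one controller replica.
func validateDatabaseConfig(databaseType string, replicas int32) error {
	if databaseType == "sqlite" && replicas > 1 {
		return fmt.Errorf(
			"sqlite is a local, per-pod database and cannot be shared across %d controller replicas; "+
				"use postgres or scale the controller down to 1 replica", replicas)
	}
	return nil
}
```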
Enables local testing using postgres as a backing store for the controller. Split out from kagent-dev#1133 (with added docs).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
**Yet another PR split out from kagent-dev#1133 to try to reduce review burden** - keeping that one open for now, as all of these other PRs are ultimately working towards that goal.

This PR refactors the kagent controller to support configuration via environment variables in addition to command-line arguments. It also updates the Helm chart to use env vars instead of command-line args and adds the ability for users to supply their own environment variables with custom configuration. This allows users to provide sensitive configuration (e.g. the postgres database URL) via Secrets instead of exposing it via `args`. Env vars are also easier to patch when working with rendered manifests if needed.

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
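One common way to support env vars alongside flags is to let each flag's default come from the environment; a minimal sketch under that assumption (the flag, helper, and env var names are illustrative, not kagent's actual configuration surface):

```go
package config

import (
	"flag"
	"os"
)

// stringFlagWithEnv registers a string flag whose default is taken from the
// named environment variable when it is set, so the same option can be
// supplied via args or via a Secret-backed env var.
func stringFlagWithEnv(fs *flag.FlagSet, name, envKey, fallback, usage string) *string {
	def := fallback
	if v, ok := os.LookupEnv(envKey); ok {
		def = v
	}
	return fs.String(name, def, usage+" (env: "+envKey+")")
}

// Example (hypothetical names):
// databaseURL := stringFlagWithEnv(fs, "database-url", "KAGENT_DATABASE_URL", "", "database connection string")
```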
…n guidelines, update README (kagent-dev#1142) Expands the internal documentation so that users can participate in the project.

---------

Signed-off-by: Sam Heilbron <samheilbron@gmail.com>
Signed-off-by: Sam Heilbron <SamHeilbron@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
This PR enables leader election on the controller when it is configured with more than 1 replica, to ensure that only 1 replica is actively reconciling watched manifests. It also ensures that the necessary RBAC manifests are created.

Final part of kagent-dev#1133 (excluding kagent-dev#1138).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
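For reference, enabling this with controller-runtime is roughly a matter of turning on leader election in the manager options; a sketch under that assumption (the election ID and namespace below are placeholders, and kagent's actual wiring may differ):

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager builds a manager that only runs reconcilers while holding the
// leader lease, so extra replicas stay idle as hot standbys. Lease-based
// election also needs RBAC for coordination.k8s.io leases, as noted above.
func newManager(enableLeaderElection bool) (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          enableLeaderElection,
		LeaderElectionID:        "kagent-controller-leader", // placeholder ID
		LeaderElectionNamespace: "kagent",                   // placeholder namespace
	})
}
```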
…econcilation (kagent-dev#1138)

**Decided to split this out of kagent-dev#1133 to try to make review a little easier, as it's a chunky commit that can live in isolation from the rest of the changes in that PR.**

This change separates A2A handler registration from the main `Agent` controller reconciliation loop by introducing a dedicated `A2ARegistrar` that manages the A2A routing table independently from the main controller.

Currently, A2A handler registration is tightly coupled to the `Agent` controller's reconciliation loop, which performs the following operations:
1. Reconcile Kubernetes resources (Deployment, Service, etc.)
2. Store agent metadata in the database
3. Register the A2A handler in the routing table
4. Update resource status

This coupling is problematic for a number of reasons:
1. It breaks horizontal scaling - with leader election enabled (required to prevent duplicate reconciliation), only the leader pod performs reconciliation and registers A2A handlers. When API requests hit non-leader replicas, they fail because those replicas lack the necessary handler registrations.
2. It arguably violates separation of concerns - the controller handles both cluster resource management (its core responsibility) and API routing configuration (an orthogonal concern).
3. It makes future architectural changes (e.g., splitting API and control plane) unnecessarily complex.

This PR addresses those concerns by ensuring that all controller replicas, when scaled, maintain consistent A2A routing tables, enabling transparent load balancing across replicas. A2A logic is also consolidated into a dedicated package rather than scattered across controller code, giving a clean separation of API and control plane such that these could be split into independent deployments without significant refactoring in future.

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
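A rough sketch of what this decoupling can look like, with hypothetical names (`Registrar`, `Handler`) rather than kagent's actual types: every replica keeps its own in-memory routing table driven by Agent events, while only the leader reconciles cluster resources.

```go
package a2a

import "sync"

// Handler serves A2A traffic for a single agent; its real shape lives in the
// A2A package and is not reproduced here.
type Handler interface {
	Name() string
}

// Registrar owns the A2A routing table on each controller replica,
// independently of the resource-reconciling controller.
type Registrar struct {
	mu       sync.RWMutex
	handlers map[string]Handler // keyed by the Agent's namespace/name
}

func NewRegistrar() *Registrar {
	return &Registrar{handlers: make(map[string]Handler)}
}

// Register is driven by Agent add/update events on every replica.
func (r *Registrar) Register(key string, h Handler) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.handlers[key] = h
}

// Deregister is driven by Agent delete events.
func (r *Registrar) Deregister(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.handlers, key)
}

// Lookup lets the API layer on any replica route an incoming A2A request.
func (r *Registrar) Lookup(key string) (Handler, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	h, ok := r.handlers[key]
	return h, ok
}
```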
Signed-off-by: jiangdong <jiangdong@iflytek.com> Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Force-pushed from 940d471 to a4fb08b
Signed-off-by: Ivan (이반) Porta <porta.ivan@outlook.com>
```makefile
helm $(HELM_ACTION) kagent helm/kagent \
	--namespace kagent \
	--create-namespace \
	--history-max 2 \
```
any reason for removing this?
```yaml
kind: RemoteMCPServer
apiGroup: kagent.dev
toolNames:
- k8s_create_resource
```
There are a lot of tools here (we recommend <20) -- is there anything that could be removed?
Given that this PR is blocked on kagent-dev/tools#34, do you think we can move this into draft for now?
First version of the Linkerd OSS Agent. It enables users to inject Linkerd proxies and use Linkerd CLI subcommands to inspect certificates and check control-plane and data-plane health.
Diagnostics commands are included to simplify troubleshooting of policies, endpoints, and profiles.
Tools PR: kagent-dev/tools#31