
@GTRekter GTRekter commented Dec 4, 2025

First version of the Linkerd OSS Agent. It enables users to inject Linkerd proxies and use Linkerd CLI subcommands to inspect certificates and check control-plane and data-plane health.

Diagnostics commands are included to simplify troubleshooting of policies, endpoints, and profiles.

Tools PR: kagent-dev/tools#31

GTRekter and others added 15 commits December 4, 2025 23:54
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Needed to manually copy/paste some HTML to get some older data to show
what that looks like :)

<img width="517" height="673" alt="image"
src="https://github.com/user-attachments/assets/b590058d-624a-443f-b818-14989ede9e7d"
/>

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Note: unsure where the affinity template updates came from, but they get generated with the gen Makefile target. Maybe from kagent-dev#1085, but surprised it's generating on my PR 3 weeks after the merge 🤔

# Changes
- Hash referenced Secrets alongside the config-hash annotation on the agent pod, so the pod restarts when a referenced Secret updates
- Add a `SecretHash` status on ModelConfig so that changes to underlying referenced Secrets are propagated (via resource version updates) to Agent reconciliation

<img width="2067" height="1464" alt="image"
src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f"
/>

With these changes…
1. When a Secret updates, the ModelConfig updates its status to reflect the new hash.
2. The ModelConfig's resource version changes.
3. The Agent watching the ModelConfig sees the resource update.
4. The Agent reconciles, updating the annotation on the pod (sketched below).
5. The Agent pod restarts, loading in the new Secrets.
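
For illustration, the hashing in step 4 could look roughly like the sketch below; `computeSecretsHash`, the package name, and the exact inputs are assumptions, not the actual kagent code:

```go
package controller

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// computeSecretsHash folds the data of every referenced Secret into one
// stable hash. The reconciler writes this value into the pod template's
// annotations, so any change to a referenced Secret rolls the agent pod.
func computeSecretsHash(secrets []corev1.Secret) string {
	h := sha256.New()
	for _, s := range secrets {
		// Sort keys so the hash is deterministic across reconciles.
		keys := make([]string, 0, len(s.Data))
		for k := range s.Data {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			h.Write([]byte(s.Namespace + "/" + s.Name + "/" + k))
			h.Write(s.Data[k])
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

When the computed value differs from the annotation on the running pod, the Deployment rollout restarts the agent (step 5).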

## Golden Test Changes - Notes

The outputs for the golden test annotations have _not_ changed, because the annotation hash relies on the ModelConfig status, which carries the Secret hash. The ModelConfig must reconcile for its status to update, and it does not reconcile in tests, so an empty `[]byte{}` (no change) is written to the hash.

# Context

With the addition of TLS CAs to ModelConfigs, it became apparent we'll need a UX-friendly way for agents to pick up the latest Secret (e.g. cert rotation, API key change) without requiring users to manually restart the agent.

Note: We can't rely on dynamic volume mounting alone, as the CA cert is read at agent start to configure the cached client. The API key also needed a way for its updates to propagate to the agent.

## Demo

_steps_

[agent restart validation
steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md)

_video_

https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b

## Alternative Solutions

_feedback wanted_

### Per-Secret Status

Instead of hashing all Secrets into a single final hash stored in the ModelConfig's status, we could store a per-Secret status.

For example, the status would change from:
```yaml
status:
  […]
  SecretHash: XYZ
```

to something like:
```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123
```

I avoided this in order to simplify status tracking; a single hash is less wordy than a field per Secret, especially if we expand the set of referenced Secrets in the future. However, the per-Secret approach gives users a better way to track exactly where changes occurred, and could avoid hashing entirely by using each Secret's resource version to detect updates.

We would need to work out _how_ we'd propagate this to the agent pod annotations: an annotation per Secret vs. a single hash for the pod, as we do for the status now.

### Avoiding Restart Requirement

We should be able to avoid the restart needed for agents to pick up updated Secrets. For instance, right now we mount a Volume for the TLS CA and use its file to configure the client at startup, which is then cached. We could remove the client caching so that updated data from volume mounts is picked up and used (see the sketch after the list below).

Pros:
- Avoids the restart requirement

Cons:
- Not caching the client would have some performance impact, as it would need to be recreated per call (maybe not a big deal, but noteworthy)
- We won't be able to run the validation checks we currently do on startup.
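
A rough sketch of the non-caching variant, assuming the CA is still volume-mounted (the path, package, and helper name are illustrative):

```go
package client

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// caCertPath is where the TLS CA Secret would be volume-mounted; the
// exact path is an assumption for illustration.
const caCertPath = "/etc/kagent/tls/ca.crt"

// newHTTPClient builds a client per call instead of caching one at
// startup, so a rotated CA in the mounted Secret is picked up on the
// next request without restarting the agent.
func newHTTPClient() (*http.Client, error) {
	caPEM, err := os.ReadFile(caCertPath)
	if err != nil {
		return nil, fmt.Errorf("reading CA cert: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no valid certs found in %s", caCertPath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}
```

Rebuilding the transport per call also drops pooled connections, which is the performance cost noted above.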

---

Resolves kagent-dev#1091

---------

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…ent-dev#1137)

Split this out of kagent-dev#1133 to try to reduce the size of that PR, but also because it's not strictly related to being able to scale the controller; it simply manifested when needing to switch to Postgres to run multiple controller replicas.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…-dev#1140)

Another artifact of kagent-dev#1133. No need for the SQLite volume and mount when the database is set to Postgres.

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…te database (kagent-dev#1144)

Running multiple controller replicas with a local SQLite database will lead to errors, as API requests will inevitably end up being handled by a replica that does not have the local state (e.g. an A2A session). This check/error hopefully prevents users from making this mistake.
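
The guard could look roughly like this (the field names and `"sqlite"` literal are assumptions, not the exact implementation):

```go
package config

import "fmt"

// validateDatabaseConfig refuses to run more than one controller
// replica against a local SQLite file, since each pod would otherwise
// hold its own disconnected state.
func validateDatabaseConfig(databaseType string, replicas int) error {
	if databaseType == "sqlite" && replicas > 1 {
		return fmt.Errorf(
			"sqlite does not support %d controller replicas; use postgres for multi-replica deployments",
			replicas,
		)
	}
	return nil
}
```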

Split out from kagent-dev#1133

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Enables local testing using Postgres as a backing store for the controller.

Split out from kagent-dev#1133 (with added docs).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
**Yet another PR split out from kagent-dev#1133 to try to reduce review burden** - keeping that one open for now, as all of these other PRs are ultimately working towards that goal.

This PR refactors the kagent controller to support the use of environment variables for configuration, in addition to command-line arguments. It also updates the Helm chart to use env vars instead of command-line args, and adds the ability for users to supply their own environment variables with custom configuration. This allows users to supply sensitive configuration (e.g. the Postgres database URL) via Secrets instead of exposing it via `args`. Env vars are also easier to patch when working with rendered manifests, if needed.
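
The pattern amounts to letting an environment variable back each flag; a minimal sketch, with `KAGENT_DATABASE_URL` as a hypothetical variable name:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the named environment variable, or def if
// it is unset. Flags keep working, but an env var now supplies the
// default, so the Helm chart can source it from a Secret.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

func main() {
	dbURL := flag.String("database-url",
		envOr("KAGENT_DATABASE_URL", ""),
		"database connection string")
	flag.Parse()
	fmt.Println("database url configured:", *dbURL != "")
}
```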

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…n guidelines, update README (kagent-dev#1142)

Expands the internal documentation to help users participate in the project.

---------

Signed-off-by: Sam Heilbron <samheilbron@gmail.com>
Signed-off-by: Sam Heilbron <SamHeilbron@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
This PR enables leader election on the controller if it is configured with more than 1 replica, to ensure that only 1 replica is actively reconciling watched manifests. It also ensures that the necessary RBAC manifests are created.
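
In controller-runtime terms this amounts to something like the following sketch; the election ID and namespace are illustrative:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// enableLeaderElection would be derived from the configured replica
	// count; hardcoded here for illustration.
	enableLeaderElection := true

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// With leader election on, only the lease holder runs the
		// reconcile loops; other replicas stand by until the lease moves.
		LeaderElection:          enableLeaderElection,
		LeaderElectionID:        "kagent-controller-leader",
		LeaderElectionNamespace: "kagent",
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // reconcilers would be registered here before mgr.Start()
}
```

Leader election in controller-runtime acquires a coordination.k8s.io Lease by default, which is why the extra RBAC manifests are needed.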

Final part of kagent-dev#1133 (excluding kagent-dev#1138).

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
…econcilation (kagent-dev#1138)

**Decided to split this out of kagent-dev#1133 to try to make review a little easier, as it's a chunky commit that can live in isolation from the rest of the changes in that PR**

This change separates A2A handler registration from the main `Agent`
controller reconciliation loop by introducing a dedicated `A2ARegistrar`
that manages the A2A routing table independently from the main
controller.

Currently, A2A handler registration is tightly coupled to the `Agent`
controller's reconciliation loop, which performs the following
operations:
1. Reconcile Kubernetes resources (Deployment, Service, etc.)
2. Store agent metadata in database
3. Register A2A handler in routing table
4. Update resource status

This coupling is problematic for a number of reasons:
1. Breaks horizontal scaling - with leader election enabled (required to
prevent duplicate reconciliation), only the leader pod performs
reconciliation and registers A2A handlers. When API requests hit
non-leader replicas, they fail because those replicas lack the necessary
handler registrations.
2. It could be argued that this violates separation of concerns - the controller handles both cluster resource management (its core responsibility) and API routing configuration (an orthogonal concern).
3. Makes future architectural changes (e.g., splitting API and control
plane) unnecessarily complex.

This PR addresses those concerns by ensuring that all controller replicas, when scaled, maintain consistent A2A routing tables, enabling transparent load balancing across replicas. A2A logic is also consolidated into a dedicated package rather than scattered across controller code, giving a clean separation of API and control plane so that the two could be split into independent deployments without significant refactoring in the future.
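
Conceptually, the registrar looks something like the sketch below; type and method names are illustrative, not the exact code:

```go
package a2a

import (
	"context"
	"sync"
)

// Handler is a stand-in for whatever interface serves A2A requests.
type Handler interface {
	Serve(ctx context.Context, req []byte) ([]byte, error)
}

// A2ARegistrar maintains the A2A routing table on every replica by
// reacting to Agent events directly, instead of piggybacking on the
// leader-only reconcile loop.
type A2ARegistrar struct {
	mu       sync.RWMutex
	handlers map[string]Handler // agent name -> handler
}

// Register is invoked from an event handler that runs on all replicas,
// leader or not, so the routing tables stay consistent.
func (r *A2ARegistrar) Register(name string, h Handler) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.handlers == nil {
		r.handlers = make(map[string]Handler)
	}
	r.handlers[name] = h
}

// Unregister drops an agent's handler when the Agent is deleted.
func (r *A2ARegistrar) Unregister(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.handlers, name)
}
```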

---------

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
)

Signed-off-by: jiangdong <jiangdong@iflytek.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: jiangdong <jiangdong@iflytek.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Signed-off-by: Ivan (이반) Porta <porta.ivan@outlook.com>
```makefile
helm $(HELM_ACTION) kagent helm/kagent \
	--namespace kagent \
	--create-namespace \
	--history-max 2 \
```

any reason for removing this?

```yaml
kind: RemoteMCPServer
apiGroup: kagent.dev
toolNames:
- k8s_create_resource
```

there's a lot of tools here (we recommend <20). Is there anything that could be removed?


EItanya commented Dec 10, 2025

Given that this PR is blocked on kagent-dev/tools#34, do you think we can move this into draft for now?
