Restart Agents Automatically On Secret Updates #1121

inFocus7 · 2025-11-20T22:50:21Z

note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from #1085, but surprised it's generating on my pr 3 weeks after the merge 🤔

Changes

Hashing secret alongside config-hash annotation for agent pod, so when a referenced secret updates, it restarts
Added SecretHash status on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliation

With these changes…

When a Secret updates, a ModelConfig will update its status to reflect new hash.
ModelConfig updates resource version
The agent watching over modelConfig sees resource update
Agent reconciles, updating the annotation on the pod.
Agent restarts, loading in new secrets

Golden Test Changes - Notes

The outputs for golden test annotations have not changed, because the annotation hash relies on the modelconfig status which has Secret updates (hash). Modelconfig needs to reconcile for status, and does not reconcile in test, so byte{} (no change) is written to the hash.

Context

With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll need a UX-friendly way for agents to update with the latest Secret (e.g. cert rotation, api key change) without requiring users to manually restart the agent.

Note: We can’t rely on dynamic volume mounting, as the ca cert is read on agent start so that it configures the cached client. The api key also needed a way for its update to propagate to the agent.

Demo

steps

agent restart validation steps.md

video

agent-restart-work.mp4

Alternative Solutions

feedback wanted

Per-Secret Status

Instead of hashing all secrets into a single final hash to store in the ModelConfig’s status, we could store a status per-Secret.

For example, the status would change from: ```yaml
status:
[…]
SecretHash: XYZ


to something like
```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123

I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates.

We would need to see how we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now.

Avoiding Restart Requirement

We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used.

Pros:

Avoiding restart requirement
Cons:
Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy)
We won’t be able to do any validation checks like we do now on startup.

Resolves #1091

…m modelconfig to agent Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

inFocus7 · 2025-11-21T16:23:13Z

Ah golden test failures due to the new hash annotation. Makes sense, will fix.

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

Copilot

Pull request overview

This PR implements automatic agent restarts when secrets referenced by ModelConfigs are updated. The solution involves computing a hash of referenced secrets (API key and TLS CA certificates) and storing it in the ModelConfig status. This hash is then propagated to agent pod annotations, triggering pod restarts when secrets change.

Key Changes

Added SecretHash field to ModelConfig status to track changes in referenced secrets
Modified agent translator to include model config secret hash in pod annotations
Updated ModelConfig reconciler to compute and track secret hashes, triggering status updates
Added test coverage for the hash computation function ensuring deterministic output

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
go/api/v1alpha2/modelconfig_types.go	Added SecretHash field to ModelConfigStatus
go/internal/controller/reconciler/reconciler.go	Implemented secret hash computation and status tracking logic
go/internal/controller/reconciler/reconciler_test.go	Added comprehensive tests for hash computation
go/internal/controller/translator/agent/adk_api_translator.go	Updated agent translation to propagate secret hash to pod annotations
go/internal/controller/modelconfig_controller.go	Enhanced secret reference checking to include TLS certificates
go/config/crd/bases/kagent.dev_modelconfigs.yaml	Added secretHash field to CRD schema and duplicate validation rule
helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml	Updated Helm CRD template with same changes as base CRD
helm/kagent-crds/templates/kagent.dev_agents.yaml	Added affinity, nodeSelector, and tolerations fields (auto-generated)
go/internal/controller/translator/agent/testdata/outputs/*.json	Updated golden test outputs with empty secret-hash annotation

Comments suppressed due to low confidence (2)

go/config/crd/bases/kagent.dev_modelconfigs.yaml:652

Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.

            - message: caCertSecretKey requires caCertSecretRef
              rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'
            - message: caCertSecretKey requires caCertSecretRef (unless disableVerify
                is true)
              rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
                && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'

helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml:652

Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.

            - message: caCertSecretKey requires caCertSecretRef
              rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'
            - message: caCertSecretKey requires caCertSecretRef (unless disableVerify
                is true)
              rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
                && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

go/internal/controller/reconciler/reconciler.go

go/internal/controller/translator/agent/adk_api_translator.go

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

EItanya · 2025-11-24T22:00:35Z

Why is this change necessary? The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

onematchfox · 2025-11-25T07:39:31Z

The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

This feels like the right way of doing this to me at least. A file watcher would need to be implemented within the agent itself; but also doesn't cater for all use cases - API keys are not mounted as files at least.

EItanya · 2025-11-25T15:24:41Z

The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

This feels like the right way of doing this to me at least. A file watcher would need to be implemented within the agent itself; but also doesn't cater for all use cases - API keys are not mounted as files at least.

The more I thought about it the more I think you are correct actually, going to re-review this today

EItanya

My main comment here is that I think a separate secretHash is probably unnecessary. configHash is meant to represent the state of all inputs which created this particular iteration of an agent, therefore I think we should just add this to the existing hash rather than creating a new one. What do you think?

inFocus7 · 2025-11-25T16:44:22Z

(copied from earlier message)

They're similar, but the existing config hash annotation is the hashed adk.AgentConfig which doesn't include any model information (aside from model type).

It's why i explicitly added agent to parameters, since there's also "ModelConfig", and the generic "config" naming confused me for a second

Alternatives:

I could create a new structure for this to also include secret data and hash it all together {AgentConfig, SecretData).
I could expand the AgentConfig to add secret data, but this config defined in both go & python, so need to look into what this impacts overall

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

EItanya

Overall this is looking great, just the one last comment

EItanya · 2025-12-01T13:42:51Z

go/api/v1alpha2/modelconfig_types.go

 	Conditions         []metav1.Condition `json:"conditions"`
 	ObservedGeneration int64              `json:"observedGeneration"`
+	// The secret hash stores a hash of any secrets required by the model config (i.e. api key, tls cert) to ensure agents referencing this model config detect changes to these secrets and restart if necessary.
+	SecretHash string `json:"secretHash,omitempty"`


Right now the main reconciler will run for secret changes anyway, so this isn't strictly necessary. Also, right now the agent reconciler doesn't wait for ModelConfig resources to be accepted, which would have to change in order for this to work. For now can we just move the Hash computation into the agent reconciler directly and then we can separate it later if we need to?

So prefer the change below:?

Current Updates

Agent watches ModelConfigs

ModelConfig watches Secrets

ModelConfig reconciles if Secret in its spec.TLS/API updates

ModelConfig updates Status using Secret hash (new)

Agent reconciles due to update of ModelConfig

Agent uses ModelConfig status for pod annotation (new)

"An Agent reconciles because its ModelConfig updates. A ModelConfig updates because a Secret it uses updates."

New Implementation

Agent watches ModelConfigs and Secrets (new)

Because the Agent is now also watching Secret, should we update its status with an error if a secret DNE? Or leave that for the ModelConfig resource?

Agent reconciles if a Secret in its spec.ModelConfig.TLS/API updates (new)

Agent uses calculated Secret hash for pod annotation (new)

"An Agent reconciles because a Secret in its ModelConfig updates."

Ok, so I was mistaken about the current code. I assumed that the agent already watched secrets, but it does not, which is I guess part of the bug being fixed 😆. In which case the current code is correct, and this comment can be ignored I think.

note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from kagent-dev#1085, but surprised it's generating on my pr 3 weeks after the merge 🤔 # Changes - Hashing secret alongside config-hash annotation for agent pod, so when a referenced secret updates, it restarts - Added `SecretHash` status on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliation <img width="2067" height="1464" alt="image" src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f" /> With these changes… 1. When a Secret updates, a ModelConfig will update its status to reflect new hash. 2. ModelConfig updates resource version 3. The agent watching over modelConfig sees resource update 4. Agent reconciles, updating the annotation on the pod. 5. Agent restarts, loading in new secrets ## Golden Test Changes - Notes The outputs for golden test annotations have _not_ changed, because the annotation hash relies on the modelconfig status which has Secret updates (hash). Modelconfig needs to reconcile for status, and does not reconcile in test, so `byte{}` (no change) is written to the hash. # Context With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll need a UX-friendly way for agents to update with the latest Secret (e.g. cert rotation, api key change) without requiring users to manually restart the agent. Note: We can’t rely on dynamic volume mounting, as the ca cert is read on agent start so that it configures the cached client. The api key also needed a way for its update to propagate to the agent. ## Demo _steps_ [agent restart validation steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md) _video_ https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b ## Alternative Solutions _feedback wanted_ ### Per-Secret Status Instead of hashing all secrets into a single final hash to store in the ModelConfig’s status, we could store a status per-Secret. For example, the status would change from: ```yaml status: […] SecretHash: XYZ ``` to something like ```yaml status: […] Secrets: APIKey: Hash/Version: 123 TLS: Hash/Version: 123 ``` I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates. We would need to see _how_ we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now. ### Avoiding Restart Requirement We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used. Pros: - Avoiding restart requirement Cons: - Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy) - We won’t be able to do any validation checks like we do now on startup. --- Resolves kagent-dev#1091 --------- Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io> Signed-off-by: Ivan Porta <porta.ivan@outlook.com>

inFocus7 added 2 commits November 20, 2025 17:37

add pod annotation for agent secret restart + status to propagate fro…

dc42ed1

…m modelconfig to agent Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

gen

0b16949

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

golden test update

43d6f58

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

inFocus7 marked this pull request as ready for review November 21, 2025 17:32

Copilot AI review requested due to automatic review settings November 21, 2025 17:32

inFocus7 requested review from EItanya, ilackarms and yuval-k as code owners November 21, 2025 17:32

Copilot started reviewing on behalf of inFocus7 November 21, 2025 17:33 View session

Copilot finished reviewing on behalf of inFocus7 November 21, 2025 17:34

Copilot AI reviewed Nov 21, 2025

View reviewed changes

go/internal/controller/reconciler/reconciler.go Show resolved Hide resolved

go/internal/controller/reconciler/reconciler.go Show resolved Hide resolved

go/internal/controller/translator/agent/adk_api_translator.go Outdated Show resolved Hide resolved

inFocus7 added 2 commits November 21, 2025 13:09

remove todo comment

11f9817

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

aggregate multiple errors for modelconfig reconciliation

a08357e

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

EItanya requested changes Nov 25, 2025

View reviewed changes

inFocus7 added 2 commits November 25, 2025 12:59

set up secret in config-hash annotation

927d6b9

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

remove unused ModelConfigSecretHash from resolvedDeployment struct

16304a3

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>

inFocus7 requested a review from EItanya November 25, 2025 18:23

EItanya reviewed Dec 1, 2025

View reviewed changes

EItanya approved these changes Dec 1, 2025

View reviewed changes

EItanya merged commit f2461bb into kagent-dev:main Dec 1, 2025
19 checks passed

Restart Agents Automatically On Secret Updates #1121

Restart Agents Automatically On Secret Updates #1121

Uh oh!

Conversation

inFocus7 commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Golden Test Changes - Notes

Context

Demo

Alternative Solutions

Per-Secret Status

Avoiding Restart Requirement

Uh oh!

inFocus7 commented Nov 21, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Key Changes

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

EItanya commented Nov 24, 2025

Uh oh!

onematchfox commented Nov 25, 2025

Uh oh!

EItanya commented Nov 25, 2025

Uh oh!

EItanya left a comment

Choose a reason for hiding this comment

Uh oh!

inFocus7 commented Nov 25, 2025

Uh oh!

EItanya left a comment

Choose a reason for hiding this comment

Uh oh!

EItanya Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

inFocus7 Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

EItanya Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

inFocus7 commented Nov 20, 2025 •

edited

Loading

inFocus7 Dec 1, 2025 •

edited

Loading