-
Notifications
You must be signed in to change notification settings - Fork 352
Restart Agents Automatically On Secret Updates #1121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart Agents Automatically On Secret Updates #1121
Conversation
…m modelconfig to agent Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
|
Ah golden test failures due to the new hash annotation. Makes sense, will fix. |
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements automatic agent restarts when secrets referenced by ModelConfigs are updated. The solution involves computing a hash of referenced secrets (API key and TLS CA certificates) and storing it in the ModelConfig status. This hash is then propagated to agent pod annotations, triggering pod restarts when secrets change.
Key Changes
- Added
SecretHashfield to ModelConfig status to track changes in referenced secrets - Modified agent translator to include model config secret hash in pod annotations
- Updated ModelConfig reconciler to compute and track secret hashes, triggering status updates
- Added test coverage for the hash computation function ensuring deterministic output
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| go/api/v1alpha2/modelconfig_types.go | Added SecretHash field to ModelConfigStatus |
| go/internal/controller/reconciler/reconciler.go | Implemented secret hash computation and status tracking logic |
| go/internal/controller/reconciler/reconciler_test.go | Added comprehensive tests for hash computation |
| go/internal/controller/translator/agent/adk_api_translator.go | Updated agent translation to propagate secret hash to pod annotations |
| go/internal/controller/modelconfig_controller.go | Enhanced secret reference checking to include TLS certificates |
| go/config/crd/bases/kagent.dev_modelconfigs.yaml | Added secretHash field to CRD schema and duplicate validation rule |
| helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml | Updated Helm CRD template with same changes as base CRD |
| helm/kagent-crds/templates/kagent.dev_agents.yaml | Added affinity, nodeSelector, and tolerations fields (auto-generated) |
| go/internal/controller/translator/agent/testdata/outputs/*.json | Updated golden test outputs with empty secret-hash annotation |
Comments suppressed due to low confidence (2)
go/config/crd/bases/kagent.dev_modelconfigs.yaml:652
- Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.
- message: caCertSecretKey requires caCertSecretRef
rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
> 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
== 0))'
- message: caCertSecretKey requires caCertSecretRef (unless disableVerify
is true)
rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
&& has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
> 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
== 0))'
helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml:652
- Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.
- message: caCertSecretKey requires caCertSecretRef
rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
> 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
== 0))'
- message: caCertSecretKey requires caCertSecretRef (unless disableVerify
is true)
rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
&& has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
> 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
== 0))'
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
|
Why is this change necessary? The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files? |
This feels like the right way of doing this to me at least. A file watcher would need to be implemented within the agent itself; but also doesn't cater for all use cases - API keys are not mounted as files at least. |
The more I thought about it the more I think you are correct actually, going to re-review this today |
EItanya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment here is that I think a separate secretHash is probably unnecessary. configHash is meant to represent the state of all inputs which created this particular iteration of an agent, therefore I think we should just add this to the existing hash rather than creating a new one. What do you think?
|
(copied from earlier message) They're similar, but the existing config hash annotation is the hashed It's why i explicitly added Alternatives:
|
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
EItanya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is looking great, just the one last comment
| Conditions []metav1.Condition `json:"conditions"` | ||
| ObservedGeneration int64 `json:"observedGeneration"` | ||
| // The secret hash stores a hash of any secrets required by the model config (i.e. api key, tls cert) to ensure agents referencing this model config detect changes to these secrets and restart if necessary. | ||
| SecretHash string `json:"secretHash,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now the main reconciler will run for secret changes anyway, so this isn't strictly necessary. Also, right now the agent reconciler doesn't wait for ModelConfig resources to be accepted, which would have to change in order for this to work. For now can we just move the Hash computation into the agent reconciler directly and then we can separate it later if we need to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So prefer the change below:?
Current Updates
- Agent watches ModelConfigs
- ModelConfig watches Secrets
- ModelConfig reconciles if Secret in its spec.TLS/API updates
- ModelConfig updates Status using Secret hash (new)
- Agent reconciles due to update of ModelConfig
- Agent uses ModelConfig status for pod annotation (new)
"An Agent reconciles because its ModelConfig updates. A ModelConfig updates because a Secret it uses updates."
New Implementation
- Agent watches ModelConfigs and Secrets (new)
- Because the Agent is now also watching Secret, should we update its status with an error if a secret DNE? Or leave that for the ModelConfig resource?
- Agent reconciles if a Secret in its spec.ModelConfig.TLS/API updates (new)
- Agent uses calculated Secret hash for pod annotation (new)
"An Agent reconciles because a Secret in its ModelConfig updates."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I was mistaken about the current code. I assumed that the agent already watched secrets, but it does not, which is I guess part of the bug being fixed 😆. In which case the current code is correct, and this comment can be ignored I think.
note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from kagent-dev#1085, but surprised it's generating on my pr 3 weeks after the merge 🤔 # Changes - Hashing secret alongside config-hash annotation for agent pod, so when a referenced secret updates, it restarts - Added `SecretHash` status on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliation <img width="2067" height="1464" alt="image" src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f" /> With these changes… 1. When a Secret updates, a ModelConfig will update its status to reflect new hash. 2. ModelConfig updates resource version 3. The agent watching over modelConfig sees resource update 4. Agent reconciles, updating the annotation on the pod. 5. Agent restarts, loading in new secrets ## Golden Test Changes - Notes The outputs for golden test annotations have _not_ changed, because the annotation hash relies on the modelconfig status which has Secret updates (hash). Modelconfig needs to reconcile for status, and does not reconcile in test, so `byte{}` (no change) is written to the hash. # Context With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll need a UX-friendly way for agents to update with the latest Secret (e.g. cert rotation, api key change) without requiring users to manually restart the agent. Note: We can’t rely on dynamic volume mounting, as the ca cert is read on agent start so that it configures the cached client. The api key also needed a way for its update to propagate to the agent. ## Demo _steps_ [agent restart validation steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md) _video_ https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b ## Alternative Solutions _feedback wanted_ ### Per-Secret Status Instead of hashing all secrets into a single final hash to store in the ModelConfig’s status, we could store a status per-Secret. For example, the status would change from: ```yaml status: […] SecretHash: XYZ ``` to something like ```yaml status: […] Secrets: APIKey: Hash/Version: 123 TLS: Hash/Version: 123 ``` I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates. We would need to see _how_ we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now. ### Avoiding Restart Requirement We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used. Pros: - Avoiding restart requirement Cons: - Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy) - We won’t be able to do any validation checks like we do now on startup. --- Resolves kagent-dev#1091 --------- Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io> Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from #1085, but surprised it's generating on my pr 3 weeks after the merge 🤔
Changes
SecretHashstatus on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliationWith these changes…
Golden Test Changes - Notes
The outputs for golden test annotations have not changed, because the annotation hash relies on the modelconfig status which has Secret updates (hash). Modelconfig needs to reconcile for status, and does not reconcile in test, so
byte{}(no change) is written to the hash.Context
With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll need a UX-friendly way for agents to update with the latest Secret (e.g. cert rotation, api key change) without requiring users to manually restart the agent.
Note: We can’t rely on dynamic volume mounting, as the ca cert is read on agent start so that it configures the cached client. The api key also needed a way for its update to propagate to the agent.
Demo
steps
agent restart validation steps.md
video
agent-restart-work.mp4
Alternative Solutions
feedback wanted
Per-Secret Status
Instead of hashing all secrets into a single final hash to store in the ModelConfig’s status, we could store a status per-Secret.
For example, the status would change from: ```yaml
status:
[…]
SecretHash: XYZ
I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates.
We would need to see how we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now.
Avoiding Restart Requirement
We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used.
Pros:
Cons:
Resolves #1091