Skip to content

Conversation

@inFocus7
Copy link
Contributor

@inFocus7 inFocus7 commented Nov 20, 2025

note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from #1085, but surprised it's generating on my pr 3 weeks after the merge 🤔

Changes

  • Hashing secret alongside config-hash annotation for agent pod, so when a referenced secret updates, it restarts
  • Added SecretHash status on ModelConfig so that changes to underlying referenced secrets are propagated (resource version updates) to Agent reconciliation
image

With these changes…

  1. When a Secret updates, a ModelConfig will update its status to reflect new hash.
  2. ModelConfig updates resource version
  3. The agent watching over modelConfig sees resource update
  4. Agent reconciles, updating the annotation on the pod.
  5. Agent restarts, loading in new secrets

Golden Test Changes - Notes

The outputs for golden test annotations have not changed, because the annotation hash relies on the modelconfig status which has Secret updates (hash). Modelconfig needs to reconcile for status, and does not reconcile in test, so byte{} (no change) is written to the hash.

Context

With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll need a UX-friendly way for agents to update with the latest Secret (e.g. cert rotation, api key change) without requiring users to manually restart the agent.

Note: We can’t rely on dynamic volume mounting, as the ca cert is read on agent start so that it configures the cached client. The api key also needed a way for its update to propagate to the agent.

Demo

steps

agent restart validation steps.md

video

agent-restart-work.mp4

Alternative Solutions

feedback wanted

Per-Secret Status

Instead of hashing all secrets into a single final hash to store in the ModelConfig’s status, we could store a status per-Secret.

For example, the status would change from:
```yaml
status:
[…]
SecretHash: XYZ


to something like
```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123

I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates.

We would need to see how we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now.

Avoiding Restart Requirement

We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used.

Pros:

  • Avoiding restart requirement
    Cons:
  • Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy)
  • We won’t be able to do any validation checks like we do now on startup.

Resolves #1091

…m modelconfig to agent

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
@inFocus7
Copy link
Contributor Author

Ah golden test failures due to the new hash annotation. Makes sense, will fix.

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
@inFocus7 inFocus7 marked this pull request as ready for review November 21, 2025 17:32
Copilot AI review requested due to automatic review settings November 21, 2025 17:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements automatic agent restarts when secrets referenced by ModelConfigs are updated. The solution involves computing a hash of referenced secrets (API key and TLS CA certificates) and storing it in the ModelConfig status. This hash is then propagated to agent pod annotations, triggering pod restarts when secrets change.

Key Changes

  • Added SecretHash field to ModelConfig status to track changes in referenced secrets
  • Modified agent translator to include model config secret hash in pod annotations
  • Updated ModelConfig reconciler to compute and track secret hashes, triggering status updates
  • Added test coverage for the hash computation function ensuring deterministic output

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
go/api/v1alpha2/modelconfig_types.go Added SecretHash field to ModelConfigStatus
go/internal/controller/reconciler/reconciler.go Implemented secret hash computation and status tracking logic
go/internal/controller/reconciler/reconciler_test.go Added comprehensive tests for hash computation
go/internal/controller/translator/agent/adk_api_translator.go Updated agent translation to propagate secret hash to pod annotations
go/internal/controller/modelconfig_controller.go Enhanced secret reference checking to include TLS certificates
go/config/crd/bases/kagent.dev_modelconfigs.yaml Added secretHash field to CRD schema and duplicate validation rule
helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml Updated Helm CRD template with same changes as base CRD
helm/kagent-crds/templates/kagent.dev_agents.yaml Added affinity, nodeSelector, and tolerations fields (auto-generated)
go/internal/controller/translator/agent/testdata/outputs/*.json Updated golden test outputs with empty secret-hash annotation
Comments suppressed due to low confidence (2)

go/config/crd/bases/kagent.dev_modelconfigs.yaml:652

  • Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.
            - message: caCertSecretKey requires caCertSecretRef
              rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'
            - message: caCertSecretKey requires caCertSecretRef (unless disableVerify
                is true)
              rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
                && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'

helm/kagent-crds/templates/kagent.dev_modelconfigs.yaml:652

  • Duplicate validation rule: Lines 643-646 and 647-652 both validate that caCertSecretKey requires caCertSecretRef. The second rule adds the condition "(unless disableVerify is true)" but appears to be redundant with the first rule. The first rule on lines 643-646 should either be removed or the logic should be consolidated into a single validation rule.
            - message: caCertSecretKey requires caCertSecretRef
              rule: '!(has(self.tls) && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'
            - message: caCertSecretKey requires caCertSecretRef (unless disableVerify
                is true)
              rule: '!(has(self.tls) && (!has(self.tls.disableVerify) || !self.tls.disableVerify)
                && has(self.tls.caCertSecretKey) && size(self.tls.caCertSecretKey)
                > 0 && (!has(self.tls.caCertSecretRef) || size(self.tls.caCertSecretRef)
                == 0))'

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
@EItanya
Copy link
Contributor

EItanya commented Nov 24, 2025

Why is this change necessary? The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

@onematchfox
Copy link
Contributor

The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

This feels like the right way of doing this to me at least. A file watcher would need to be implemented within the agent itself; but also doesn't cater for all use cases - API keys are not mounted as files at least.

@EItanya
Copy link
Contributor

EItanya commented Nov 25, 2025

The way that secrets are mounted into the agent pods is all via filesystem mounts, can't we just watch for updates on those files?

This feels like the right way of doing this to me at least. A file watcher would need to be implemented within the agent itself; but also doesn't cater for all use cases - API keys are not mounted as files at least.

The more I thought about it the more I think you are correct actually, going to re-review this today

Copy link
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment here is that I think a separate secretHash is probably unnecessary. configHash is meant to represent the state of all inputs which created this particular iteration of an agent, therefore I think we should just add this to the existing hash rather than creating a new one. What do you think?

@inFocus7
Copy link
Contributor Author

(copied from earlier message)

They're similar, but the existing config hash annotation is the hashed adk.AgentConfig which doesn't include any model information (aside from model type).

It's why i explicitly added agent to parameters, since there's also "ModelConfig", and the generic "config" naming confused me for a second

Alternatives:

  • I could create a new structure for this to also include secret data and hash it all together {AgentConfig, SecretData).
  • I could expand the AgentConfig to add secret data, but this config defined in both go & python, so need to look into what this impacts overall

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
@inFocus7 inFocus7 requested a review from EItanya November 25, 2025 18:23
Copy link
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking great, just the one last comment

Conditions []metav1.Condition `json:"conditions"`
ObservedGeneration int64 `json:"observedGeneration"`
// The secret hash stores a hash of any secrets required by the model config (i.e. api key, tls cert) to ensure agents referencing this model config detect changes to these secrets and restart if necessary.
SecretHash string `json:"secretHash,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the main reconciler will run for secret changes anyway, so this isn't strictly necessary. Also, right now the agent reconciler doesn't wait for ModelConfig resources to be accepted, which would have to change in order for this to work. For now can we just move the Hash computation into the agent reconciler directly and then we can separate it later if we need to?

Copy link
Contributor Author

@inFocus7 inFocus7 Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So prefer the change below:?

Current Updates

  1. Agent watches ModelConfigs
  2. ModelConfig watches Secrets
  3. ModelConfig reconciles if Secret in its spec.TLS/API updates
  4. ModelConfig updates Status using Secret hash (new)
  5. Agent reconciles due to update of ModelConfig
  6. Agent uses ModelConfig status for pod annotation (new)

"An Agent reconciles because its ModelConfig updates. A ModelConfig updates because a Secret it uses updates."

New Implementation

  1. Agent watches ModelConfigs and Secrets (new)
    • Because the Agent is now also watching Secret, should we update its status with an error if a secret DNE? Or leave that for the ModelConfig resource?
  2. Agent reconciles if a Secret in its spec.ModelConfig.TLS/API updates (new)
  3. Agent uses calculated Secret hash for pod annotation (new)

"An Agent reconciles because a Secret in its ModelConfig updates."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I was mistaken about the current code. I assumed that the agent already watched secrets, but it does not, which is I guess part of the bug being fixed 😆. In which case the current code is correct, and this comment can be ignored I think.

@EItanya EItanya merged commit f2461bb into kagent-dev:main Dec 1, 2025
19 checks passed
GTRekter pushed a commit to GTRekter/kagent that referenced this pull request Dec 4, 2025
note: unsure where the affinity template updates came from, but they get
generated with the gen makefile target. maybe from kagent-dev#1085, but surprised
it's generating on my pr 3 weeks after the merge 🤔

# Changes
- Hashing secret alongside config-hash annotation for agent pod, so when
a referenced secret updates, it restarts
- Added `SecretHash` status on ModelConfig so that changes to underlying
referenced secrets are propagated (resource version updates) to Agent
reconciliation

<img width="2067" height="1464" alt="image"
src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f"
/>

With these changes…
1. When a Secret updates, a ModelConfig will update its status to
reflect new hash.
2. ModelConfig updates resource version
3. The agent watching over modelConfig sees resource update
4. Agent reconciles, updating the annotation on the pod.
5. Agent restarts, loading in new secrets

## Golden Test Changes - Notes

The outputs for golden test annotations have _not_ changed, because the
annotation hash relies on the modelconfig status which has Secret
updates (hash). Modelconfig needs to reconcile for status, and does not
reconcile in test, so `byte{}` (no change) is written to the hash.

# Context

With the addition of TLS CA’s to ModelConfigs, it became apparent we’ll
need a UX-friendly way for agents to update with the latest Secret (e.g.
cert rotation, api key change) without requiring users to manually
restart the agent.

Note: We can’t rely on dynamic volume mounting, as the ca cert is read
on agent start so that it configures the cached client. The api key also
needed a way for its update to propagate to the agent.

## Demo

_steps_

[agent restart validation
steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md)

_video_

https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b

## Alternative Solutions

_feedback wanted_

### Per-Secret Status

Instead of hashing all secrets into a single final hash to store in the
ModelConfig’s status, we could store a status per-Secret.

For example, the status would change from:
```yaml
status:
    […]
    SecretHash: XYZ
```

to something like
```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123
```

I avoided this in order to simplify status tracking, less wordy compared to adding a field-per-secret - especially if we expand on referenced secrets in the future. But this manner does provide a better way for users to track where changes occurred exactly, and could avoid needing to do any hashing by using each secret’s resource version for updates.

We would need to see _how_ we’d propagate this to the agent pod annotations: adding annotation-per-secret vs. doing a singular hash for the pod like we do for the status now.

### Avoiding Restart Requirement

We should be able to avoid the restart needed for agents to configure the secrets. For instance, right now we mount a Volume for the TLS CA, and we use its file to configure the client at start which is cached. We could remove the client caching so that updated data from volume mounts are caught and used.

Pros:
- Avoiding restart requirement
Cons:
- Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy)
- We won’t be able to do any validation checks like we do now on startup.

---

Resolves kagent-dev#1091

---------

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
Signed-off-by: Ivan Porta <porta.ivan@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatic agent redeployment when TLS secret changes

3 participants