feat(controller): Add support for Agent tolerations and affinities #1085
Conversation
Pull Request Overview
This PR adds support for Kubernetes scheduling attributes (tolerations, affinity, and nodeSelector) to agent deployments in the kagent system. These attributes allow users to control pod placement on nodes based on constraints, preferences, and taints/tolerations.
Key changes:
- Added three new optional fields to `SharedDeploymentSpec`: `Tolerations`, `Affinity`, and `NodeSelector`
- Updated the translator to propagate these scheduling attributes from the API spec to the Kubernetes Deployment manifest
- Generated CRD updates to include the new scheduling fields with full Kubernetes API definitions
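As a concrete illustration of the translator change, here is a hedged sketch of the slice of the generated Deployment it would now populate. The shape is standard `apps/v1`; the values are invented, not taken from this PR's golden test files:

```yaml
# Illustrative output only: the translator copies these fields from the
# Agent spec into the Deployment's pod template.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - key: dedicated
          operator: Equal
          value: agents
          effect: NoSchedule
```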
Reviewed Changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `go/api/v1alpha2/agent_types.go` | Added three optional scheduling fields to `SharedDeploymentSpec` struct |
| `go/api/v1alpha2/zz_generated.deepcopy.go` | Auto-generated deep copy methods for the new scheduling fields |
| `go/internal/controller/translator/agent/adk_api_translator.go` | Updated translator to copy scheduling attributes from spec to resolved deployment and apply them to the pod template |
| `go/config/crd/bases/kagent.dev_agents.yaml` | Auto-generated CRD schema updates with full Kubernetes scheduling API definitions |
| `go/go.mod` | Moved two dependencies from indirect to direct |
| `go/go.sum` | Added dependency entry for `mattn/go-runewidth` |
| `go/internal/controller/translator/agent/testdata/inputs/agent_with_scheduling_attributes.yaml` | New test input with scheduling attributes configuration |
| `go/internal/controller/translator/agent/testdata/outputs/agent_with_scheduling_attributes.json` | Expected output showing scheduling attributes applied to deployment manifest |
Force-pushed from `dc3a8ad` to `b301aba`.
**EItanya** left a comment:
I initially got scared by the sheer size of this PR, but after looking at it I realized it's actually quite simple and clean! Thanks so much for the PR :)
note: unsure where the affinity template updates came from, but they get generated with the gen makefile target. maybe from #1085, but surprised it's generating on my pr 3 weeks after the merge 🤔

# Changes

- Hashing the secret alongside the config-hash annotation for the agent pod, so when a referenced Secret updates, the pod restarts
- Added a `SecretHash` status on ModelConfig so that changes to underlying referenced Secrets are propagated (resource version updates) to Agent reconciliation

<img width="2067" height="1464" alt="image" src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f" />

With these changes…

1. When a Secret updates, its ModelConfig updates its status to reflect the new hash.
2. The ModelConfig's resource version changes.
3. The Agent watching the ModelConfig sees the resource update.
4. The Agent reconciles, updating the annotation on the pod.
5. The Agent restarts, loading in the new secrets.

## Golden Test Changes - Notes

The outputs for golden test annotations have _not_ changed, because the annotation hash relies on the ModelConfig status, which carries the Secret updates (hash). The ModelConfig needs to reconcile to populate that status, and it does not reconcile in tests, so `byte{}` (no change) is written to the hash.

# Context

With the addition of TLS CAs to ModelConfigs, it became apparent we'll need a UX-friendly way for agents to pick up the latest Secret (e.g. cert rotation, API key change) without requiring users to manually restart the agent.

Note: We can't rely on dynamic volume mounting, as the CA cert is read on agent start so that it configures the cached client. The API key also needed a way for its updates to propagate to the agent.

## Demo

_steps_ [agent restart validation steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md)

_video_ https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b

## Alternative Solutions

_feedback wanted_

### Per-Secret Status

Instead of hashing all secrets into a single final hash stored in the ModelConfig's status, we could store a status per Secret. For example, the status would change from:

```yaml
status:
  […]
  SecretHash: XYZ
```

to something like

```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123
```

I avoided this to simplify status tracking; it's less wordy than adding a field per Secret, especially if we expand on referenced Secrets in the future. But this approach does give users a better way to track exactly where changes occurred, and it could avoid hashing entirely by using each Secret's resource version for updates. We would need to decide _how_ to propagate this to the agent pod annotations: adding an annotation per Secret vs. computing a single hash for the pod as we do for the status now.

### Avoiding Restart Requirement

We should be able to avoid the restart needed for agents to pick up the secrets. For instance, right now we mount a Volume for the TLS CA and use its file to configure the client at start, which is then cached. We could remove the client caching so that updated data from volume mounts is caught and used.

Pros:
- Avoids the restart requirement

Cons:
- Not caching the client would have some performance impact, as it would need to be recreated per call (maybe not a big deal, but noteworthy)
- We won't be able to do any validation checks like we do now on startup

---

Resolves #1091

---------

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
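To make the restart mechanism above concrete, here is a minimal sketch of the pod-template annotation involved; the annotation key and hash value are hypothetical, not taken from this PR:

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        # Hypothetical key: combined hash of the agent config and referenced
        # Secrets. When a Secret changes, the controller recomputes this value;
        # the pod template changes, so the Deployment rolls out a new pod.
        kagent.dev/config-hash: "3f5a9c…"
```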
Agent configurations lack scheduling knobs.

Adds `Tolerations`, `Affinity`, and `NodeSelector` options to the `Agent` CRD.

Example:
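A minimal sketch of an Agent using the new fields. The nesting of the `SharedDeploymentSpec` fields under `spec.deployment` is an assumption for illustration; consult the generated CRD (`kagent.dev_agents.yaml`) for the exact schema:

```yaml
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: example-agent
spec:
  # ...model and other agent configuration omitted...
  deployment:  # assumed location of the SharedDeploymentSpec fields
    nodeSelector:
      kubernetes.io/arch: amd64
    tolerations:
      - key: dedicated
        operator: Equal
        value: agents
        effect: NoSchedule
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values: [us-east-1a]
```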