
Conversation

@Dhouti Dhouti commented Nov 5, 2025

Agent configurations lack scheduling knobs.

Adds Tolerations, Affinity, & NodeSelector options to the Agent CRD.
Example:

```yaml
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: example
  namespace: test
spec:
  deployment:
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - operator: Exists
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values:
              - linux
```

Copilot AI review requested due to automatic review settings November 5, 2025 05:39

Copilot AI left a comment


Pull Request Overview

This PR adds support for Kubernetes scheduling attributes (tolerations, affinity, and nodeSelector) to agent deployments in the kagent system. These attributes allow users to control pod placement on nodes based on constraints, preferences, and taints/tolerations.

Key changes:

  • Added three new optional fields to SharedDeploymentSpec: Tolerations, Affinity, and NodeSelector
  • Updated the translator to propagate these scheduling attributes from the API spec to the Kubernetes Deployment manifest
  • Generated CRD updates to include the new scheduling fields with full Kubernetes API definitions
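
A minimal sketch of what the new `SharedDeploymentSpec` fields might look like; the field names follow the PR description, but the JSON tags, comments, and surrounding struct layout are assumptions rather than the merged code:

```go
package v1alpha2

import corev1 "k8s.io/api/core/v1"

// SharedDeploymentSpec (excerpt): illustrative sketch of the new scheduling
// fields; the struct's existing fields are omitted here.
type SharedDeploymentSpec struct {
	// NodeSelector constrains the agent pod to nodes whose labels match.
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Tolerations let the agent pod schedule onto nodes with matching taints.
	// +optional
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

	// Affinity holds node/pod affinity and anti-affinity scheduling rules.
	// +optional
	Affinity *corev1.Affinity `json:"affinity,omitempty"`
}
```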

Reviewed Changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `go/api/v1alpha2/agent_types.go` | Added three optional scheduling fields to the `SharedDeploymentSpec` struct |
| `go/api/v1alpha2/zz_generated.deepcopy.go` | Auto-generated deep-copy methods for the new scheduling fields |
| `go/internal/controller/translator/agent/adk_api_translator.go` | Updated the translator to copy scheduling attributes from the spec to the resolved deployment and apply them to the pod template |
| `go/config/crd/bases/kagent.dev_agents.yaml` | Auto-generated CRD schema updates with the full Kubernetes scheduling API definitions |
| `go/go.mod` | Moved two dependencies from indirect to direct |
| `go/go.sum` | Added dependency entry for `mattn/go-runewidth` |
| `go/internal/controller/translator/agent/testdata/inputs/agent_with_scheduling_attributes.yaml` | New test input with a scheduling-attributes configuration |
| `go/internal/controller/translator/agent/testdata/outputs/agent_with_scheduling_attributes.json` | Expected output showing the scheduling attributes applied to the deployment manifest |
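
A rough sketch of the translator change described above: copying the scheduling attributes from the resolved spec onto the Deployment's pod template. The function name, import path, and overall shape are illustrative assumptions, not the actual `adk_api_translator.go` code:

```go
package agent

import (
	appsv1 "k8s.io/api/apps/v1"

	// Assumed module path; the real import path may differ.
	"github.com/kagent-dev/kagent/go/api/v1alpha2"
)

// applyScheduling copies the optional scheduling fields from the agent's
// deployment spec onto the generated Deployment's pod template.
// Illustrative only; not the actual translator function.
func applyScheduling(dep *appsv1.Deployment, spec *v1alpha2.SharedDeploymentSpec) {
	if spec == nil {
		return
	}
	podSpec := &dep.Spec.Template.Spec
	if spec.NodeSelector != nil {
		podSpec.NodeSelector = spec.NodeSelector
	}
	if len(spec.Tolerations) > 0 {
		podSpec.Tolerations = spec.Tolerations
	}
	if spec.Affinity != nil {
		podSpec.Affinity = spec.Affinity
	}
}
```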


Signed-off-by: Dhouti <l.rex.via@gmail.com>
@Dhouti Dhouti force-pushed the feature/controller-agent-tolerations-and-affinities branch from dc3a8ad to b301aba on November 5, 2025 05:57

@EItanya EItanya left a comment


I initially got scared by the sheer size of this PR, but after looking at it I realized it's actually quite simple and clean! Thanks so much for the PR :)

@EItanya EItanya merged commit 7ecca0f into kagent-dev:main Nov 6, 2025
16 checks passed
EItanya pushed a commit that referenced this pull request Dec 1, 2025
Note: unsure where the affinity template updates came from, but they are generated by the `gen` Makefile target. Maybe from #1085, but it's surprising that they are being generated on my PR three weeks after that merge 🤔

# Changes
- Hash referenced Secrets alongside the config-hash annotation on the agent pod, so that when a referenced Secret updates, the pod restarts
- Added a `SecretHash` status field on ModelConfig so that changes to the underlying referenced Secrets are propagated (via resource version updates) to Agent reconciliation

<img width="2067" height="1464" alt="image"
src="https://github.com/user-attachments/assets/a1b74d88-17f8-45fd-b334-cc1f2553a47f"
/>

With these changes:
1. When a Secret updates, the ModelConfig updates its status to reflect the new hash.
2. The ModelConfig's resource version changes.
3. The Agent watching the ModelConfig sees the resource update.
4. The Agent reconciles, updating the annotation on the pod.
5. The agent pod restarts, loading the new Secrets.
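
A minimal sketch of the hashing step in this flow, assuming a sorted-key SHA-256 over the Secret's data; the function name is hypothetical and the actual kagent implementation may hash differently:

```go
package controller

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// hashSecretData computes a stable hash over a Secret's data by walking the
// keys in sorted order. Illustrative only; not the real kagent implementation.
func hashSecretData(secret *corev1.Secret) string {
	keys := make([]string, 0, len(secret.Data))
	for k := range secret.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(secret.Data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

The resulting value could then be written to the ModelConfig's `SecretHash` status and folded into the pod-template annotation alongside the existing config hash, so that any change to a referenced Secret rolls the agent Deployment.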

## Golden Test Changes - Notes

The golden test annotation outputs have _not_ changed, because the annotation hash relies on the ModelConfig status, which carries the Secret hash. The ModelConfig needs to reconcile for that status to be populated, and it does not reconcile in tests, so `byte{}` (no change) is written to the hash.

# Context

With the addition of TLS CAs to ModelConfigs, it became apparent that we'll need a UX-friendly way for agents to pick up the latest Secret (e.g. cert rotation, API key change) without requiring users to manually restart the agent.

Note: We can't rely on dynamic volume mounting, because the CA cert is read at agent start to configure the cached client. The API key also needed a way for its updates to propagate to the agent.

## Demo

_steps_

[agent restart validation
steps.md](https://github.com/user-attachments/files/23664735/agent.restart.validation.steps.md)

_video_



https://github.com/user-attachments/assets/eca62fb4-2ca2-45eb-94ba-7dfd0db5244b


## Alternative Solutions

_feedback wanted_

### Per-Secret Status

Instead of hashing all secrets into a single final hash to store in the
ModelConfig’s status, we could store a status per-Secret.

For example, the status would change from:
```yaml
status:
    […]
    SecretHash: XYZ
```

to something like
```yaml
status:
  […]
  Secrets:
    APIKey:
      Hash/Version: 123
    TLS:
      Hash/Version: 123
```

I avoided this in order to simplify status tracking; it is less verbose than adding a field per Secret, especially if we expand the set of referenced Secrets in the future. But the per-Secret approach does give users a better way to track exactly where a change occurred, and it could avoid hashing entirely by using each Secret's resource version to detect updates.

We would also need to decide _how_ to propagate this to the agent pod annotations: an annotation per Secret vs. a single hash for the pod, as we do for the status now.

### Avoiding Restart Requirement

We should be able to avoid the restart that agents currently need in order to pick up Secret changes. For instance, right now we mount a Volume for the TLS CA and use its file to configure the client at startup, and that client is cached. We could drop the client caching so that updated data from the volume mounts is picked up and used.
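
A small sketch of what the uncached variant could look like: rebuilding the HTTP client from the mounted CA bundle on every call so that a rotated certificate takes effect without a restart. The path and function name are placeholders, not the existing kagent code:

```go
package client

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newHTTPClient rebuilds an HTTP client from the mounted CA bundle each time
// it is called, so a rotated certificate is picked up without restarting the
// agent. caPath is a placeholder for wherever the CA volume is mounted.
func newHTTPClient(caPath string) (*http.Client, error) {
	pem, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates found in %s", caPath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}
```

This illustrates the tradeoff listed in the cons below: the client and cert pool are recreated per call, and there is no single startup point at which to run validation.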

Pros:
- Avoiding restart requirement
Cons:
- Not caching the client would have some performance impact as it would need to be recreated per-call (maybe not a big deal, but noteworthy)
- We won’t be able to do any validation checks like we do now on startup.

---

Resolves #1091

---------

Signed-off-by: Fabian Gonzalez <fabian.gonzalez@solo.io>
GTRekter pushed a commit to GTRekter/kagent that referenced this pull request Dec 4, 2025