Adding SGP-related extensible configuration to the model engine helm chart #762
arniechops wants to merge 6 commits into main from
Conversation
Additional Comments (1)
Prompt To Fix With AI
This is a comment left during a code review.
Path: charts/model-engine/templates/proportional_a100_autoscaler_deployment.yaml
Line: 30
Comment:
**Hardcoded image breaks airgapped deployments**
This PR parameterizes ubuntu, busybox, and curl images via `utilityImages` for airgapped/on-prem environments, but the `cluster-proportional-autoscaler` image from `registry.k8s.io` remains hardcoded here and in the a10/t4 autoscaler deployments. In an airgapped environment, `registry.k8s.io` would be unreachable just like the other registries that were parameterized.
Consider adding a `utilityImages.clusterProportionalAutoscaler` entry (or similar) to `values.yaml` and referencing it here:
```suggestion
- image: {{ .Values.utilityImages.clusterProportionalAutoscaler.repository }}:{{ .Values.utilityImages.clusterProportionalAutoscaler.tag }}
```
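A matching `values.yaml` entry might look like the following sketch (the key name `clusterProportionalAutoscaler` mirrors the suggestion above; the repository path and tag are assumptions to verify against the upstream project):

```yaml
utilityImages:
  # Assumed entry; airgapped installs would override repository
  # to point at an internal mirror.
  clusterProportionalAutoscaler:
    repository: registry.k8s.io/cpa/cluster-proportional-autoscaler
    tag: v1.8.9
```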
How can I resolve this? If you propose a fix, please make it concise.
```
{{- with $.Values.automountServiceAccountToken }}
automountServiceAccountToken: {{ . }}
{{- end }}
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: charts/model-engine/templates/service_account.yaml
Line: 24-26
Comment:
**`with` skips block when value is `false`**
Helm's `with` treats `false` as falsy, so when `automountServiceAccountToken` is set to `false` (the exact scenario where you want to disable automount on the ServiceAccount), this block will be skipped entirely and `automountServiceAccountToken` won't be rendered into the YAML. Only when the value is `true` will it appear — which is the opposite of the critical behavior.
In contrast, the pod-level specs in other templates correctly use `{{ .Values.automountServiceAccountToken }}` directly without a `with` gate.
```suggestion
automountServiceAccountToken: {{ $.Values.automountServiceAccountToken }}
```
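Alternatively, if the intent is to render the field only when it is explicitly set, while still honoring `false`, Sprig's `hasKey` distinguishes "unset" from "false". A sketch:

```yaml
{{- if hasKey $.Values "automountServiceAccountToken" }}
automountServiceAccountToken: {{ $.Values.automountServiceAccountToken }}
{{- end }}
```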
How can I resolve this? If you propose a fix, please make it concise.
```
{{- include "modelEngine.labels" . | nindent 4 }}
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-1"
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: charts/model-engine/templates/database_init_job.yaml
Line: 10
Comment:
**Hook weight collision with migration job**
Both `database_init_job.yaml` and `database_migration_job.yaml` use `"helm.sh/hook-weight": "-1"` on `pre-install,pre-upgrade`. This means Helm does not guarantee execution order between them. Since `init_database.py` creates schemas (`create schema if not exists`) and tables (`Base.metadata.create_all`), and the migration job applies Alembic migrations that depend on those schemas existing, the migration could run first and fail.
Consider setting this job's hook weight to `"-2"` (or lower) so it is guaranteed to run before the migration job at weight `"-1"`.
```suggestion
"helm.sh/hook-weight": "-2"
```
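With that change applied, the init job's hook annotations would read as follows (the `hook-delete-policy` line reflects the `before-hook-creation` policy this PR adds across hook jobs):

```yaml
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-2"
  "helm.sh/hook-delete-policy": before-hook-creation
```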
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (2)
Prompt To Fix With AI
This is a comment left during a code review.
Path: charts/model-engine/templates/populate_fine_tuning_repository_job.yaml
Line: 44
Comment:
**Missing `automountServiceAccountToken` setting**
All other pod specs in this PR (gateway, builder, cacher, celery autoscaler, database migration, database init, spellbook init, and the proportional autoscalers) explicitly set `automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}`. This job and `restart_keda_operator.yaml` are the only ones missing it.
When `automountServiceAccountToken` is `false`, the `modelEngine.volumes` helper will inject the projected `token-volume`, but without disabling the default automount on the pod spec, Kubernetes will also mount its own default token — resulting in a conflict at `/var/run/secrets/kubernetes.io/serviceaccount`.
```suggestion
automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}
serviceAccountName: {{ include "modelEngine.fullname" . }}
```
How can I resolve this? If you propose a fix, please make it concise.
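For reference, the projected token fallback described in this PR (CA cert, namespace, short-lived token) typically takes a shape like the sketch below; the volume name `token-volume` comes from the comment above, and the expiration value is an assumption:

```yaml
volumes:
  - name: token-volume
    projected:
      sources:
        # Short-lived service account token (expiration is assumed)
        - serviceAccountToken:
            path: token
            expirationSeconds: 3600
        # Cluster CA certificate
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        # Pod namespace via the downward API
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
```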
Prompt To Fix With AI
This is a comment left during a code review.
Path: charts/model-engine/templates/restart_keda_operator.yaml
Line: 43
Comment:
**Missing `automountServiceAccountToken` setting**
Same inconsistency as `populate_fine_tuning_repository_job.yaml` — this job is missing `automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}` which every other pod spec in this PR has. This will cause issues when `automountServiceAccountToken` is set to `false`.
```suggestion
automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}
serviceAccountName: {{ include "modelEngine.fullname" . }}
```
How can I resolve this? If you propose a fix, please make it concise.
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.
Greptile Summary
This PR introduces extensibility hooks and multi-cloud support to the model engine Helm chart (version bumped to 0.2.0-beta.1), enabling deployment in airgapped, on-prem, and GCP environments alongside the existing AWS and Azure paths.

- Adds `extraEnvVars`, `extraVolumes`, `extraVolumeMounts`, and `extraPodEnvFrom` to allow wrapper charts to inject custom configuration without forking templates
- Replaces hardcoded `public.ecr.aws` references for busybox, curl, and ubuntu with `utilityImages.*` values, enabling private registry mirrors
- Adds an `automountServiceAccountToken` toggle across all deployments/jobs, with a projected token volume fallback (CA cert, namespace, short-lived token) for environments requiring explicit token management
- Adds `redis.authSecretName`/`redis.authSecretKey` for secret-based Redis auth, prioritized over plaintext `redis.auth`
- Adds `azure.inference_client_id` for inference service accounts
- Appends `Release.Revision` to job names and adds a `before-hook-creation` delete policy across all hook jobs for idempotent upgrades
- Updates `service_config_map.yaml` to render boolean config values without quotes using `kindIs "bool"`
- Uses `inferenceFramework.*` values instead of a hardcoded "latest"
- Adds `gpuTolerations` to balloon deployments and service templates

Confidence Score: 4/5
The score reflects two unresolved issues: the `with` gate issue on `automountServiceAccountToken` in `service_account.yaml` (which silently drops the field when set to `false`) and the hook weight collision between `database_init_job` and `database_migration_job`. Both issues have been raised in prior review threads. Files needing the most attention: `charts/model-engine/templates/service_account.yaml` (`automountServiceAccountToken` `with` gate) and `charts/model-engine/templates/database_init_job.yaml` (hook weight ordering with migration job).
Important Files Changed
- `service_account.yaml`: adds `automountServiceAccountToken` to the ServiceAccount resource using a `with` gate. The `with` block is falsy when the value is `false`, meaning the field won't render when disabled — a known issue flagged in a previous review thread.
- Adds a `kindIs "bool"` check to render boolean config values without quotes, applied consistently across all four config sections (launch + infra in both namespaces).
- Adds `extraPodEnvFrom` for inference pods, appends `Release.Revision` to job names, and captures `gpuTolerations` from values.
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[values.yaml] -->|defaults| B[_helpers.tpl]
  A -->|automountServiceAccountToken| C{automount enabled?}
  C -->|true| D[Standard SA token mount]
  C -->|false| E[Projected token volume\nCA cert + namespace + short-lived token]
  B -->|serviceEnvGitTagFromHelmVar| F[Gateway / Builder / Cacher]
  B -->|volumes + volumeMounts| F
  A -->|extraEnvVars| B
  A -->|extraVolumes / extraVolumeMounts| B
  A -->|utilityImages.*| G[Balloon Deployments]
  A -->|utilityImages.*| H[Service Templates\nbusybox / curl]
  A -->|gpuTolerations| G
  A -->|gpuTolerations| H
  A -->|redis.authSecretName| I{Secret-based Redis auth?}
  I -->|yes| J[secretKeyRef for REDIS_AUTH_TOKEN]
  I -->|no| K{redis.auth set?}
  K -->|yes| L[Plaintext REDIS_AUTH_TOKEN]
  K -->|no| M[No Redis auth]
  A -->|inferenceFramework.*| N[inference_framework_config ConfigMap]
  A -->|extraPodEnvFrom| H
  A -->|config.values.infra.cloud_provider=gcp| O[Celery Autoscaler\nGCP Memorystore broker]
```

Last reviewed commit: bf4155c
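As an illustration of the `kindIs "bool"` rendering noted in the summary, the unquoted-boolean pattern in a ConfigMap template generally looks like this sketch (the `.Values.config.values` path is hypothetical):

```yaml
{{- range $key, $value := .Values.config.values }}
{{/* Booleans render bare (true/false); everything else is quoted */}}
{{ $key }}: {{ if kindIs "bool" $value }}{{ $value }}{{ else }}{{ $value | quote }}{{ end }}
{{- end }}
```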