Adding SGP-related extensible configuration to the model engine helm chart #762

Open

arniechops wants to merge 6 commits into main from arnavchopra/extensibility-hooks

Conversation


@arniechops arniechops commented Feb 20, 2026

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

Greptile Summary

This PR introduces extensibility hooks and multi-cloud support to the model engine Helm chart (version bumped to 0.2.0-beta.1), enabling deployment in airgapped, on-prem, and GCP environments alongside the existing AWS and Azure paths.

  • Extensibility hooks: Adds extraEnvVars, extraVolumes, extraVolumeMounts, and extraPodEnvFrom to allow wrapper charts to inject custom configuration without forking templates
  • Configurable utility images: Replaces hardcoded public.ecr.aws references for busybox, curl, and ubuntu with utilityImages.* values, enabling private registry mirrors
  • Service account token control: Adds automountServiceAccountToken toggle across all deployments/jobs, with projected token volume fallback (CA cert, namespace, short-lived token) for environments requiring explicit token management
  • Redis auth via Kubernetes Secret: Adds redis.authSecretName/redis.authSecretKey for secret-based Redis auth, prioritized over plaintext redis.auth
  • GCP support: Adds GCP Memorystore Redis broker to celery autoscaler, and separate azure.inference_client_id for inference service accounts
  • Helm hook improvements: Appends Release.Revision to job names and adds before-hook-creation delete policy across all hook jobs for idempotent upgrades
  • ConfigMap boolean handling: Fixes service_config_map.yaml to render boolean config values without quotes using kindIs "bool"
  • Configurable inference frameworks: Makes inference framework image tags configurable via inferenceFramework.* values instead of hardcoded "latest"
  • GPU tolerations: Adds configurable gpuTolerations to balloon deployments and service templates
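
As a rough illustration of how a wrapper chart might use the new hooks, assuming they follow the standard Kubernetes env/volume list schemas (the four top-level keys come from this PR; every concrete name and URL below is a placeholder):

```yaml
# Hypothetical wrapper-chart values.yaml overrides; names are illustrative.
extraEnvVars:
  - name: SGP_API_BASE
    value: "https://sgp.internal.example.com"
extraVolumes:
  - name: custom-ca
    configMap:
      name: internal-ca-bundle
extraVolumeMounts:
  - name: custom-ca
    mountPath: /etc/ssl/custom
    readOnly: true
extraPodEnvFrom:
  - secretRef:
      name: sgp-credentials
```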

Confidence Score: 4/5

  • This PR is largely safe to merge — it adds well-structured extensibility without changing default behavior, though two known issues flagged in previous threads should be resolved first.
  • The changes are backward-compatible (all new values have sensible defaults), the patterns are applied consistently across templates, and the core extensibility design is sound. Deducting one point for the previously flagged `with`-gate issue on `automountServiceAccountToken` in `service_account.yaml` (which silently drops the field when it is set to `false`) and for the hook-weight collision between `database_init_job` and `database_migration_job`; both issues were raised in prior review threads.
  • Pay close attention to `charts/model-engine/templates/service_account.yaml` (the `automountServiceAccountToken` `with` gate) and `charts/model-engine/templates/database_init_job.yaml` (hook-weight ordering relative to the migration job).

Important Files Changed

| Filename | Overview |
| --- | --- |
| `charts/model-engine/templates/_helpers.tpl` | Adds projected token volume/mount helpers for non-automount SA token mode, extra env/volume/volumeMount extensibility hooks, Redis auth via Kubernetes Secret, and on-prem AWS_PROFILE handling. Core helper changes that cascade to all deployments. |
| `charts/model-engine/values.yaml` | Adds new configurable defaults for inference frameworks, extensibility hooks (`extraEnvVars`, `extraVolumes`, `extraVolumeMounts`, `extraPodEnvFrom`), utility images, GPU tolerations, `automountServiceAccountToken`, and Redis secret-based auth. Missing trailing newline at end of file. |
| `charts/model-engine/templates/service_account.yaml` | Adds `automountServiceAccountToken` to the ServiceAccount resource using a `with` gate. The `with` block is falsy when the value is `false`, meaning the field won't render when disabled — a known issue flagged in a previous review thread. |
| `charts/model-engine/templates/celery_autoscaler_stateful_set.yaml` | Adds GCP Memorystore Redis broker support, `automountServiceAccountToken` control, and projected token volumes/mounts. Restructures volumeMounts/volumes to always render keys even when AWS is not configured. |
| `charts/model-engine/templates/database_init_job.yaml` | New file for database initialization hook job. Uses the same hook-weight (-1) as the migration job, so Helm doesn't guarantee ordering between the two. |
| `charts/model-engine/templates/service_config_map.yaml` | Adds a `kindIs "bool"` check to render boolean config values without quotes, applied consistently across all four config sections (launch + infra in both namespaces). |
| `charts/model-engine/templates/service_template_config_map.yaml` | Adds GPU tolerations to all service templates, configurable utility images (busybox, curl), `extraPodEnvFrom` for inference pods, `Release.Revision` in job names, and captures `gpuTolerations` from values. |
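
For the `service_config_map.yaml` boolean fix, a minimal sketch of the usual Helm pattern, assuming each config section is ranged over as a map (the exact values path below is illustrative, not necessarily the chart's real layout):

```yaml
# Sketch: quote everything except booleans, using Sprig's kindIs.
# The map path ($.Values.config.values.launch) is a placeholder.
{{- range $key, $value := $.Values.config.values.launch }}
{{ $key }}: {{ if kindIs "bool" $value }}{{ $value }}{{ else }}{{ $value | quote }}{{ end }}
{{- end }}
```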

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[values.yaml] -->|defaults| B[_helpers.tpl]
    A -->|automountServiceAccountToken| C{automount enabled?}
    C -->|true| D[Standard SA token mount]
    C -->|false| E[Projected token volume\nCA cert + namespace + short-lived token]

    B -->|serviceEnvGitTagFromHelmVar| F[Gateway / Builder / Cacher]
    B -->|volumes + volumeMounts| F
    A -->|extraEnvVars| B
    A -->|extraVolumes / extraVolumeMounts| B

    A -->|utilityImages.*| G[Balloon Deployments]
    A -->|utilityImages.*| H[Service Templates\nbusybox / curl]
    A -->|gpuTolerations| G
    A -->|gpuTolerations| H

    A -->|redis.authSecretName| I{Secret-based Redis auth?}
    I -->|yes| J[secretKeyRef for REDIS_AUTH_TOKEN]
    I -->|no| K{redis.auth set?}
    K -->|yes| L[Plaintext REDIS_AUTH_TOKEN]
    K -->|no| M[No Redis auth]

    A -->|inferenceFramework.*| N[inference_framework_config ConfigMap]
    A -->|extraPodEnvFrom| H
    A -->|config.values.infra.cloud_provider=gcp| O[Celery Autoscaler\nGCP Memorystore broker]
```
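
The secret-over-plaintext priority for Redis auth shown in the flowchart can be sketched as a template snippet. The value names (`redis.authSecretName`, `redis.authSecretKey`, `redis.auth`) and the `REDIS_AUTH_TOKEN` variable come from this PR; the `default "auth-token"` fallback key is an illustrative assumption:

```yaml
# Sketch: prefer the Kubernetes Secret over plaintext auth.
{{- if .Values.redis.authSecretName }}
- name: REDIS_AUTH_TOKEN
  valueFrom:
    secretKeyRef:
      name: {{ .Values.redis.authSecretName }}
      key: {{ .Values.redis.authSecretKey | default "auth-token" }}  # default key is a placeholder
{{- else if .Values.redis.auth }}
- name: REDIS_AUTH_TOKEN
  value: {{ .Values.redis.auth | quote }}
{{- end }}
```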

Last reviewed commit: bf4155c

@greptile-apps greptile-apps bot left a comment

22 files reviewed, 5 comments



greptile-apps bot commented Feb 20, 2026

Additional Comments (1)

charts/model-engine/templates/proportional_a100_autoscaler_deployment.yaml (line 30)

**Hardcoded image breaks airgapped deployments**

This PR parameterizes ubuntu, busybox, and curl images via `utilityImages` for airgapped/on-prem environments, but the `cluster-proportional-autoscaler` image from `registry.k8s.io` remains hardcoded here and in the a10/t4 autoscaler deployments. In an airgapped environment, `registry.k8s.io` would be unreachable just like the other registries that were parameterized.

Consider adding a `utilityImages.clusterProportionalAutoscaler` entry (or similar) to `values.yaml` and referencing it here:

```suggestion
        - image: {{ .Values.utilityImages.clusterProportionalAutoscaler.repository }}:{{ .Values.utilityImages.clusterProportionalAutoscaler.tag }}
```
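A possible `values.yaml` shape for that entry, as a sketch (the repository path is where upstream publishes `cluster-proportional-autoscaler`; the tag is a placeholder to be pinned to a real release):

```yaml
# Hypothetical values.yaml addition; tag is illustrative, not verified.
utilityImages:
  clusterProportionalAutoscaler:
    repository: registry.k8s.io/cpa/cluster-proportional-autoscaler
    tag: "1.8.9"
```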

charts/model-engine/templates/service_account.yaml

Comment on lines +24 to +26:

```
{{- with $.Values.automountServiceAccountToken }}
automountServiceAccountToken: {{ . }}
{{- end }}
```

**`with` skips block when value is `false`**

Helm's `with` treats `false` as falsy, so when `automountServiceAccountToken` is set to `false` (the exact scenario where you want to disable automount on the ServiceAccount), this block is skipped entirely and `automountServiceAccountToken` is not rendered into the YAML. The field appears only when the value is `true`, which is the opposite of the intended behavior.

In contrast, the pod-level specs in other templates correctly use `{{ .Values.automountServiceAccountToken }}` directly, without a `with` gate.

Suggested change:

```suggestion
automountServiceAccountToken: {{ $.Values.automountServiceAccountToken }}
```
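If the field should instead be omitted entirely when unset, rather than always rendered, a hedged alternative is Sprig's `hasKey`, which checks key presence rather than truthiness:

```yaml
# Always renders the field, for both true and false:
automountServiceAccountToken: {{ $.Values.automountServiceAccountToken }}

# Alternative sketch: render only when the key is set in values at all,
# so charts that never set it keep the Kubernetes default behavior.
{{- if hasKey $.Values "automountServiceAccountToken" }}
automountServiceAccountToken: {{ $.Values.automountServiceAccountToken }}
{{- end }}
```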

charts/model-engine/templates/database_init_job.yaml

Comment on line 10 (context):

```
{{- include "modelEngine.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-1"
```

**Hook weight collision with migration job**

Both `database_init_job.yaml` and `database_migration_job.yaml` use `"helm.sh/hook-weight": "-1"` on `pre-install,pre-upgrade`, so Helm does not guarantee execution order between them. Since `init_database.py` creates schemas (`create schema if not exists`) and tables (`Base.metadata.create_all`), and the migration job applies Alembic migrations that depend on those schemas existing, the migration could run first and fail.

Consider setting this job's hook weight to `"-2"` (or lower) so it is guaranteed to run before the migration job at weight `"-1"`.

Suggested change:

```suggestion
    "helm.sh/hook-weight": "-2"
```
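Since Helm executes hooks for the same event in ascending `helm.sh/hook-weight` order, the resulting annotations would look like this sketch (the `before-hook-creation` delete policy mirrors what this PR already adds to its hook jobs):

```yaml
# database_init_job.yaml -- runs first at weight -2
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-2"
  "helm.sh/hook-delete-policy": before-hook-creation

# database_migration_job.yaml -- runs second at weight -1
annotations:
  "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-1"
  "helm.sh/hook-delete-policy": before-hook-creation
```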


greptile-apps bot commented Feb 25, 2026

Additional Comments (2)

charts/model-engine/templates/populate_fine_tuning_repository_job.yaml (line 44)

**Missing `automountServiceAccountToken` setting**

All other pod specs in this PR (gateway, builder, cacher, celery autoscaler, database migration, database init, spellbook init, and the proportional autoscalers) explicitly set `automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}`. This job and `restart_keda_operator.yaml` are the only ones missing it.

When `automountServiceAccountToken` is `false`, the `modelEngine.volumes` helper will inject the projected `token-volume`, but without disabling the default automount on the pod spec, Kubernetes will also mount its own default token — resulting in a conflict at `/var/run/secrets/kubernetes.io/serviceaccount`.

```suggestion
      automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}
      serviceAccountName: {{ include "modelEngine.fullname" . }}
```

charts/model-engine/templates/restart_keda_operator.yaml (line 43)

**Missing `automountServiceAccountToken` setting**

Same inconsistency as `populate_fine_tuning_repository_job.yaml` — this job is missing the `automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}` setting that every other pod spec in this PR has. This will cause issues when `automountServiceAccountToken` is set to `false`.

```suggestion
      automountServiceAccountToken: {{ .Values.automountServiceAccountToken }}
      serviceAccountName: {{ include "modelEngine.fullname" . }}
```
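For context, a projected service-account token volume of the kind the `modelEngine.volumes` helper injects typically has this standard Kubernetes shape. The volume name `token-volume` comes from the review comments above; `expirationSeconds` and the exact source list are illustrative, chosen to replicate the default mount (short-lived token, cluster CA cert, namespace):

```yaml
# Sketch of a projected SA token volume equivalent to the default
# automounted one; expirationSeconds is a placeholder value.
volumes:
  - name: token-volume
    projected:
      sources:
        - serviceAccountToken:
            path: token
            expirationSeconds: 3600
        - configMap:
            name: kube-root-ca.crt
            items:
              - key: ca.crt
                path: ca.crt
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
volumeMounts:
  - name: token-volume
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true
```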
