
Conversation

liu-cong (Contributor)

What type of PR is this?

/kind bug

What this PR does / why we need it:

In addition to fixing the bug, I made the following improvements:

  1. Adjusted the GKE health check and the EPP readiness check to run every 2s instead of 5s. This allows a faster switchover to the new EPP leader (in my tests the switchover went from 10-15s to less than 5s).
  2. Consolidated all HA-related config into a single enableLeaderElection flag (see the sketch below). Previously, enabling HA required three steps: 1. set enableLeaderElection=true, 2. set replicas=3, 3. set flags.ha-enable-leader-election=true. This was hard to get right.
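
With the consolidation, HA is toggled with a single value. A minimal sketch of the intended values.yaml usage (the chart is assumed to derive the replica count, the --ha-enable-leader-election EPP flag, and the lease RBAC from this one switch):

inferenceExtension:
  # Single switch for HA: when true, the chart sets the replicas,
  # passes --ha-enable-leader-election to the EPP, and creates the lease RBAC.
  enableLeaderElection: true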

Which issue(s) this PR fixes:

Fixes #1619

Does this PR introduce a user-facing change?:


@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 19, 2025

netlify bot commented Sep 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 85398c9
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68cd9bb2d2042c0008268b77
😎 Deploy Preview https://deploy-preview-1620--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 19, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 19, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2025
@@ -1,5 +1,5 @@
inferenceExtension:
replicas: 1
enableLeaderElection: false
Collaborator:

Wondering if we actually want to make the default true, since this is the only path to HA while maintaining prefix-cache performance.

liu-cong (Contributor Author):

Trying to respect the current default since this is a patch to a release, but I would be open to that as the best practice.

{{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
{{- if .Values.inferenceExtension.enableLeaderElection }}
Collaborator:

Is there a way to articulate this where the default is 3 if leaderElection is enabled, and the default is 1 otherwise?

This would still allow a user to specify replica count if desired.

We currently suggest active-passive as the best HA practice, but a user could decide they would rather use active-active, incur the performance cost (or maybe their algorithm works fine with active-active), and use multiple replicas.

liu-cong (Contributor Author):

It's a tradeoff I had to make between simplicity/best practice and flexibility. I think in Helm we should prioritize the former, as advanced users can always fork and tweak for the additional flexibility they want.

So, per the current best practices, we recommend HA with 3 replicas for "critical" use cases and 1 replica for non-critical ones. We don't recommend active-active for routing performance reasons. Users can do that if they understand the details, but we don't offer it out of the box in Helm. My worry is that if we do offer it, users will find the performance worse than what we advertise, and it won't be obvious why.

Open for debate, but I think simplicity is quite important here. In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty, and meanwhile it's hard to configure leader election properly because there are 3 things going on (the flag, the replicas, and the RBAC).

liu-cong (Contributor Author), Sep 19, 2025:

Perhaps we just need to add some more documentation explaining that if you want to go active-active, you can tweak it this way, and here are the implications.

Is this an acceptable outcome? I do think that users who want active-active need to understand the implications, and these are likely "advanced" use cases. We don't need to make it simple, but we do need to articulate it.

Contributor:

I think 3 is a reasonable default that not many would want to change.

Contributor:

This change makes the replicas field hardcoded rather than overridable.

In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty

I agree with the above, therefore I suggest keeping replicas always 1 when leader election is disabled. But can we make it overridable at least in the leader-election-enabled setup? It should be possible to override the number of replicas easily. I'm expecting something like:

{{- if .Values.inferenceExtension.enableLeaderElection }}
replicas: {{ .Values.inferenceExtension.replicas | default 3 }}
{{- else }}
replicas: 1
{{- end }}

So we get a default of 3 or 1 (depending on the HA setting), but can still override the replicas value as we wish.

nirrozenbaum (Contributor), Sep 21, 2025:

Alternatively, another proposal: we can remove the enableLeaderElection flag completely from the Helm chart and use only the replicas field. Then we add an if/else to the Helm templates:

if replicas is 1 - no leader election
if replicas is more than 1 - leader election enabled

This way there is no way for users to get confused in their setup, because they set only a single field and we enable leader election for them automatically. So we change the deployment template as follows:

{{- if gt .Values.inferenceExtension.replicas 1 }}
- --ha-enable-leader-election

I like this proposal more, since it keeps users away from the leader election toggle and keeps them focused only on the number of replicas (a fuller sketch of the template is shown after this comment). We currently don't want to support active-active mode, and therefore it shouldn't be possible to configure it through our Helm chart.
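
A slightly fuller version of that template change might look like the following. This is only a sketch: the int cast, the default value, and the surrounding args layout are assumptions, not the chart's exact structure; --ha-enable-leader-election is the flag name used in the snippet above.

{{- /* Deployment spec: replicas stays user-controlled. */}}
replicas: {{ .Values.inferenceExtension.replicas | default 1 }}

{{- /* EPP container args: derive leader election from the replica count. */}}
args:
  {{- if gt (int .Values.inferenceExtension.replicas) 1 }}
  {{- /* More than one replica implies active-passive HA, so enable leader election. */}}
  - --ha-enable-leader-election
  {{- end }}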

Contributor:

Created PR #1628.

yangligt2 (Contributor):

Regarding the 1st fix: "Adjusted the gke health check and EPP readiness check to every 2s instead of 5s, this allows faster switch over to the new EPP leader (in my tests this goes from 10-15s to less than 5s)."
I'm not exactly sure if the health check frequency will affect the failover time.
According to this doc: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

--leader-elect-lease-duration duration     Default: 15s
The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled.

--leader-elect-renew-deadline duration     Default: 10s
The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled.

--leader-elect-retry-period duration     Default: 2s
The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled.

liu-cong (Contributor Author):

RE: Health check frequency

I did observe that with these changes the new leader becomes ready much faster. However, I did not test end-to-end whether the request downtime is actually shorter. These changes are probably beneficial (or at least do no harm) anyway.

@yangligt2
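
For reference, the probe-frequency change amounts to something like the following on the EPP container. This is only a sketch; the probe type, port, and failure threshold shown here are assumptions, not the chart's exact values:

readinessProbe:
  grpc:
    port: 9003        # assumed EPP health/readiness port
  periodSeconds: 2    # previously 5s; probing more often lets the new leader report Ready sooner
  failureThreshold: 3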

yangligt2 (Contributor):

Cool. Then let's go with the health check frequency change.

ahg-g (Contributor) commented Sep 19, 2025:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2025
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 248f4e9 into kubernetes-sigs:main Sep 19, 2025
11 checks passed
kfswain pushed a commit that referenced this pull request Sep 23, 2025


Development

Successfully merging this pull request may close these issues.

Rollout is stuck when leader election is enabled
