
Conversation

liu-cong (Contributor)

What type of PR is this?

/kind bug

What this PR does / why we need it:

In addition to fixing the bug, I made the following improvements:

  1. Adjusted the GKE health check and the EPP readiness check to run every 2s instead of 5s. This allows a faster switchover to the new EPP leader (in my tests the switchover went from 10-15s to less than 5s).
  2. Consolidated all HA-related config into a single enableLeaderElection flag (see the sketch below). Previously, enabling HA required three steps: 1. set enableLeaderElection=true, 2. set replicas=3, 3. set flags.ha-enable-leader-election=true. This was hard to get right.
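
With the consolidation, HA is toggled with a single value. A minimal sketch of the intended values.yaml usage (the chart is assumed to derive the replica count, the --ha-enable-leader-election EPP flag, and the lease RBAC from this one switch):

inferenceExtension:
  # Single switch for HA: when true, the chart sets the replicas,
  # passes --ha-enable-leader-election to the EPP, and creates the lease RBAC.
  enableLeaderElection: true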

Which issue(s) this PR fixes:

Fixes #1619

Does this PR introduce a user-facing change?:


@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 19, 2025

netlify bot commented Sep 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 85398c9
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68cd9bb2d2042c0008268b77
😎 Deploy Preview https://deploy-preview-1620--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 19, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 19, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2025
@@ -1,5 +1,5 @@
inferenceExtension:
replicas: 1
enableLeaderElection: false
Collaborator:

Wondering if we actually want to make the default true, since this is the only path to HA while maintaining prefix-cache performance.

liu-cong (Contributor Author):

Trying to respect the current default since this is a patch to a release, but I would be open to that as the best practice.

{{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
{{- if .Values.inferenceExtension.enableLeaderElection }}
Collaborator:

Is there a way to articulate this where the default is 3 if leaderElection is enabled, and the default is 1 otherwise?

This would still allow a user to specify replica count if desired.

We currently suggest active-passive as the best HA practice, but a user could decide they would rather use active-active, incur the performance cost (or maybe their algorithm works fine with active-active), and use multiple replicas.

liu-cong (Contributor Author):

It's a tradeoff I had to make between simplicity/best practice and flexibility. I think in Helm we should prioritize the former, as advanced users can always fork and tweak for the additional flexibility they want.

So, per the current best practices, we recommend HA with 3 replicas for "critical" use cases and 1 replica for non-critical ones. We don't recommend active-active for routing performance reasons. Users can do that if they understand the details, but we don't offer it out of the box in Helm. My worry is that if we do offer it, users will find the performance worse than what we advertise, and it won't be obvious why.

Open for debate, but I think simplicity is quite important here. In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty, and meanwhile it's hard to configure leader election properly because there are 3 things going on (the flag, the replicas, and the RBAC).

liu-cong (Contributor Author), Sep 19, 2025:

Perhaps we just need to add some more documentation explaining that if you want to go active-active, you can tweak it this way, and here are the implications.

Is this an acceptable outcome? I do think that users who want active-active need to understand the implications, and these are likely "advanced" use cases. We don't need to make it simple, but we do need to articulate it.

Contributor:

I think 3 is a reasonable default that not many would want to change.

Contributor:

This change makes the replicas field hardcoded rather than overridable.

In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty

I agree with the above, therefore I suggest keeping replicas always 1 when leader election is disabled. But can we make it overridable at least in the leader-election-enabled setup? It should be possible to override the number of replicas easily. I'm expecting something like:

{{- if .Values.inferenceExtension.enableLeaderElection }}
replicas: {{ .Values.inferenceExtension.replicas | default 3 }}
{{- else }}
replicas: 1
{{- end }}

So we get a default of 3 or 1 (depending on the HA setting), but can still override the replicas value as we wish.

nirrozenbaum (Contributor), Sep 21, 2025:

Alternatively, another proposal: we can remove the enableLeaderElection flag completely from the Helm chart and use only the replicas field. Then we add an if/else to the Helm templates:

if replicas is 1 - no leader election
if replicas is more than 1 - leader election enabled

This way there is no way for users to get confused in their setup, because they set only a single field and we enable leader election for them automatically. So we change the deployment template as follows:

{{- if gt .Values.inferenceExtension.replicas 1 }}
- --ha-enable-leader-election

I like this proposal more, since it keeps users away from the leader election toggle and keeps them focused only on the number of replicas (a fuller sketch of the template is shown after this comment). We currently don't want to support active-active mode, and therefore it shouldn't be possible to configure it through our Helm chart.
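
A slightly fuller version of that template change might look like the following. This is only a sketch: the int cast, the default value, and the surrounding args layout are assumptions, not the chart's exact structure; --ha-enable-leader-election is the flag name used in the snippet above.

{{- /* Deployment spec: replicas stays user-controlled. */}}
replicas: {{ .Values.inferenceExtension.replicas | default 1 }}

{{- /* EPP container args: derive leader election from the replica count. */}}
args:
  {{- if gt (int .Values.inferenceExtension.replicas) 1 }}
  {{- /* More than one replica implies active-passive HA, so enable leader election. */}}
  - --ha-enable-leader-election
  {{- end }}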

Contributor:

Created PR #1628.

yangligt2 (Contributor):

Regarding the 1st fix: "Adjusted the gke health check and EPP readiness check to every 2s instead of 5s, this allows faster switch over to the new EPP leader (in my tests this goes from 10-15s to less than 5s)."
I'm not exactly sure if the health check frequency will affect the failover time.
According to this doc: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

--leader-elect-lease-duration duration     Default: 15s
The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled.

--leader-elect-renew-deadline duration     Default: 10s
The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled.

--leader-elect-retry-period duration     Default: 2s
The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled.

liu-cong (Contributor Author):

RE: Health check frequency

I did observe that with these changes the new leader becomes ready much faster. However, I did not test end-to-end whether the request downtime is actually shorter. These changes are probably beneficial (or at least do no harm) anyway.

@yangligt2
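
For reference, the probe-frequency change amounts to something like the following on the EPP container. This is only a sketch; the probe type, port, and failure threshold shown here are assumptions, not the chart's exact values:

readinessProbe:
  grpc:
    port: 9003        # assumed EPP health/readiness port
  periodSeconds: 2    # previously 5s; probing more often lets the new leader report Ready sooner
  failureThreshold: 3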

yangligt2 (Contributor):

Cool. Then let's go with the health check frequency change.

ahg-g (Contributor) commented Sep 19, 2025:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2025
k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 248f4e9 into kubernetes-sigs:main Sep 19, 2025
11 checks passed
kfswain pushed a commit that referenced this pull request Sep 23, 2025


Development

Successfully merging this pull request may close these issues.

Rollout is stuck when leader election is enabled
