Skip to content

✨ Implement warm replica support for controllers #3192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 45 commits into
base: main
Choose a base branch
from

Conversation

godwinpang
Copy link
Contributor

@godwinpang godwinpang commented Apr 9, 2025

This change implements the proposal for warm replicas as proposed in #3121.
It adds an EnableWarmup option for controllers to optionally start as warmed replicas, which means that sources will be started before leader election has been won by the instance.

It also adds a new internal runnable interface called warmupRunnable with a single Warmup method that will be called before leader election. There is no guarantee that the Warmup method returns before leader election, just that it is called.

The controller's implementation of the Warmup method for the warmupRunnable starts the sources with a threadsafe method startEventSourcesAndQueueLocked that also adds events to the queue. In the case of a non-leader elected controller, it is intended for the Warmup method to race with the Start method to start the sources. The methods are synchronized by the didStartEventSourcesOnce sync.Once used in startEventSourcesAndQueueLocked.

For the most part, Warmup behaves exactly the same as Start with the following differences

  • There is no restriction as to how many times Warmup can be called, vs. Start which can only be called once.
  • Warmup doesn't initialize metrics for the controller, nor does it start the worker goroutines.

Shutdown

Warmup runnables are shutdown in a separate goroutine from the one that stops the leader election runnables, because there is no guarantee as to whether or not the Warmup or the Start method is holding the lock for the didStartEventSourcesOnce sync.Once instance.

Testing

On top of the unit + integration tests, this PR was tested with an internal project that ran this branch of controller-runtime for a week with (EnableWarmup=true`, and verified that the controllers on the non-leader elected replica had a populated workqueue.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 9, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@godwinpang godwinpang marked this pull request as draft April 9, 2025 07:12
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2025
@sbueringer
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from e05677a to 1d07efc Compare May 2, 2025 05:56
@godwinpang
Copy link
Contributor Author

/retest

1 similar comment
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from 1c32d37 to 7cc29dc Compare May 2, 2025 06:26
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 22, 2025
@sbueringer
Copy link
Member

sbueringer commented May 22, 2025

@godwinpang Pleae keep the conversations open (going forward). It's really hard to find the just resolved/fixed conversations to check that everything was addressed

// which won't be garbage collected if we hold a reference to it.
c.startWatches = nil

// Mark event sources as started after resetting the startWatches slice to no-op a Watch()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watch won't be a no-op. It will start the source immediately instead of storing it in startWatches

@@ -221,60 +299,71 @@ func (c *Controller[request]) Start(ctx context.Context) error {
// startEventSources launches all the sources registered with this controller and waits
// for them to sync. It returns an error if any of the sources fail to start or sync.
func (c *Controller[request]) startEventSources(ctx context.Context) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)

@sbueringer
Copy link
Member

@godwinpang Heyho, discussed this with @alvaroaleman.

I think we're making really good progress on this PR and getting very close to merging it. We would just like to take a bit more time to ensure we don't break anything (I"ll also try to test it with Cluster API).

So we'll go ahead and cut the CR v0.21 release now already (a few folks are waiting for that :)). Once this PR merges we'll cherry-pick and cut a v0.21.1 patch release.

(I also modified the PR title as this PR is not breaking)

@sbueringer sbueringer changed the title ⚠️ [Warm Replicas] Implement warm replica support for controllers. ✨ [Warm Replicas] Implement warm replica support for controllers. May 22, 2025
@godwinpang
Copy link
Contributor Author

A heads up that I am OOO this week so there will not be much progress on this PR. We're going to enable this warmup feature internally on June 2nd when I get back for some soak time.

@godwinpang
Copy link
Contributor Author

We've been running a forked version of c-r with EnableWarmup set to true for the past 2 days or so, and everything is working fine. The workqueue_depth metric shows that the non leader-elected pod's workqueues are being populated as expected.

I'll take a stab at

  1. Adding a few more tests around Warmup behavior to make sure it has parity with Start esp around context timeout/cancellation.
  2. Updating the PR description + design with what the latest implementation is.

Anything else we are looking to do to gain more confidence around this feature and get it merged? 😄

@sbueringer
Copy link
Member

Sounds fine to me

@alvaroaleman
Copy link
Member

@godwinpang any plans on wrapping this up?

@godwinpang
Copy link
Contributor Author

@godwinpang any plans on wrapping this up?

Sorry, I missed this Github notification. I've updated the PR description and also duplicated the tests between Start and Warmup as appropriate.

Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!
/hold

in case Stefan has anything to add

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 781742318cfc9c5c72173ae51ebe9fb41673044d

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, godwinpang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
@alvaroaleman alvaroaleman changed the title ✨ [Warm Replicas] Implement warm replica support for controllers. ✨ Implement warm replica support for controllers Jun 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants