Skip to content

✨ [Warm Replicas] Implement warm replica support for controllers. #3192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

godwinpang
Copy link
Contributor

@godwinpang godwinpang commented Apr 9, 2025

This change implements the proposal for warm replicas as proposed in #3121.
It adds a NeedWarmUp option for controllers to optionally start as warmed replicas.

Builds upon #3190.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: godwinpang
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 9, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@godwinpang godwinpang marked this pull request as draft April 9, 2025 07:12
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2025
@sbueringer
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from e05677a to 1d07efc Compare May 2, 2025 05:56
@godwinpang
Copy link
Contributor Author

/retest

1 similar comment
@godwinpang
Copy link
Contributor Author

/retest

@godwinpang
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 19, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 20, 2025
Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the prod code. I'll try to get to test code as well

@@ -221,60 +299,71 @@ func (c *Controller[request]) Start(ctx context.Context) error {
// startEventSources launches all the sources registered with this controller and waits
// for them to sync. It returns an error if any of the sources fail to start or sync.
func (c *Controller[request]) startEventSources(ctx context.Context) error {
Copy link
Member

@sbueringer sbueringer May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think c.Started doesn't work anymore as expected.

With the current version of the PR it's used for two purposes:

  1. Ensure a 2nd call to controller.Start returns an error
  2. Ensure we are appending Sources to c.startWatches only until we call Start()

=> 1. still works as expected. 2 leads to problems

Now the following can happen:

  1. Warmup is completed (i.e. c.startWatches is nil, c.Started is false)
  2. Watch is called => source gets added to c.startWatches
  3. Start is called and skips over the sync.Once in startEventSources

=> So the Source added in 2. is never started

I would suggest to

  • introduce a new field called c.startedEventSources
  • use c.startedEventSources in l.205 instead of c.Started
  • set c.startedEventSources to true in startEventSources after we set c.startWatches to nil
  • Update godoc comment in l.207 to:
	// Sources weren't started yet, store the sources locally and return.
	// These sources are going to be held until either Warmup() or Start(...) is called.
  • Add the following to Warmup after we lock / defer unlock (so further calls to Watch are using this context to start watches instead of nil, Start will overwrite this later)
	// Set the internal context.
	c.ctx = ctx

@alvaroaleman does this sound correct to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense. Also, lets add a test for this

Copy link
Contributor Author

@godwinpang godwinpang May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d8650df fixes this but there is another race that this uncovers while testing

L336 starts a goroutine that can potentially outlive the duration of the caller holding the lock. The errGroup blocks on case <-sourceStartCtx.Done(), meaning that when the context is cancelled, startEventSources() doesn't wait for the the goroutine on L336 to complete before returning. This means that the mutex acquired in Warmup can be released before L339 is executed.

The problem variable is c.Queue since it can be read on L339, but also assigned to in Start on L261.

Copy link
Contributor Author

@godwinpang godwinpang May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can do something as follows? Keep the current error handling behavior, but add a channel blocking until watch.Start is called in the defer block

select {
case err := <-sourceStartErrChan:
	return err
case <-sourceStartCtx.Done():
        defer func() {<-sourceStartErrChan}() // <-- ⏰ this is the change
	if didStartSyncingSource.Load() { // We are racing with WaitForSync, wait for it to let it tell us what happened
		return <-sourceStartErrChan
	}
	if ctx.Err() != nil { // Don't return an error if the root context got cancelled
		return nil
	}
	return fmt.Errorf("timed out waiting for source %s to Start. Please ensure that its Start() method is non-blocking", watch)
}

edit: 730b30e was the attempt, but looks like it doesn't work because it fails the case where watch.Start blocks indefinitely.

edit 2:
bca3e2a I added a hasAccessedQueueChan channel to track whether or not c.Queue has been accessed and is safe to release the lock. Ideas for a better name is welcome :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)

Copy link
Member

@sbueringer sbueringer May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would have to write c.Queue already in New. This also starts the queue though.
So this would also change the observable behavior of functions like NewUnmanaged / NewTypedUnmanaged.
So I think we can't move it into New.

The easiest might be to move it into startEventSourcesLocked (and rename that func to startEventSourcesAndQueueLocked)

Like this

	c.didStartEventSourcesOnce.Do(func() {
		queue := c.NewQueue(c.Name, c.RateLimiter)
		if priorityQueue, isPriorityQueue := queue.(priorityqueue.PriorityQueue[request]); isPriorityQueue {
			c.Queue = priorityQueue
		} else {
			c.Queue = &priorityQueueWrapper[request]{TypedRateLimitingInterface: queue}
		}
		go func() {
			<-ctx.Done()
			c.Queue.ShutDown()
		}()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one consequence is that depending on if Warmup or Start is actually running the sync.Once in startEventSourcesLocked the queue is getting shutdown if either the Warmup or one of the other runnable groups is shutdown, but this should be fine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)

Ohh, extremely good catch.

The easiest might be to move it into startEventSourcesLocked (and rename that func to startEventSourcesAndQueueLocked)

This makes sense to me.

I think one consequence is that depending on if Warmup or Start is actually running the sync.Once in startEventSourcesLocked the queue is getting shutdown if either the Warmup or one of the other runnable groups is shutdown, but this should be fine?

I think so, yeah

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure that our tests validate that the queue we pass to the sources is non-nil? This should've been caught by tests and not by a human ideally

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reviewed tests now

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 22, 2025
@sbueringer
Copy link
Member

sbueringer commented May 22, 2025

@godwinpang Pleae keep the conversations open (going forward). It's really hard to find the just resolved/fixed conversations to check that everything was addressed

// which won't be garbage collected if we hold a reference to it.
c.startWatches = nil

// Mark event sources as started after resetting the startWatches slice to no-op a Watch()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Watch won't be a no-op. It will start the source immediately instead of storing it in startWatches

@@ -221,60 +299,71 @@ func (c *Controller[request]) Start(ctx context.Context) error {
// startEventSources launches all the sources registered with this controller and waits
// for them to sync. It returns an error if any of the sources fail to start or sync.
func (c *Controller[request]) startEventSources(ctx context.Context) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)

@sbueringer
Copy link
Member

@godwinpang Heyho, discussed this with @alvaroaleman.

I think we're making really good progress on this PR and getting very close to merging it. We would just like to take a bit more time to ensure we don't break anything (I"ll also try to test it with Cluster API).

So we'll go ahead and cut the CR v0.21 release now already (a few folks are waiting for that :)). Once this PR merges we'll cherry-pick and cut a v0.21.1 patch release.

(I also modified the PR title as this PR is not breaking)

@sbueringer sbueringer changed the title ⚠️ [Warm Replicas] Implement warm replica support for controllers. ✨ [Warm Replicas] Implement warm replica support for controllers. May 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants