-
Notifications
You must be signed in to change notification settings - Fork 1.2k
✨ [Warm Replicas] Implement warm replica support for controllers. #3192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: godwinpang The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @godwinpang. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
5d1be84
to
5db61c7
Compare
/retest |
5db61c7
to
be1b1c2
Compare
/retest |
/retest |
/retest |
/retest |
e05677a
to
1d07efc
Compare
/retest |
1 similar comment
/retest |
/retest |
…y without WaitForWarmupComplete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review of the prod code. I'll try to get to test code as well
@@ -221,60 +299,71 @@ func (c *Controller[request]) Start(ctx context.Context) error { | |||
// startEventSources launches all the sources registered with this controller and waits | |||
// for them to sync. It returns an error if any of the sources fail to start or sync. | |||
func (c *Controller[request]) startEventSources(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think c.Started doesn't work anymore as expected.
With the current version of the PR it's used for two purposes:
- Ensure a 2nd call to controller.Start returns an error
- Ensure we are appending Sources to c.startWatches only until we call Start()
=> 1. still works as expected. 2 leads to problems
Now the following can happen:
- Warmup is completed (i.e. c.startWatches is nil, c.Started is false)
- Watch is called => source gets added to c.startWatches
- Start is called and skips over the sync.Once in startEventSources
=> So the Source added in 2. is never started
I would suggest to
- introduce a new field called
c.startedEventSources
- use
c.startedEventSources
in l.205 instead ofc.Started
- set
c.startedEventSources
to true instartEventSources
after we setc.startWatches
to nil - Update godoc comment in l.207 to:
// Sources weren't started yet, store the sources locally and return.
// These sources are going to be held until either Warmup() or Start(...) is called.
- Add the following to Warmup after we lock / defer unlock (so further calls to Watch are using this context to start watches instead of nil, Start will overwrite this later)
// Set the internal context.
c.ctx = ctx
@alvaroaleman does this sound correct to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes sense. Also, lets add a test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d8650df fixes this but there is another race that this uncovers while testing
L336 starts a goroutine that can potentially outlive the duration of the caller holding the lock. The errGroup blocks on case <-sourceStartCtx.Done()
, meaning that when the context is cancelled, startEventSources() doesn't wait for the the goroutine on L336 to complete before returning. This means that the mutex acquired in Warmup can be released before L339 is executed.
The problem variable is c.Queue
since it can be read on L339, but also assigned to in Start on L261.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can do something as follows? Keep the current error handling behavior, but add a channel blocking until watch.Start is called in the defer block
select {
case err := <-sourceStartErrChan:
return err
case <-sourceStartCtx.Done():
defer func() {<-sourceStartErrChan}() // <-- ⏰ this is the change
if didStartSyncingSource.Load() { // We are racing with WaitForSync, wait for it to let it tell us what happened
return <-sourceStartErrChan
}
if ctx.Err() != nil { // Don't return an error if the root context got cancelled
return nil
}
return fmt.Errorf("timed out waiting for source %s to Start. Please ensure that its Start() method is non-blocking", watch)
}
edit: 730b30e was the attempt, but looks like it doesn't work because it fails the case where watch.Start blocks indefinitely.
edit 2:
bca3e2a I added a hasAccessedQueueChan
channel to track whether or not c.Queue has been accessed and is safe to release the lock. Ideas for a better name is welcome :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would have to write c.Queue already in New. This also starts the queue though.
So this would also change the observable behavior of functions like NewUnmanaged / NewTypedUnmanaged.
So I think we can't move it into New.
The easiest might be to move it into startEventSourcesLocked (and rename that func to startEventSourcesAndQueueLocked)
Like this
c.didStartEventSourcesOnce.Do(func() {
queue := c.NewQueue(c.Name, c.RateLimiter)
if priorityQueue, isPriorityQueue := queue.(priorityqueue.PriorityQueue[request]); isPriorityQueue {
c.Queue = priorityQueue
} else {
c.Queue = &priorityQueueWrapper[request]{TypedRateLimitingInterface: queue}
}
go func() {
<-ctx.Done()
c.Queue.ShutDown()
}()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one consequence is that depending on if Warmup or Start is actually running the sync.Once in startEventSourcesLocked the queue is getting shutdown if either the Warmup or one of the other runnable groups is shutdown, but this should be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)
Ohh, extremely good catch.
The easiest might be to move it into startEventSourcesLocked (and rename that func to startEventSourcesAndQueueLocked)
This makes sense to me.
I think one consequence is that depending on if Warmup or Start is actually running the sync.Once in startEventSourcesLocked the queue is getting shutdown if either the Warmup or one of the other runnable groups is shutdown, but this should be fine?
I think so, yeah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure that our tests validate that the queue we pass to the sources is non-nil? This should've been caught by tests and not by a human ideally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also reviewed tests now
@godwinpang Pleae keep the conversations open (going forward). It's really hard to find the just resolved/fixed conversations to check that everything was addressed |
// which won't be garbage collected if we hold a reference to it. | ||
c.startWatches = nil | ||
|
||
// Mark event sources as started after resetting the startWatches slice to no-op a Watch() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Watch won't be a no-op. It will start the source immediately instead of storing it in startWatches
@@ -221,60 +299,71 @@ func (c *Controller[request]) Start(ctx context.Context) error { | |||
// startEventSources launches all the sources registered with this controller and waits | |||
// for them to sync. It returns an error if any of the sources fail to start or sync. | |||
func (c *Controller[request]) startEventSources(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is bigger. Does Warmup start the sources with a nil queue? (if it is called before Start)
@godwinpang Heyho, discussed this with @alvaroaleman. I think we're making really good progress on this PR and getting very close to merging it. We would just like to take a bit more time to ensure we don't break anything (I"ll also try to test it with Cluster API). So we'll go ahead and cut the CR v0.21 release now already (a few folks are waiting for that :)). Once this PR merges we'll cherry-pick and cut a v0.21.1 patch release. (I also modified the PR title as this PR is not breaking) |
This change implements the proposal for warm replicas as proposed in #3121.
It adds a
NeedWarmUp
option for controllers to optionally start as warmed replicas.Builds upon #3190.