Make sure no more than one deployer pod is created #16129
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tnozicka. Assign the PR to them by writing /assign @tnozicka in a comment. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
// We need to transition the deployment to "pending" to make sure we can't ever create the deployer pod twice.
// If we would fail to update phase after and the deployer pod would have gotten deleted in the meantime
// we would have recreated it here.
// The disadvantage is that if the Create call fails we will never create the deployer pod.
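A minimal sketch of the ordering the comment above describes - flip the phase to Pending first, then create the deployer pod. The helpers and fields here (updateDeploymentPhase, makeDeployerPod, c.pn) are assumptions for illustration, not the actual code in this PR:

// Sketch only; helper and field names are assumptions, not the PR's real functions.
func (c *DeploymentController) createDeployerOnce(deployment *v1.ReplicationController) error {
	// 1. Move the deployment out of "New" first. If this update fails we just retry
	//    the whole sync; no deployer pod has been created yet, so nothing leaks.
	if err := c.updateDeploymentPhase(deployment, deployapi.DeploymentStatusPending); err != nil {
		return err
	}
	// 2. Only then create the deployer pod. If this Create fails, later syncs see the
	//    deployment in "Pending" and will not blindly create a second deployer pod.
	_, err := c.pn.Pods(deployment.Namespace).Create(c.makeDeployerPod(deployment))
	return err
}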
It's interesting that we have an old controller loop observing that it needs to create a pod. It's a retry? Or something else?
That doesn't mean this might not be the right choice, but I'm concerned about retry failing. The controller itself seems like it has a concurrency bug that should be fixed, rather than changing how this code runs.
I.e.:
1. Observe need for new deployment.
2. Observe running RCs - after this point, the controller should never try to start an older RC than the one it sees.
3. Cancel older deployment pods.
4. Create deployment pod for most recent, but only if 2 still holds.
Can we correct 4 to ensure the invariant from 2 inside the controller is respected? This bug implies 4 is proceeding even though we started back at 1.
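A sketch of the check step 4 would need under this proposal, assuming the controller recorded the newest RC it observed at step 2 in a hypothetical latestObservedRC field, and that deployutil.DeploymentVersionFor is the existing version-annotation helper:

// Sketch of the proposed invariant, not actual controller code.
func (c *DeploymentController) createDeployerIfStillLatest(rc *v1.ReplicationController) error {
	// Step 4: refuse to create a deployer for anything older than what step 2 saw.
	if deployutil.DeploymentVersionFor(rc) < deployutil.DeploymentVersionFor(c.latestObservedRC) {
		return nil
	}
	_, err := c.pn.Pods(rc.Namespace).Create(c.makeDeployerPod(rc)) // assumed helpers
	return err
}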
@smarterclayton I don't think that can be done. The controller processes the queue in parallel for different keys (RCs). The issue is that you can't do an atomic operation on more than one object.
See my comment in the main section. (I didn't want it to get lost/hidden here in case of a rebase.)
@smarterclayton Having to prove to you that the current design was racy might actually have helped me discover the reason why this happened. Let's assess the current deployer design. Here is how it breaks:
And after step 3 it can even successfully deploy more deployments. In the failed test it managed to create one successfully, and while deploying the other one it failed on the test check for multiple running deployers. I guess we can probably make some hacks and do a live get to check the deployment resourceVersion before creating a new deployer, although that might not be the best course of action.
cc @Kargakis
For your example, is an older version of the deployment in the cache (when the key comes out of the rate limited queue)? So even though we actually updated the phase in etcd, the latest event has not yet been delivered to our cache? If so, one way to manage that is through an "expectation" - marking that deployment as "up to date" with the newly updated RV or generation, so when we see the older event come into our queue we know to ignore it.
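A rough sketch of that idea, tracking the last resourceVersion the controller has acted on per key. The fields here are assumptions, not the ControllerExpectations type the PR later adopts; resourceVersions are opaque strings in general, so the numeric parse is for illustration only:

// Sketch only; lastHandledRV and lastRVLock are assumed fields, "strconv" assumed imported.
func (c *DeploymentController) isStaleEvent(key string, cached *v1.ReplicationController) bool {
	c.lastRVLock.Lock()
	defer c.lastRVLock.Unlock()
	seen, ok := c.lastHandledRV[key] // RV we last wrote or fully processed for this key
	if !ok {
		return false
	}
	cur, err1 := strconv.ParseUint(cached.ResourceVersion, 10, 64)
	old, err2 := strconv.ParseUint(seen, 10, 64)
	// If the cache still holds an older RC than the one we already acted on,
	// skip this event; a newer one is (or will be) delivered for the same key.
	return err1 == nil && err2 == nil && cur < old
}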
Yes, newer RC events are already processed, and after that the old event is added back to the queue from the waiting queue once the rate limiting period passes.
The latest event from that update has been delivered to our cache. The issue is that the old one got re-queued after the latest event had already been processed, and since it's older it contains an outdated RC.
Yes, we could write a new ControllerExpectations to track processed objects in the cache and disregard events with an older resourceVersion. But:
I guess we shouldn't re-queue on an Update conflict anyway, since a 409 has to mean that there is a newer RC event already in our queue - so why re-enqueue the old one? Such events may actually fill up the queue because the 409 will repeat and re-queue over and over until they reach maxRequeueLimit, unless they manage to fail on something else in between, like creating a deployer pod that already exists. But that somehow only affects the specific case that broke our tests. The current approach still seems racy to me - a new example:
To ensure that there is no more than one deployer pod it would be sufficient to do a live get and make sure the deployment is still in the same phase. To avoid this race of creating the deployer pod more than once entirely, we need to reorder the logic.
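A sketch of the live-get guard mentioned above; c.rn is assumed to be a live ReplicationControllers client and deployutil.DeploymentStatusFor the existing status-annotation helper:

// Sketch only: re-read the RC from the API server right before creating the
// deployer pod and bail out unless it is still in phase New.
func (c *DeploymentController) stillNew(deployment *v1.ReplicationController) (bool, error) {
	fresh, err := c.rn.ReplicationControllers(deployment.Namespace).Get(deployment.Name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return deployutil.DeploymentStatusFor(fresh) == deployapi.DeploymentStatusNew, nil
}

Note that this only narrows the window; it does not close the race, which is why the comment above calls for reordering the logic.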
This controller should only be re-queueing the key, not the old version of the object. So if this were true, then the next controller sync would have the newer RC? But it is totally possible for us to edit an RC and then not see it in the queue in time (for us to go ahead and process an older version of the object). That's what expectations are for.
It would be cleared by the next sync, and it's only going to be the key name, so at worst it's a few hundred K of strings you already have.
Use generation, but yes, it has that guarantee.
This is not possible if the events are both on RCs, or both on pods. We don't queue object versions, we queue keys. When the time comes to process the key, we can only ever get the latest version from the cache. There may be a later version out on the cluster (that we haven't observed), but it won't be in our cache. If you are dealing across multiple resources, then generally we would try to avoid the race by being level driven - only making decisions that are never undone. In the example you gave, the deployer in step 1 should have set an expectation (before creating the pod) that a pod was created for a given RC. It should not have been cleared until step 5 completes successfully. Given that, one instance of a controller would not have a race (although in this example it's possible two could, though I'd have to think about it more).
We can't use generation for that because it is incremented only on important changes to spec, whereas for RCs the phase is stored in annotations.
RV is fine too, many other controllers use it as a level check.
Seems like it would have to rely on resourceVersion.
@tnozicka: Your pull request title starts with "WIP", so the do-not-merge/work-in-progress label will be added. This label will ensure that your pull request will not be merged. Remove the prefix from your pull request title to trigger the removal of the label and allow for your pull request to be merged. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed 7683ce5 to 751301c
And yes, we queue keys in this controller (not sure why I was thinking otherwise). In the end I have decided not to go with RV, because in case of a failure to update the RC it wouldn't help us here, although it would filter events that have already been processed and acted upon - but that's merely a performance improvement. W.r.t. my previous post: I am not entirely sure what the sequence of events is for the case when this is caused entirely by DeploymentConfig and Deployer controller actions, although I have some suspicions. But here is how it breaks in case the informer caches are not synced yet:
I need to check on Monday the use of expectations in all possible paths, but it seems like a way forward. A more reliable one. (thx @smarterclayton for suggesting that.) Although on a funny note, we are solving cache issues with yet another cache :)
/retest
@@ -223,6 +240,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusFailed:
	c.expectations.ExpectDeletions(key, 0)
Shouldn't we expect a deletion here only if the deployment is cancelled? Also does this take into account hook pods?
The naming of that function is unfortunate. What it means here is that we have observed a final state; decrement a counter and allow a new deployer to be created if needed - usually only on manual restart - but the intent is to prevent doing so in the interim.
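Paraphrasing how that gate ends up being used in this change (SatisfiedExpectations and ExpectDeletions are the upstream kcontroller ControllerExpectations methods; the flow below is a sketch, not the exact diff):

switch currentStatus {
case deployapi.DeploymentStatusNew:
	// Only create a deployer pod when no unsatisfied expectation is pending for this key.
	if !c.expectations.SatisfiedExpectations(key) {
		break
	}
	// ... create the deployer pod and record an expectation for it ...
case deployapi.DeploymentStatusFailed, deployapi.DeploymentStatusComplete:
	// A terminal state was observed: reset expectations so that a future
	// retry/redeploy is allowed to create a deployer pod again.
	c.expectations.ExpectDeletions(key, 0)
}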
/retest
Force-pushed 8dba656 to 428d89f
…won't start deployer pod more than once.
Force-pushed 428d89f to 5da6d84
@mfojtik rebased
/retest
/lint
@Kargakis: 1 warning.
In response to this:
/lint
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
key, err := kcontroller.KeyFunc(rc)
if err != nil {
	utilruntime.HandleError(fmt.Errorf("Couldn't get key for object %#v: %v", rc, err))
	return key, fmt.Errorf("Couldn't get key for object %#v: %v", rc, err)
Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
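For reference, a golint-compliant variant of the snippet above might look like this (a sketch, not necessarily the final code in the PR):

key, err := kcontroller.KeyFunc(rc)
if err != nil {
	// golint: error strings start lowercase and carry no trailing punctuation.
	err = fmt.Errorf("couldn't get key for object %#v: %v", rc, err)
	utilruntime.HandleError(err)
	return "", err
}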
// In case the caches wouldn't be synced yet and we would receive the same RC with state New
// and the deployer pod would have been deleted (by user or otherwise) we would have recreated it again.
// Also a newer deployment might be already running so we would have 2 active deployer pods.
if c.expectations.SatisfiedExpectations(key) {
Can you remove the extra indentation by doing something like
if !c.expectations.SatisfiedExpectations(key) {
	break
}
@@ -173,7 +189,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
// to ensure that changes to 'unrelated' pods don't result in updates to
// the deployment. So, the image check will have to be done in other areas
// of the code as well.
if deployutil.DeploymentNameFor(deployer) != deployment.Name {
controllerRef := kcontroller.GetControllerOf(deployer)
if deployutil.DeploymentNameFor(deployer) != deployment.Name || (controllerRef != nil && controllerRef.UID != deployment.UID) {
Seems like this check could be its own function. Also can you add a unit test that exercises both paths?
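A sketch of what such a helper might look like (the name and exact shape are assumptions, not what the PR necessarily ends up with):

// deployerOwnedByDeployment reports whether the deployer pod belongs to the
// given deployment RC: the deployment-name annotation must match and, when an
// owner reference is set, its UID must match the RC's UID.
func deployerOwnedByDeployment(deployer *v1.Pod, deployment *v1.ReplicationController) bool {
	if deployutil.DeploymentNameFor(deployer) != deployment.Name {
		return false
	}
	controllerRef := kcontroller.GetControllerOf(deployer)
	return controllerRef == nil || controllerRef.UID == deployment.UID
}

The condition in the diff would then read if !deployerOwnedByDeployment(deployer, deployment), which is also easy to table-test for both the name and the UID path.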
@@ -315,8 +336,12 @@ func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.Replicatio
	updatedAnnotations[deployapi.DeployerPodCompletedAtAnnotation] = completedTimestamp.String()
}
return deployapi.DeploymentStatusFailed

case v1.PodUnknown:
Seems redundant. Just use the default case.
unless you want to log this as weird
I wanted to explicitly state that we know about this case, so no one reading the code would doubt whether we account for it, because that enum might be extended in the future.
funny fact: PodUnknown is never actually set in k8s code, so I guess this serves as a future placeholder :-)
future generations will face pods in unknown state (quantum containers?)
@Kargakis @mfojtik here are your quantum pods :) https://bugzilla.redhat.com/show_bug.cgi?id=1539143#c0
@@ -282,7 +303,7 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
	return nil
}

func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
func (c *DeploymentController) nextStatus(current deployapi.DeploymentStatus, pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
Can't you infer current from deployment?
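Presumably by reading the status annotation off the RC at the top of nextStatus, something like the line below (deployutil.DeploymentStatusFor is assumed to be the existing annotation helper in this package):

// Sketch: derive the current status instead of passing it in.
current := deployutil.DeploymentStatusFor(deployment)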
@@ -97,6 +100,12 @@ type DeploymentController struct {
// to a terminal deployment status. Since this controller started using caches,
// the provided rc MUST be deep-copied beforehand (see work() in factory.go).
func (c *DeploymentController) handle(deployment *v1.ReplicationController, willBeDropped bool) error {
	key, err := getKeyForReplicationController(deployment)
why are we doing this at the top and not right before we need it?
@@ -246,6 +265,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusComplete:
	c.expectations.ExpectDeletions(key, 0)
nit: ExpectDeletionCountForKey() or ExpectNoDeletions(key) for future generations.
@@ -124,6 +126,15 @@ func (c *DeploymentController) updateReplicationController(old, cur interface{})
	c.enqueueReplicationController(curRC)
}

func (c *DeploymentController) addPod(obj interface{}) {
	pod := obj.(*v1.Pod)
pod, ok :=
if !ok {}
I would call panic in the if branch either way; the unchecked assertion calls it as well.
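For reference, the checked-assertion variant being suggested might look like this (a sketch; whether to log or to panic on the unexpected type is exactly the point under discussion):

func (c *DeploymentController) addPod(obj interface{}) {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		// The informer handed us an unexpected type; report it instead of panicking.
		utilruntime.HandleError(fmt.Errorf("expected *v1.Pod, got %#v", obj))
		return
	}
	rc, err := c.rcForDeployerPod(pod)
	if err == nil && rc != nil {
		c.enqueueReplicationController(rc)
	}
}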
pod := obj.(*v1.Pod)

rc, err := c.rcForDeployerPod(pod)
if err == nil && rc != nil {
I would return an error in case there is no deployer pod for the rc
@stevekuznetsov this is planned for 3.8 work as it introduces significant changes, so it is paused for a while. It is still probably not the cause of seeing 2 deployer pods in the real world, as one cause was 2 triggers racing (solved now) and the other is w.r.t. cancellations causing the same effect.
@tnozicka PR needs rebase
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen.
/close
For the record: as it turned out, the root cause of having more than one deployer pod created was informers going back in time because of a broken watch cache, not this race (although the race is there). This will be fixed with the new deployer controller rewrite.
The deployer controller can create the same deployer pod more than once, even while a newer deployment is running. It has already happened at least twice in our tests here: #16003
There it created a deployer pod, succeeded, deployed one more deployment, and while deploying the third one it recreated the deployer pod for the first one and ran it again, in parallel with the latest one.
Two deployer invariants were broken by it: a deployer pod was created more than once for the same deployment, and a deployer for an older deployment ran while a newer deployment was in progress.
I couldn't find out what went wrong from the logs, but my guess is that between creating the deployer pod and updating the phase afterwards, there are intermediate states that get queued up where the deployment is still in phase New but the deployer pod is running. Assuming their processing can be delayed arbitrarily and we manage to delete the deployer pod in the meantime, for at least one of those states the phase is New and the deployer is missing, so it gets created (again).
Fixes: #16003