
Make sure no more than one deployer pod is created #16129

Closed
wants to merge 1 commit

Conversation

tnozicka
Contributor

@tnozicka tnozicka commented Sep 4, 2017

The deployer controller can create the same deployer pod more than once, even while a newer deployment is running. It has already happened at least twice in our tests here: #16003

There it created a deployer pod, succeeded, deployed one more deployment, and while deploying the third one it recreated the deployer pod for the first one and ran it again, in parallel with the latest one.

It broke two deployer invariants:

  • it shouldn't run a deployer pod more than once
  • there can be only one active deployment

I couldn't tell from the logs exactly what went wrong, but my guess is that between creating the deployer pod and updating the phase afterwards, intermediate states get queued up in which the deployment is still in phase New but the deployer pod is already running. Since their processing can be delayed arbitrarily, if the deployer pod gets deleted in the meantime, at least one of those states is processed while the phase is New and the deployer is missing, so the controller creates it again.
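
Roughly what the reordering looks like (a sketch only; helper and field names are illustrative, not the actual controller code):

// Sketch: move the RC to Pending *before* creating the deployer pod. If the
// phase update fails (e.g. with a 409), nothing has been created yet and the
// sync can simply be retried, so the deployer pod can never be created twice.
func (c *DeploymentController) startDeployment(deployment *v1.ReplicationController) error {
    updated := deployment.DeepCopy()
    updated.Annotations[deployapi.DeploymentStatusAnnotation] = string(deployapi.DeploymentStatusPending)
    updated, err := c.rn.ReplicationControllers(updated.Namespace).Update(updated)
    if err != nil {
        return err // re-queued and retried; no pod exists yet
    }
    // Trade-off: if this Create fails we are left in Pending without a pod
    // and need a separate recovery path.
    deployerPod, err := c.makeDeployerPod(updated) // illustrative helper
    if err != nil {
        return err
    }
    _, err = c.pn.Pods(updated.Namespace).Create(deployerPod)
    return err
}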

Fixes: #16003

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2017
@openshift-merge-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tnozicka
We suggest the following additional approver: smarterclayton

Assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@tnozicka tnozicka assigned mfojtik and unassigned soltysh Sep 4, 2017
// We need to transition the deployment to "pending" to make sure we can't ever create the deployer pod twice.
// If we failed to update the phase afterwards and the deployer pod got deleted in the meantime,
// we would have recreated it here.
// The disadvantage is that if the Create call fails we will never create the deployer pod.
Contributor

It's interesting that we have an old controller loop observing that it needs to create a pod. It's a retry? Or something else?

That doesn't mean this might not be the right choice, but I'm concerned about retry failing. The controller itself seems like it has a concurrency bug that should be fixed, rather than changing how this code runs.

I.e.:

  1. Observe need for new deployment.
  2. Observe running rcs - after this point, the controller should never try to start an older rc than the one it sees
  3. Cancel older deployment pods
  4. Create deployment pod for most recent, but only if 2 still holds

Can we correct 4 to ensure the invariant from 2 is respected inside the controller? This bug implies 4 is proceeding even though we started back at 1.

Contributor Author

@smarterclayton I don't think that can be done. The controller processes the queue in parallel for different keys (RCs). The issue is that you can't do an atomic operation on more than one object.

See my comment in the main section. (I didn't want it to get lost/hidden here in case of a rebase.)

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

@smarterclayton Having to prove to you that the current design was racy might actually have helped me discover the reason why this happened.

Let's assess the current deployer design:
Because it creates the deployer first and only then updates the phase (which is part of the state machine), it creates an "invalid" intermediate state. That makes it fully dependent on the assumption that the controller will process events in the order they occurred, which is not true!

Here is how it breaks:

  1. It creates the deployer pod and then fails to update the phase. Sep 01 07:02:04 ci-prtest-5a37c28-7117-ig-m-sqk1 origin-master-controllers[9828]: I0901 07:02:04.530008 9828 deploymentconfig_controller.go:460] Error syncing deployment config extended-test-cli-deployment-jzj4g-fnb05/history-limit: Operation cannot be fulfilled on deploymentconfigs "history-limit": the object has been modified; please apply your changes to the latest version and try again
    The conflicting update probably comes from some other controller, e.g. updating state, resolving images, ... That might be why it only started occurring in our tests now, although it was probably racy from the start.
  2. It calls c.queue.AddRateLimited(key) https://github.com/tnozicka/origin/blob/4d3ae499e7e0206c79db93333a6d7aa4852ded52/pkg/deploy/controller/deployer/deployer_controller.go#L497 which causes the key to be re-queued, but only after the rate-limiting delay, so the controller is free to proceed in the meantime.
  3. It recovers (updates the phase) and completes the deployment.
  4. Now it deletes that deployer pod since the deployment is in the Complete phase.
  5. And here comes the rate-limited event, which gets added back to the queue. It is old, the phase is still New, and the deployer is missing => it creates the deployer pod again.

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

And after step 3 it can even successfully deploy more deployments. In the failed test it managed to deploy one successfully, and while deploying the next one it failed the test check for multiple running deployers.

I guess we could hack around it and do a live GET to check the deployment's resourceVersion before creating a new deployer, although that might not be the best course of action.

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

cc @Kargakis

@smarterclayton
Contributor

For your example, is an older version of the deployment in the cache (when the key comes out of the rate limited queue)? So even though we actually updated the phase in etcd, the latest event has not yet been delivered to our cache? If so, one way to manage that is through an "expectation" - marking that deployment as "up to date" with the newly updated RV or generation, so when we see the older event come into our queue we know to ignore it.

@tnozicka
Contributor Author

tnozicka commented Sep 5, 2017

For your example, is an older version of the deployment in the cache (when the key comes out of the rate limited queue)?

Yes, the newer RC events are already processed, and after that the old event is added back to the queue from the waiting queue once the rate-limiting period passes.

So even though we actually updated the phase in etcd, the latest event has not yet been delivered to our cache?

The latest event from that update has been delivered to our cache. The issue is that the old one got re-queued after the latest event had already been processed, and since it's older it contains an outdated RC.

If so, one way to manage that is through an "expectation" - marking that deployment as "up to date" with the newly updated RV or generation, so when we see the older event come into our queue we know to ignore it.

Yes, we could write a new ControllerExpectations to track processed objects in cache and disregard events with older resourceVersion. But:

I guess we shouldn't re-queue on an Update conflict anyway, since a 409 means there is already a newer RC event in our queue, so why re-enqueue the old one? Such events may actually fill up the queue, because the 409 will keep repeating and re-queueing over and over until maxRequeueLimit is reached, unless they manage to fail on something else in between, like creating a deployer pod that already exists.
But if there is any other error we have to re-enqueue anyway...

But that only addresses the specific case that broke our tests. The current approach still seems racy to me. A new example:

  1. The deployer controller processes event #0 and creates the deployer pod but doesn't update the RC yet.
  2. The deployer pod runs...
  3. (Some controller or user modifies the RC - queues up event #1, phase=New)
  4. The user deletes the deployer pod - queues up event #2, phase=New
  5. The deployer controller catches up and tries to update the RC from phase New to Pending but fails with a 409 (or, generally, even without that, for other reasons)
  6. The deployer controller processes event #1 - phase=New and the deployer pod is missing => it creates a deployer pod

To ensure that there is no more than one deployer pod, it would be sufficient to do a live GET and make sure the deployment is still in the same phase.

To avoid this race of creating the deployer pod more than once entirely, we need to reorder the logic.
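
For the 409 part, skipping the re-queue could look roughly like this (a sketch assuming a handleErr-style helper around the work queue; names like maxRequeueLimit are taken from the discussion, not verbatim code):

// kapierrors = "k8s.io/apimachinery/pkg/api/errors"
func (c *DeploymentController) handleErr(err error, key interface{}) {
    if err == nil {
        c.queue.Forget(key)
        return
    }
    // A conflict (409) means a newer version of the RC exists and its event is
    // already in (or headed for) the queue, so re-queueing this stale key only
    // risks acting on an outdated cache view later.
    if kapierrors.IsConflict(err) {
        c.queue.Forget(key)
        return
    }
    if c.queue.NumRequeues(key) < maxRequeueLimit {
        c.queue.AddRateLimited(key)
        return
    }
    utilruntime.HandleError(err)
    c.queue.Forget(key)
}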

@smarterclayton
Contributor

The issue is that the old one got re-queued after that latest event has been already processed and since it's older it contains outdated RC.

This controller should only be re-queueing the key, not the old version of the object. So if that were true, the next controller pass would have the newer RC? But it is totally possible for us to edit an RC and then not see it in the queue in time (for us to go ahead and process an older version of the object). That's what expectations are for.

@smarterclayton
Contributor

It has to be TTL based to avoid accumulating memory over time and holding RV for every RC in the cluster in memory

It would be cleared by the next sync, and it's only going to be the key name, so at worst it's a few hundred k of strings you already have.

@smarterclayton
Contributor

I couldn't find a guarantee that resourceVersion will be strictly monotonically increasing, although the current etcd implementation does say "monotonically-incrementing", so theoretically that might change, although if it did, resourceVersion would be really confusing without such a guarantee.

Use generation, but yes, it has that guarantee.

(Some controller or user modifies RC - queues up event #1 phase=New)
User deletes deployer pod - queues up event #2 phase=New

This is not possible if the events are both on RCs, or both on pods. We don't queue object versions, we queue keys. When the time comes to process the key, we can only ever get the latest version from the cache. There may be a latest version out on the cluster (that we haven't observed), but it won't be in our code. If you are dealing across multiple resources, then generally we would try to avoid the race by being level driven - only making decisions that are never undone.

In the example you gave, the deployer in step 1 should have set an expectation (before creating the pod) that a pod was created for a given RC. It should not have been cleared until step 5 completes successfully. Given that, one instance of a controller would not have a race (although in this example it's possible two could; I'd have to think about it more).
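
Roughly, with the upstream expectations helper (a sketch; field and variable names, including deployerPod, are illustrative, not the exact deployer controller code):

// Record an expectation before creating the deployer pod and clear it only
// once the result has been observed, so a stale cache view of the RC cannot
// trigger a second create.
key, err := kcontroller.KeyFunc(deployment)
if err != nil {
    return err
}
if !c.expectations.SatisfiedExpectations(key) {
    // A previous sync already created a pod for this key and we haven't
    // observed the outcome yet; do nothing for now.
    return nil
}
if err := c.expectations.ExpectCreations(key, 1); err != nil {
    return err
}
if _, err := c.pn.Pods(deployment.Namespace).Create(deployerPod); err != nil {
    // Nothing was created; lower the expectation so a later sync can retry.
    c.expectations.CreationObserved(key)
    return err
}
// Per the suggestion above, the expectation is not cleared until the phase
// update (step 5) is observed to have succeeded.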

@tnozicka
Contributor Author

tnozicka commented Sep 6, 2017

Use generation, but yes, it has that guarantee.

We can't use generation for that because it is incremented only on meaningful changes to spec, whereas for RCs the phase is stored in annotations.
It seems it would have to rely on resourceVersion.

@smarterclayton
Contributor

smarterclayton commented Sep 7, 2017 via email

@stevekuznetsov stevekuznetsov changed the title to WIP - Make sure no more than one deployer pod is created Sep 7, 2017
@openshift-ci-robot

@tnozicka: Your pull request title starts with "WIP", so the do-not-merge/work-in-progress label will be added.

This label will ensure that your pull request will not be merged. Remove the prefix from your pull request title to trigger the removal of the label and allow for your pull request to be merged.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2017
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2017
@tnozicka
Contributor Author

tnozicka commented Sep 8, 2017

And yes, we queue keys in this controller (not sure why I was thinking otherwise).

In the end I decided not to go with RV, because it wouldn't help us when the RC update fails; it would only filter out events that have already been processed and acted upon, which is merely a performance improvement.

W.r.t. my previous post: I am not entirely sure what the sequence of events is when this is caused entirely by DeploymentConfig and deployer controller actions, although I have some suspicions. But here is how it breaks in case the informer caches are not synced yet:

  1. The deployer controller creates the deployer pod.
    1a. Updating the phase to Pending fails (e.g. a conflict or other error).
    1b. Caches are not synced yet, and some deployment has been re-queued and picked up the RC from the unsynced cache (still in phase New).
  2. The user deletes the deployer pod that has been running (say 50% through, or just already running).
  3. The deployer controller processes such a deployment in phase New; the deployer pod is missing, so it creates a new one.

I need to check the use of expectations in all possible paths on Monday, but it seems like a way forward, and a more reliable one. (Thanks @smarterclayton for suggesting that.)

Although on a funny note, we are solving cache issues with yet another cache :)
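
Piecing together the hunks quoted below, the expectations guard ends up looking roughly like this (simplified sketch, not the full switch):

case deployapi.DeploymentStatusNew:
    // Don't create a deployer pod while an earlier create/delete for this key
    // hasn't been observed yet; wait for the caches to catch up instead.
    if !c.expectations.SatisfiedExpectations(key) {
        break
    }
    // ... create the deployer pod ...

case deployapi.DeploymentStatusComplete, deployapi.DeploymentStatusFailed:
    // A terminal state has been observed; reset the expectation so a new
    // deployer pod may be created again if needed (e.g. after a manual retry).
    c.expectations.ExpectDeletions(key, 0)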

@tnozicka
Contributor Author

tnozicka commented Sep 8, 2017

/retest

@@ -223,6 +240,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusFailed:
c.expectations.ExpectDeletions(key, 0)
Contributor

Shouldn't we expect a deletion here only if the deployment is cancelled? Also does this take into account hook pods?

Contributor Author

@tnozicka tnozicka Sep 11, 2017

The naming of that function is unfortunate. What it means here is that we have observed a final state; decrement a counter and allow a new deployer to be created if needed (usually only on a manual restart), but the intent is to prevent doing so in the interim.

@tnozicka
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2017
@tnozicka
Contributor Author

tnozicka commented Oct 6, 2017

@mfojtik rebased

@tnozicka
Contributor Author

tnozicka commented Oct 6, 2017

/retest

@0xmichalis
Contributor

/lint

@openshift-ci-robot openshift-ci-robot left a comment

@Kargakis: 1 warning.

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

key, err := kcontroller.KeyFunc(rc)
if err != nil {
utilruntime.HandleError(fmt.Errorf("Couldn't get key for object %#v: %v", rc, err))
return key, fmt.Errorf("Couldn't get key for object %#v: %v", rc, err)


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
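
(The lint-compliant form would just lowercase the messages, e.g.:)

utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rc, err))
return key, fmt.Errorf("couldn't get key for object %#v: %v", rc, err)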

// In case the caches weren't synced yet and we received the same RC with state New,
// and the deployer pod had been deleted (by the user or otherwise), we would have recreated it again.
// Also, a newer deployment might already be running, so we would have 2 active deployer pods.
if c.expectations.SatisfiedExpectations(key) {
Contributor

Can you remove the extra indentation by doing something like

if !c.expectations.SatisfiedExpectations(key) {
   break
}

@@ -173,7 +189,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
// to ensure that changes to 'unrelated' pods don't result in updates to
// the deployment. So, the image check will have to be done in other areas
// of the code as well.
if deployutil.DeploymentNameFor(deployer) != deployment.Name {
controllerRef := kcontroller.GetControllerOf(deployer)
if deployutil.DeploymentNameFor(deployer) != deployment.Name || (controllerRef != nil && controllerRef.UID != deployment.UID) {
Contributor

Seems like this check could be its own function. Also can you add a unit test that exercises both paths?
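
For example, something along these lines (hypothetical name), which would also be easy to cover with a unit test for both the name-mismatch and owner-mismatch paths:

// deployerPodBelongsTo reports whether the deployer pod belongs to the given
// deployment: the deployment-name annotation must match and, if an owner
// reference is set, it must point at the deployment's UID.
func deployerPodBelongsTo(deployer *v1.Pod, deployment *v1.ReplicationController) bool {
    if deployutil.DeploymentNameFor(deployer) != deployment.Name {
        return false
    }
    ref := kcontroller.GetControllerOf(deployer)
    return ref == nil || ref.UID == deployment.UID
}

The condition in handle() then becomes if !deployerPodBelongsTo(deployer, deployment).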

@@ -315,8 +336,12 @@ func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.Replicatio
updatedAnnotations[deployapi.DeployerPodCompletedAtAnnotation] = completedTimestamp.String()
}
return deployapi.DeploymentStatusFailed

case v1.PodUnknown:
Contributor

Seems redundant. Just use the default case.

Contributor

unless you want to log this as weird

Contributor Author

I wanted to explicitly state that we know about this case, so no one reading the code would wonder whether we account for it, because that enum might be extended in the future.

Contributor

Fun fact: PodUnknown is never actually set in k8s code, so I guess this serves as a future placeholder :-)

Contributor

So to get this straight, this code serves as a future placeholder for changes in the future.

Contributor

future generations will face pods in unknown state (quantum containers?)


@@ -282,7 +303,7 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
return nil
}

func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
func (c *DeploymentController) nextStatus(current deployapi.DeploymentStatus, pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
Contributor

@0xmichalis 0xmichalis Oct 9, 2017

Can't you infer current from deployment?
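
E.g., assuming the status annotation on the RC is still authoritative at this point and a helper like deployutil.DeploymentStatusFor (which reads that annotation) is available:

// Derive the current phase from the RC itself instead of threading it
// through as an extra parameter.
current := deployutil.DeploymentStatusFor(deployment)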

@@ -97,6 +100,12 @@ type DeploymentController struct {
// to a terminal deployment status. Since this controller started using caches,
// the provided rc MUST be deep-copied beforehand (see work() in factory.go).
func (c *DeploymentController) handle(deployment *v1.ReplicationController, willBeDropped bool) error {
key, err := getKeyForReplicationController(deployment)
Contributor

Why are we doing this at the top and not right before we need it?

@@ -246,6 +265,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusComplete:
c.expectations.ExpectDeletions(key, 0)
Contributor

nit: ExpectDeletionCountForKey() or ExpectNoDeletions(key) for future generations.

@@ -124,6 +126,15 @@ func (c *DeploymentController) updateReplicationController(old, cur interface{})
c.enqueueReplicationController(curRC)
}

func (c *DeploymentController) addPod(obj interface{}) {
pod := obj.(*v1.Pod)
Contributor

pod, ok :=
if !ok {}

Contributor Author

I would call panic in the if branch either way; the bare type assertion panics as well.
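
For completeness, the checked form suggested above would look roughly like this (the bare assertion panics on a bad type, which is the behaviour being defended here):

func (c *DeploymentController) addPod(obj interface{}) {
    pod, ok := obj.(*v1.Pod)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("couldn't convert object %#v to *v1.Pod", obj))
        return
    }
    rc, err := c.rcForDeployerPod(pod)
    if err == nil && rc != nil {
        c.enqueueReplicationController(rc)
    }
}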

pod := obj.(*v1.Pod)

rc, err := c.rcForDeployerPod(pod)
if err == nil && rc != nil {
Contributor

I would return an error in case there is no RC for the deployer pod.

@tnozicka tnozicka added this to the 3.8.0 milestone Oct 9, 2017
@stevekuznetsov
Contributor

ping @tnozicka @Kargakis @mfojtik

@tnozicka
Contributor Author

@stevekuznetsov this is planned for 3.8 as it introduces significant changes, so it's paused for a while.

It is still probably not the cause of seeing 2 deployer pods in the real world, as one case was 2 triggers racing (solved now) and the other involves cancellations causing the same effect.

@openshift-merge-robot
Contributor

@tnozicka PR needs rebase

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 20, 2017
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@tnozicka tnozicka deleted the fix-deployer-invariants branch July 2, 2018 07:32
@tnozicka
Contributor Author

tnozicka commented Jul 2, 2018

For the record: as it turned out, the root cause of having more than one deployer pod created was informers going back in time because of a broken watch cache, not this race (although the race is there). This will be fixed with the new deployer controller rewrite.

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.