
Make sure no more than one deployer pod is created #16129

Closed
wants to merge 1 commit

Conversation

tnozicka
Contributor

@tnozicka tnozicka commented Sep 4, 2017

The deployer controller can create the same deployer pod more than once, even while a newer deployment is running. It has already happened at least twice in our tests here: #16003

There it created a deployer pod, succeeded, deployed one more deployment, and while deploying the third one it recreated the deployer pod for the first one and ran it again, in parallel with the latest one.

It broke two deployer invariants:

  • it shouldn't run a deployer pod more than once
  • there can be only one active deployment

I couldn't tell from the logs exactly what went wrong, but my guess is that between creating the deployer pod and updating the phase afterwards, intermediate states get queued up in which the deployment is still in phase New but the deployer pod is already running. Since their processing can be delayed arbitrarily, if the deployer pod gets deleted in the meantime, at least one of those states is processed while the phase is New and the deployer is missing, so the controller creates it again.
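
Roughly what the reordering looks like (a sketch only; helper and field names are illustrative, not the actual controller code):

// Sketch: move the RC to Pending *before* creating the deployer pod. If the
// phase update fails (e.g. with a 409), nothing has been created yet and the
// sync can simply be retried, so the deployer pod can never be created twice.
func (c *DeploymentController) startDeployment(deployment *v1.ReplicationController) error {
    updated := deployment.DeepCopy()
    updated.Annotations[deployapi.DeploymentStatusAnnotation] = string(deployapi.DeploymentStatusPending)
    updated, err := c.rn.ReplicationControllers(updated.Namespace).Update(updated)
    if err != nil {
        return err // re-queued and retried; no pod exists yet
    }
    // Trade-off: if this Create fails we are left in Pending without a pod
    // and need a separate recovery path.
    deployerPod, err := c.makeDeployerPod(updated) // illustrative helper
    if err != nil {
        return err
    }
    _, err = c.pn.Pods(updated.Namespace).Create(deployerPod)
    return err
}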

Fixes: #16003

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 4, 2017
@openshift-merge-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tnozicka
We suggest the following additional approver: smarterclayton

Assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@tnozicka tnozicka assigned mfojtik and unassigned soltysh Sep 4, 2017
// We need to transition the deployment to "pending" to make sure we can't ever create the deployer pod twice.
// If we failed to update the phase afterwards and the deployer pod got deleted in the meantime,
// we would have recreated it here.
// The disadvantage is that if the Create call fails we will never create the deployer pod.
Contributor

It's interesting that we have an old controller loop observing that it needs to create a pod. It's a retry? Or something else?

That doesn't mean this might not be the right choice, but I'm concerned about retry failing. The controller itself seems like it has a concurrency bug that should be fixed, rather than changing how this code runs.

I.e.:

  1. Observe need for new deployment.
  2. Observe running rcs - after this point, the controller should never try to start an older rc than the one it sees
  3. Cancel older deployment pods
  4. Create deployment pod for most recent, but only if 2 still holds

Can we correct 4 to ensure the invariant from 2 is respected inside the controller? This bug implies 4 is proceeding even though we started back at 1.

Contributor Author

@smarterclayton I don't think that can be done. The controller processes the queue in parallel for different keys (RCs). The issue is that you can't do an atomic operation on more than one object.

See my comment in the main section. (I didn't want it to get lost/hidden here in case of a rebase.)

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

@smarterclayton Having to prove to you that the current design was racy might actually have helped me discover the reason why this happened.

Let's assess the current deployer design:
Because it creates the deployer first and only then updates the phase (which is part of the state machine), it creates an "invalid" intermediate state. That makes it fully dependent on the assumption that the controller will process events in the order they occurred, which is not true!

Here is how it breaks:

  1. It creates the deployer pod and then fails to update the phase. Sep 01 07:02:04 ci-prtest-5a37c28-7117-ig-m-sqk1 origin-master-controllers[9828]: I0901 07:02:04.530008 9828 deploymentconfig_controller.go:460] Error syncing deployment config extended-test-cli-deployment-jzj4g-fnb05/history-limit: Operation cannot be fulfilled on deploymentconfigs "history-limit": the object has been modified; please apply your changes to the latest version and try again
    The conflicting update probably comes from some other controller, e.g. updating state, resolving images, ... That might be why it only started occurring in our tests now, although it was probably racy from the start.
  2. It calls c.queue.AddRateLimited(key) https://github.com/tnozicka/origin/blob/4d3ae499e7e0206c79db93333a6d7aa4852ded52/pkg/deploy/controller/deployer/deployer_controller.go#L497 which causes the key to be re-queued, but only after the rate-limiting delay, so the controller is free to proceed in the meantime.
  3. It recovers (updates the phase) and completes the deployment.
  4. Now it deletes that deployer pod since the deployment is in the Complete phase.
  5. And here comes the rate-limited event, which gets added back to the queue. It is old, the phase is still New, and the deployer is missing => it creates the deployer pod again.

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

And after step 3 it can even successfully deploy more deployments. In the failed test it managed to deploy one successfully, and while deploying the next one it failed the test check for multiple running deployers.

I guess we could hack around it and do a live GET to check the deployment's resourceVersion before creating a new deployer, although that might not be the best course of action.

@tnozicka
Contributor Author

tnozicka commented Sep 4, 2017

cc @Kargakis

@smarterclayton
Contributor

For your example, is an older version of the deployment in the cache (when the key comes out of the rate limited queue)? So even though we actually updated the phase in etcd, the latest event has not yet been delivered to our cache? If so, one way to manage that is through an "expectation" - marking that deployment as "up to date" with the newly updated RV or generation, so when we see the older event come into our queue we know to ignore it.

@tnozicka
Contributor Author

tnozicka commented Sep 5, 2017

For your example, is an older version of the deployment in the cache (when the key comes out of the rate limited queue)?

Yes, the newer RC events are already processed, and after that the old event is added back to the queue from the waiting queue once the rate-limiting period passes.

So even though we actually updated the phase in etcd, the latest event has not yet been delivered to our cache?

The latest event from that update has been delivered to our cache. The issue is that the old one got re-queued after the latest event had already been processed, and since it's older it contains an outdated RC.

If so, one way to manage that is through an "expectation" - marking that deployment as "up to date" with the newly updated RV or generation, so when we see the older event come into our queue we know to ignore it.

Yes, we could write a new ControllerExpectations to track processed objects in cache and disregard events with older resourceVersion. But:

I guess we shouldn't re-queue on an Update conflict anyway, since a 409 means there is already a newer RC event in our queue, so why re-enqueue the old one? Such events may actually fill up the queue, because the 409 will keep repeating and re-queueing over and over until maxRequeueLimit is reached, unless they manage to fail on something else in between, like creating a deployer pod that already exists.
But if there is any other error we have to re-enqueue anyway...

But that only addresses the specific case that broke our tests. The current approach still seems racy to me. A new example:

  1. The deployer controller processes event #0 and creates the deployer pod but doesn't update the RC yet.
  2. The deployer pod runs...
  3. (Some controller or user modifies the RC - queues up event #1, phase=New)
  4. The user deletes the deployer pod - queues up event #2, phase=New
  5. The deployer controller catches up and tries to update the RC from phase New to Pending but fails with a 409 (or, generally, even without that, for other reasons)
  6. The deployer controller processes event #1 - phase=New and the deployer pod is missing => it creates a deployer pod

To ensure that there is no more than one deployer pod, it would be sufficient to do a live GET and make sure the deployment is still in the same phase.

To avoid this race of creating the deployer pod more than once entirely, we need to reorder the logic.
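
For the 409 part, skipping the re-queue could look roughly like this (a sketch assuming a handleErr-style helper around the work queue; names like maxRequeueLimit are taken from the discussion, not verbatim code):

// kapierrors = "k8s.io/apimachinery/pkg/api/errors"
func (c *DeploymentController) handleErr(err error, key interface{}) {
    if err == nil {
        c.queue.Forget(key)
        return
    }
    // A conflict (409) means a newer version of the RC exists and its event is
    // already in (or headed for) the queue, so re-queueing this stale key only
    // risks acting on an outdated cache view later.
    if kapierrors.IsConflict(err) {
        c.queue.Forget(key)
        return
    }
    if c.queue.NumRequeues(key) < maxRequeueLimit {
        c.queue.AddRateLimited(key)
        return
    }
    utilruntime.HandleError(err)
    c.queue.Forget(key)
}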

@smarterclayton
Contributor

The issue is that the old one got re-queued after that latest event has been already processed and since it's older it contains outdated RC.

This controller should only be re-queueing the key, not the old version of the object. So if that were true, the next controller pass would have the newer RC? But it is totally possible for us to edit an RC and then not see it in the queue in time (for us to go ahead and process an older version of the object). That's what expectations are for.

@smarterclayton
Contributor

It has to be TTL based to avoid accumulating memory over time and holding RV for every RC in the cluster in memory

It would be cleared by the next sync, and it's only going to be the key name, so at worst it's a few hundred k of strings you already have.

@smarterclayton
Contributor

I couldn't find a guarantee that resourceVersion will be strictly monotonically increasing, although the current etcd implementation does say "monotonically-incrementing", so theoretically that might change, although if it did, resourceVersion would be really confusing without such a guarantee.

Use generation, but yes, it has that guarantee.

(Some controller or user modifies RC - queues up event #1 phase=New)
User deletes deployer pod - queues up event #2 phase=New

This is not possible if the events are both on RCs, or both on pods. We don't queue object versions, we queue keys. When the time comes to process the key, we can only ever get the latest version from the cache. There may be a latest version out on the cluster (that we haven't observed), but it won't be in our code. If you are dealing across multiple resources, then generally we would try to avoid the race by being level driven - only making decisions that are never undone.

In the example you gave, the deployer in step 1 should have set an expectation (before creating the pod) that a pod was created for a given RC. It should not have been cleared until step 5 completes successfully. Given that, one instance of a controller would not have a race (although in this example it's possible two could; I'd have to think about it more).
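
Roughly, with the upstream expectations helper (a sketch; field and variable names, including deployerPod, are illustrative, not the exact deployer controller code):

// Record an expectation before creating the deployer pod and clear it only
// once the result has been observed, so a stale cache view of the RC cannot
// trigger a second create.
key, err := kcontroller.KeyFunc(deployment)
if err != nil {
    return err
}
if !c.expectations.SatisfiedExpectations(key) {
    // A previous sync already created a pod for this key and we haven't
    // observed the outcome yet; do nothing for now.
    return nil
}
if err := c.expectations.ExpectCreations(key, 1); err != nil {
    return err
}
if _, err := c.pn.Pods(deployment.Namespace).Create(deployerPod); err != nil {
    // Nothing was created; lower the expectation so a later sync can retry.
    c.expectations.CreationObserved(key)
    return err
}
// Per the suggestion above, the expectation is not cleared until the phase
// update (step 5) is observed to have succeeded.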

@tnozicka
Contributor Author

tnozicka commented Sep 6, 2017

Use generation, but yes, it has that guarantee.

We can't use generation for that because it is incremented only on meaningful changes to spec, whereas for RCs the phase is stored in annotations.
It seems it would have to rely on resourceVersion.

@smarterclayton
Contributor

smarterclayton commented Sep 7, 2017 via email

@stevekuznetsov stevekuznetsov changed the title to WIP - Make sure no more than one deployer pod is created Sep 7, 2017
@openshift-ci-robot

@tnozicka: Your pull request title starts with "WIP", so the do-not-merge/work-in-progress label will be added.

This label will ensure that your pull request will not be merged. Remove the prefix from your pull request title to trigger the removal of the label and allow for your pull request to be merged.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2017
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2017
@tnozicka
Contributor Author

tnozicka commented Sep 8, 2017

And yes, we queue keys in this controller (not sure why I was thinking otherwise).

In the end I decided not to go with RV, because it wouldn't help us when the RC update fails; it would only filter out events that have already been processed and acted upon, which is merely a performance improvement.

W.r.t. my previous post: I am not entirely sure what the sequence of events is when this is caused entirely by DeploymentConfig and deployer controller actions, although I have some suspicions. But here is how it breaks in case the informer caches are not synced yet:

  1. The deployer controller creates the deployer pod.
    1a. Updating the phase to Pending fails (e.g. a conflict or other error).
    1b. Caches are not synced yet, and some deployment has been re-queued and picked up the RC from the unsynced cache (still in phase New).
  2. The user deletes the deployer pod that has been running (say 50% through, or just already running).
  3. The deployer controller processes such a deployment in phase New; the deployer pod is missing, so it creates a new one.

I need to check the use of expectations in all possible paths on Monday, but it seems like a way forward, and a more reliable one. (Thanks @smarterclayton for suggesting that.)

Although on a funny note, we are solving cache issues with yet another cache :)
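
Piecing together the hunks quoted below, the expectations guard ends up looking roughly like this (simplified sketch, not the full switch):

case deployapi.DeploymentStatusNew:
    // Don't create a deployer pod while an earlier create/delete for this key
    // hasn't been observed yet; wait for the caches to catch up instead.
    if !c.expectations.SatisfiedExpectations(key) {
        break
    }
    // ... create the deployer pod ...

case deployapi.DeploymentStatusComplete, deployapi.DeploymentStatusFailed:
    // A terminal state has been observed; reset the expectation so a new
    // deployer pod may be created again if needed (e.g. after a manual retry).
    c.expectations.ExpectDeletions(key, 0)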

@tnozicka
Contributor Author

tnozicka commented Sep 8, 2017

/retest

@@ -223,6 +240,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusFailed:
c.expectations.ExpectDeletions(key, 0)
Contributor

Shouldn't we expect a deletion here only if the deployment is cancelled? Also does this take into account hook pods?

Contributor Author

@tnozicka tnozicka Sep 11, 2017

The naming of that function is unfortunate. What it means here is that we have observed a final state; decrement a counter and allow a new deployer to be created if needed (usually only on a manual restart), but the intent is to prevent doing so in the interim.

@tnozicka
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2017
@tnozicka
Contributor Author

tnozicka commented Oct 6, 2017

@mfojtik rebased

@tnozicka
Contributor Author

tnozicka commented Oct 6, 2017

/retest

@0xmichalis
Contributor

/lint

@openshift-ci-robot openshift-ci-robot left a comment

@Kargakis: 1 warning.

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

key, err := kcontroller.KeyFunc(rc)
if err != nil {
utilruntime.HandleError(fmt.Errorf("Couldn't get key for object %#v: %v", rc, err))
return key, fmt.Errorf("Couldn't get key for object %#v: %v", rc, err)


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
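
(The lint-compliant form would just lowercase the messages, e.g.:)

utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rc, err))
return key, fmt.Errorf("couldn't get key for object %#v: %v", rc, err)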

// In case the caches weren't synced yet and we received the same RC with state New,
// and the deployer pod had been deleted (by the user or otherwise), we would have recreated it again.
// Also, a newer deployment might already be running, so we would have 2 active deployer pods.
if c.expectations.SatisfiedExpectations(key) {
Contributor

Can you remove the extra indentation by doing something like

if !c.expectations.SatisfiedExpectations(key) {
   break
}

@@ -173,7 +189,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
// to ensure that changes to 'unrelated' pods don't result in updates to
// the deployment. So, the image check will have to be done in other areas
// of the code as well.
if deployutil.DeploymentNameFor(deployer) != deployment.Name {
controllerRef := kcontroller.GetControllerOf(deployer)
if deployutil.DeploymentNameFor(deployer) != deployment.Name || (controllerRef != nil && controllerRef.UID != deployment.UID) {
Contributor

Seems like this check could be its own function. Also can you add a unit test that exercises both paths?
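
For example, something along these lines (hypothetical name), which would also be easy to cover with a unit test for both the name-mismatch and owner-mismatch paths:

// deployerPodBelongsTo reports whether the deployer pod belongs to the given
// deployment: the deployment-name annotation must match and, if an owner
// reference is set, it must point at the deployment's UID.
func deployerPodBelongsTo(deployer *v1.Pod, deployment *v1.ReplicationController) bool {
    if deployutil.DeploymentNameFor(deployer) != deployment.Name {
        return false
    }
    ref := kcontroller.GetControllerOf(deployer)
    return ref == nil || ref.UID == deployment.UID
}

The condition in handle() then becomes if !deployerPodBelongsTo(deployer, deployment).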

@@ -315,8 +336,12 @@ func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.Replicatio
updatedAnnotations[deployapi.DeployerPodCompletedAtAnnotation] = completedTimestamp.String()
}
return deployapi.DeploymentStatusFailed

case v1.PodUnknown:
Contributor

Seems redundant. Just use the default case.

Contributor

unless you want to log this as weird

Contributor Author

I wanted to explicitly state that we know about this case, so no one reading the code would wonder whether we account for it, because that enum might be extended in the future.

Contributor

Fun fact: PodUnknown is never actually set in k8s code, so I guess this serves as a future placeholder :-)

Contributor

So to get this straight, this code serves as a future placeholder for changes in the future.

Contributor

future generations will face pods in unknown state (quantum containers?)


@@ -282,7 +303,7 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
return nil
}

func (c *DeploymentController) nextStatus(pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
func (c *DeploymentController) nextStatus(current deployapi.DeploymentStatus, pod *v1.Pod, deployment *v1.ReplicationController, updatedAnnotations map[string]string) deployapi.DeploymentStatus {
Contributor

@0xmichalis 0xmichalis Oct 9, 2017

Can't you infer current from deployment?
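
E.g., assuming the status annotation on the RC is still authoritative at this point and a helper like deployutil.DeploymentStatusFor (which reads that annotation) is available:

// Derive the current phase from the RC itself instead of threading it
// through as an extra parameter.
current := deployutil.DeploymentStatusFor(deployment)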

@@ -97,6 +100,12 @@ type DeploymentController struct {
// to a terminal deployment status. Since this controller started using caches,
// the provided rc MUST be deep-copied beforehand (see work() in factory.go).
func (c *DeploymentController) handle(deployment *v1.ReplicationController, willBeDropped bool) error {
key, err := getKeyForReplicationController(deployment)
Contributor

Why are we doing this at the top and not right before we need it?

@@ -246,6 +265,8 @@ func (c *DeploymentController) handle(deployment *v1.ReplicationController, will
}

case deployapi.DeploymentStatusComplete:
c.expectations.ExpectDeletions(key, 0)
Contributor

nit: ExpectDeletionCountForKey() or ExpectNoDeletions(key) for future generations.

@@ -124,6 +126,15 @@ func (c *DeploymentController) updateReplicationController(old, cur interface{})
c.enqueueReplicationController(curRC)
}

func (c *DeploymentController) addPod(obj interface{}) {
pod := obj.(*v1.Pod)
Contributor

pod, ok :=
if !ok {}

Contributor Author

I would call panic in the if branch either way; the bare type assertion panics as well.
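
For completeness, the checked form suggested above would look roughly like this (the bare assertion panics on a bad type, which is the behaviour being defended here):

func (c *DeploymentController) addPod(obj interface{}) {
    pod, ok := obj.(*v1.Pod)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("couldn't convert object %#v to *v1.Pod", obj))
        return
    }
    rc, err := c.rcForDeployerPod(pod)
    if err == nil && rc != nil {
        c.enqueueReplicationController(rc)
    }
}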

pod := obj.(*v1.Pod)

rc, err := c.rcForDeployerPod(pod)
if err == nil && rc != nil {
Contributor

I would return an error in case there is no RC for the deployer pod.

@tnozicka tnozicka added this to the 3.8.0 milestone Oct 9, 2017
@stevekuznetsov
Contributor

ping @tnozicka @Kargakis @mfojtik

@tnozicka
Contributor Author

@stevekuznetsov this is planned for 3.8 as it introduces significant changes, so it's paused for a while.

It is still probably not the cause of seeing 2 deployer pods in the real world, as one case was 2 triggers racing (solved now) and the other involves cancellations causing the same effect.

@openshift-merge-robot
Contributor

@tnozicka PR needs rebase

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 20, 2017
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@tnozicka tnozicka deleted the fix-deployer-invariants branch July 2, 2018 07:32
@tnozicka
Contributor Author

tnozicka commented Jul 2, 2018

For the record: as it turned out, the root cause of having more than one deployer pod created was informers going back in time because of a broken watch cache, not this race (although the race is there). This will be fixed with the new deployer controller rewrite.

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.