Multiple running deployments reported in a test - invariant violated #16003
I was unable to find an occurrence of this in the last two days, except today (twice, it looks like).
@tnozicka FYI
@mfojtik I am already checking what's wrong with that. Also https://github.com/openshift/origin/pull/14954/files#diff-f2c8446f523f5b04af6def6670658089R939 started suddenly failing on the weekend, but that might be just a coincidence.
The reason why it's failing now might be OOMs like:

Aug 26 19:13:06.210: INFO: At 2017-08-26 19:12:48 -0400 EDT - event for history-limit-5-deploy: {kubelet ci-primg406-ig-n-ftqj} Killing: Killing container with id docker://deployment:Need to kill Pod

Still looking into the reason why the controller breaks the invariant in this case. The working theory is that it considers that deployment (RC) failed and continues with a new deployment, but then the previously killed deploy-pod gets rescheduled to another node and runs again:

Aug 26 19:13:06.210: INFO: At 2017-08-26 19:13:00 -0400 EDT - event for history-limit-5-deploy: {default-scheduler } Scheduled: Successfully assigned history-limit-5-deploy to ci-primg406-ig-n-hs6q
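For anyone checking this by hand while the test runs, something like the following could work. This is only a rough sketch, not the test's own code, and it assumes the openshift.io/deployment.phase annotation is what records each deployment RC's status:

# List every RC in the current project together with its deployment phase
# annotation; more than one of them in "Running" at the same time is exactly
# the invariant violation this issue is about.
oc get rc -o go-template --template \
  '{{range .items}}{{.metadata.name}} {{index .metadata.annotations "openshift.io/deployment.phase"}}{{"\n"}}{{end}}'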
Hrm, nothing should redeploy that pod, because there is no "reschedule component" (the pod is just evicted by the node, and an RC/RS has to create a new pod).
For the record, the reason why it fails: #16129 (comment)
This is a must-have for 3.7 - it should still be P1 because we don't want to ship without #16129. We can't just break deployments and still ship.

On Fri, Oct 6, 2017 at 6:26 AM, Tomáš Nožička wrote:
I am lowering this to P2 because we just merged #14910, which should lower the chances of this happening significantly, to the level it always was. There is still a very subtle race condition, dependent on caches being out of sync, that is addressed in #16129, which is waiting on reviews.
@smarterclayton the DC controller was creating the same deployment twice, which I think was the cause of this issue - see #16671, which was fixed by #14910. The subtle race being fixed in #16129 was always there, but I don't mind having that for 3.7 if there is someone to review.
If the race is not a regression then yeah p2 is fine.
Talked to @mfojtik and we are leaving this for early 3.8, as that's a fairly big change this late and it's not a regression.
I am seeing this after upgrading to OpenShift 3.7. There is always one more cancelled deployment with the reason "newer deployment was found running". Is there a bug report for this to RH in Bugzilla?
@bortek this race should be extremely rare and has always been there. The confirmation would be to see 2 deployer pods for that deployment running at the same time. If so, that is more likely caused by a rogue kubelet.
The usual suspect for this would be 2 subsequent triggers (configChange, imageChange). Best to file an issue with steps to reproduce so we can look into it.
I am pretty sure now I am hitting this bug. I tested on two different applications in different namespaces and they both behave like that. Yes, they have both configChange and imageChange triggers. It happens when I start a new build which, when completed, initiates a new deployment. It does not help if I disable the configChange trigger. (Can't disable imageChange as I need it.) Steps to reproduce: I just start a new build and wait till it's done, followed by a deploy. Here is a screenshot of how it looks.
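For what it's worth, the reproduction described above boils down to something like this sketch. The name myapp is a hypothetical placeholder; the actual buildconfig/deploymentconfig is not named in the report:

# Hypothetical object name "myapp"; adjust to your project.
oc start-build myapp --follow     # new build; the imageChange trigger fires once it completes
oc rollout status dc/myapp        # wait for the resulting deployment to finish
oc get rc                         # the myapp-<N> RCs; look for the extra cancelled deployment here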
@bortek Unless you see 2 deployer pods running at the same time, it isn't this bug. Say, like:

while true; do echo "=========" && oc get pods -o go-template --template '{{ range $i, $elem := .items }}{{ printf "%s - %s\n" .metadata.name .status.phase }}{{end}}'; done

Please open a new bug stating your exact OpenShift version, and the output of
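A slight variation on that loop, limited to just the deployer pods (their names end in -deploy, like the history-limit-5-deploy pod from the events earlier in this thread), can make the confirmation easier to spot. This is only a sketch along the same lines, not taken from the issue:

# Print only deployer pods with their phase; two of them in "Running" for the
# same deployment at the same time would confirm this bug.
while true; do
  echo "========="
  oc get pods -o go-template --template \
    '{{ range .items }}{{ printf "%s %s\n" .metadata.name .status.phase }}{{ end }}' \
    | grep -- '-deploy ' || true
  sleep 1
done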
Automatic merge from submit-queue (batch tested with PRs 18233, 18068, 18228, 18227).
UPSTREAM: 58547: Send correct resource version for delete events from watch cache
Backport of kubernetes/kubernetes#58547. The watch cache was returning an incorrect (old) ResourceVersion on "deleted" events, breaking informers that were going back in time; this fixes it.
/assign @liggitt /cc @mfojtik
Fixes #17581 #16003 and likely others
The watch cache issue and the kubelet should now be fixed; I haven't seen this for a while.
https://ci.openshift.redhat.com/jenkins/job/zz_origin_gce_image/406/testReport/junit/(root)/Extended/deploymentconfigs_with_revision_history_limits__Conformance__should_never_persist_more_old_deployments_than_acceptable_after_being_observed_by_the_controller/