Pod readiness gates #955
Conversation
Hi @devkid. Thanks for your PR. I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
+1 on this feature, we'd love to get it
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle |
/remove-lifecycle rotten |
…eady` when it's healthy and taking traffic from ALB
Force-pushed from d8befc9 to b134750.
I think we should start with the per-target group approach. We can add the per-ingress functionality later if required.
I don't think we should have a timeout for this. The whole idea of readiness gates is that they should block deployments if the pods don't get registered in the load balancer (for whatever reason). Proceeding with the deployment while the readiness gate is not yet satisfied will most likely bring the service down eventually (e.g. when, for any reason, the ALB ingress controller is down).
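To make that blocking behaviour concrete, here is a minimal, hypothetical Go sketch (not the controller's actual code; the condition type name is made up) of the rule a readiness gate imposes: a pod only becomes Ready once the external controller sets the matching condition to True, so a rollout waits until the target is actually registered and healthy in the load balancer.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical condition type; the real controller defines its own naming scheme.
const albReadinessGate corev1.PodConditionType = "target-health.alb.ingress.k8s.aws/my-target-group"

// gatesSatisfied is a simplified version of the kubelet rule: every condition
// listed in spec.readinessGates must be True before the pod can become Ready.
func gatesSatisfied(pod *corev1.Pod) bool {
	for _, gate := range pod.Spec.ReadinessGates {
		satisfied := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == gate.ConditionType && cond.Status == corev1.ConditionTrue {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	return true
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			ReadinessGates: []corev1.PodReadinessGate{{ConditionType: albReadinessGate}},
		},
	}
	// The controller has not reported target health yet, so the pod stays unready.
	fmt.Println(gatesSatisfied(pod)) // false

	// Once the target is healthy in the target group, the controller patches the condition.
	pod.Status.Conditions = append(pod.Status.Conditions, corev1.PodCondition{
		Type:   albReadinessGate,
		Status: corev1.ConditionTrue,
	})
	fmt.Println(gatesSatisfied(pod)) // true
}
```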
overall looks good to me 👍 just we should remove the reconciliation strategy and interval options
* remove reconciliation strategy (always use `initial`)
* remove reconciliation interval (use healthcheckIntervalSeconds * healthyThresholdCount initially, then only healthcheckIntervalSeconds)

Also:
* include ALB target health description in the `Reason` field of pod condition
* initialize and pass the TargetHealthController in TargetGroupController instead of TargetsController (one level above)
@M00nF1sh removed the strategy and interval. It now uses `healthcheckIntervalSeconds * healthyThresholdCount` initially, then only `healthcheckIntervalSeconds`.
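As a rough illustration of that interval logic (assumed names, not the controller's actual code): the first re-check after registering a target waits `healthcheckIntervalSeconds * healthyThresholdCount`, since a new target cannot pass the health check before the healthy threshold has been reached, and later re-checks only wait `healthcheckIntervalSeconds`.

```go
package main

import (
	"fmt"
	"time"
)

// targetGroupHealthCheck holds the two ALB health-check settings referenced above.
type targetGroupHealthCheck struct {
	IntervalSeconds  int64
	HealthyThreshold int64
}

// requeueAfter returns how long to wait before re-checking target health.
func requeueAfter(hc targetGroupHealthCheck, firstCheck bool) time.Duration {
	if firstCheck {
		// A freshly registered target needs HealthyThreshold consecutive
		// successful checks before the ALB reports it as healthy.
		return time.Duration(hc.IntervalSeconds*hc.HealthyThreshold) * time.Second
	}
	return time.Duration(hc.IntervalSeconds) * time.Second
}

func main() {
	hc := targetGroupHealthCheck{IntervalSeconds: 10, HealthyThreshold: 3}
	fmt.Println(requeueAfter(hc, true))  // 30s for the initial check
	fmt.Println(requeueAfter(hc, false)) // 10s afterwards
}
```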
@M00nF1sh is there anything else that prevents this from merging? Or do you need to coordinate this with the v2 work?
@devkid
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: devkid, M00nF1sh

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/woof
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```go
if err != nil {
	continue
}
```
@M00nF1sh @devkid shouldn't we also check if `DeletionTimestamp` is != nil? Because pods that are in the Terminating state can still be found in the store, but should already be set to "draining" on the target group, right? I can still see quite a substantial amount of 5xx errors (502, 504) in our test case when the pods are actually receiving their SIGTERM signal.
Pods in `Terminating` state appear neither in `Addresses` nor in `NotReadyAddresses`.
🤷♂️ For us they do, that's why I mentioned it here. We're using EKS 1.14 and `publishNotReadyAddresses` on the service.
That's weird. I actually thought about implementing the check for `DeletionTimestamp != nil`, but I verified on our end that terminating pods do not appear in `NotReadyAddresses`. Feel free to add the check. I'll have another look at this on our end as well next week.
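For illustration, a hypothetical sketch of the suggested check (names are made up, not the controller's code): treat a pod whose `DeletionTimestamp` is set as terminating and skip it when building the target list, so it can drain instead of receiving new traffic.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isTerminating reports whether the pod has been marked for deletion.
func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func main() {
	now := metav1.NewTime(time.Now())
	// Pods resolved from the endpoints store; web-2 is already terminating.
	pods := []*corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "web-1"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "web-2", DeletionTimestamp: &now}},
	}
	for _, pod := range pods {
		if isTerminating(pod) {
			// Terminating pods should not get new traffic; let the ALB drain them.
			continue
		}
		fmt.Println("register target for pod:", pod.Name) // only web-1
	}
}
```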
…ess-gate Pod readiness gates
This adds a feature to set the status of pod readiness gates on pods that are registered with an ALB. See the added documentation for implementation details.
Todo:
This closes #905.