Getting 502 when running a deployment rolling update #1124
You need the readiness gate if you are on flat networking and Pods are direct targets for the ALB. There is a PR sitting around which needs to be merged for this. Without it, the deployment controller will move forward with its rolling update regardless of whether the target groups are up to date with respect to the new pods.
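For illustration, a pod readiness gate on the Deployment's pod template might look like the sketch below (a hedged example: the `target-health.alb.ingress.k8s.aws/...` condition-type format follows the aws-alb-ingress-controller convention, and all names and ports are assumptions, not taken from this thread):

```yaml
# Hedged sketch: declare an ALB target-health readiness gate on the pods.
# Kubernetes keeps each new pod "not Ready" until the controller sets this
# condition to True, i.e. until the ALB target group reports the pod healthy,
# so the rolling update cannot outrun target registration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      readinessGates:
        # condition type encodes <ingress>_<service>_<service-port>
        - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
      containers:
        - name: app
          image: my-app:latest  # hypothetical image
          ports:
            - containerPort: 80
```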
Don't update the service.
@wadefelix I believe this is fixed with the pod readiness probe. If it doesn't work, please send me a repro case. Sample pod spec:
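A minimal sketch of such a pod spec, assuming an HTTP readiness probe plus a preStop sleep (the image name, port, probe settings, and sleep duration are illustrative assumptions):

```yaml
# Hedged sketch: a readiness probe so the pod only becomes Ready when it can serve,
# plus a preStop sleep so the ALB has time to drain the old target before the
# container actually shuts down.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  terminationGracePeriodSeconds: 60   # must exceed the preStop sleep
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /healthz
          port: 80
        periodSeconds: 5
      lifecycle:
        preStop:
          exec:
            # keep serving in-flight requests while the ALB deregisters this target
            command: ["sleep", "30"]
```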
@M00nF1sh what's the purpose of the lifecycle hook if we have the pod readiness gate?
@billyshambrook the pod readiness gate only guarantees that you always have pods available for new connections during a deployment (e.g. when the controller is experiencing timeouts on AWS APIs, the deployment is paused). The lifecycle hook covers the other side: it keeps the old pod alive while the ALB deregisters and drains it, so connections still being routed to it are not dropped.
Thanks @M00nF1sh for the fast response. Does Kubernetes provide a hook like the pod readiness gate, but for termination, so a pod waits until an external process (the ALB in this case) has finished before it starts to terminate, without having to rely on a sleep? You would then need a way of knowing once the ALB has propagated the drain request... Just wondering what the next steps would be to harden this even more.
@billyshambrook
If the Service object is updated/recreated, the NodePort is changed by EKS, so the ALB cannot route requests to the old pods.
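If the Service has to be recreated, one way to keep the ALB pointing at a stable port (a hedged sketch; it assumes instance/NodePort target mode, and the names and port value are illustrative) is to pin the nodePort explicitly so a recreate does not pick a new random port:

```yaml
# Hedged sketch: fix the NodePort so recreating the Service keeps the same port
# that the ALB target group forwards to. All names and numbers are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080   # must fall inside the cluster's NodePort range (default 30000-32767)
```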
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Fixed in v1.1.6 with readiness gates #955
The problem still exists.
I can confirm what @varuzam said: this issue is still here, and having a workaround like that as our only option for zero downtime on EKS + the ALB controller is frankly unacceptable for something that aims to be "enterprise-grade". Hopefully something will be done about this.
We are running into this every time we deploy: 502 alarms from our upstream services.
Facing the exact same issue, would love to have a proper solution to this 😊
When running a rolling update of a deployment, the ALB returns a lot of 502s. It seems like the ALB is not synced with the actual pod state in k8s.

I can see that when a pod is being replaced, the ALB controller registers the new pod in the target group and removes the old one. The problem is that the ALB state for the new pod is `initial` and for the old one is `draining`, causing the service to be unavailable and return 502. The service is completely unavailable when it's a single-pod service; with bigger services with multiple pods I can see there is a spike in 502s which resolves itself after a while (all new pods reach the `healthy` state in the ALB).

Ideally, the old pod should not terminate before the new one reaches a `healthy` state in the ALB. Of course, k8s is not aware of that. Is this a known issue? Any acceptable workarounds?
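For what it's worth, one mitigation sketch (hedged: it combines the readiness-gate and preStop ideas from the comments above, and the names and numbers are illustrative, not a confirmed fix) is to make the rolling update never remove capacity before replacements are Ready:

```yaml
# Hedged sketch: with maxUnavailable: 0, the Deployment only terminates an old pod
# after a surge pod has become Ready; combined with an ALB readiness gate, "Ready"
# means the new pod is already healthy in the target group.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 10   # small buffer after Ready before the pod counts as available
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
```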