
Getting 502 when running a deployment rolling update #1124

Closed
idanya opened this issue Jan 9, 2020 · 16 comments

Comments

@idanya

idanya commented Jan 9, 2020

When running a rolling update of a deployment, the ALB returns a lot of 502s.
It seems the ALB is not in sync with the actual pod state in Kubernetes.

I can see that when a pod is being replaced, the ALB controller registers the new pod in the target group and removes the old one. The problem is that, in the ALB, the new pod's target is still in the initial state while the old one is already draining, leaving no healthy target and causing the service to be unavailable and return 502.

A single-pod service becomes completely unavailable; for bigger services with multiple pods I see a spike in 502s that resolves itself after a while (once all new pods reach the healthy state in the ALB).

Ideally, the old pod should not terminate before the new one reaches a healthy state in the ALB, but of course Kubernetes is not aware of that.

Is this a known issue? Are there any acceptable workarounds?

@gaganapplatix

You need the readiness gate if you are on flat networking and pods are direct targets for the ALB. There is a PR sitting around which needs to be merged for this:

#955

Without it, the deployment controller will move forward with its rolling update regardless of whether the target groups are up to date with regard to the new pods.
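
For reference, the gate is just an extra entry in the Deployment's pod template; a minimal fragment as a sketch (the ingress, service, and port names are placeholders, and the exact condition-type format is whatever the controller ends up documenting):

spec:
  template:
    spec:
      # Pod only reports Ready once the controller marks this condition
      # True, i.e. once the target is healthy in the ALB target group
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80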

@wadefelix

Don't update the service.

@M00nF1sh
Collaborator

M00nF1sh commented Mar 24, 2020

@wadefelix I believe this is fixed with the pod readiness gate.
The caveat is that, for now, you need to add the readiness gate to the pod spec manually, together with a 40-second preStop sleep.

If it doesn't work, please send me a repro case.

Sample pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dp
spec:
  replicas: 200
  selector:
    matchLabels:
      app: my-dp
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-dp
    spec:
      # Pod only reports Ready once the controller marks the target
      # healthy in the ALB (condition format: <ingress>_<service>_<port>)
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-dp_80
      containers:
      - name: server
        image: xx
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        env:
        - name: MESSAGE
          value: The Doctor
        lifecycle:
          preStop:
            exec:
              # Keep the old pod serving while the ALB deregisters and drains it
              command: ["/bin/sh", "-c", "sleep 40"]
      # Must exceed the preStop sleep plus the app's shutdown time
      terminationGracePeriodSeconds: 70

@billyshambrook

@M00nF1sh what's the purpose of the lifecycle hook if we have the pod readiness gate?

@M00nF1sh
Collaborator

@billyshambrook The pod readiness gate only guarantees that you always have pods available for new connections during a deployment (e.g. when the controller experiences timeouts from the AWS APIs, the rollout is paused).
But it does not protect existing connections, or new connections that still reach the old pods: the controller's target deregistration takes time, and the ALB also takes time to propagate target changes to its nodes.
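
A related knob, for illustration only (treat the annotation values as an assumption, not something confirmed in this thread): the ingress annotations let you shorten the target group's deregistration delay, so old targets spend less time draining and the preStop sleep has a smaller window to cover. A rough sketch with placeholder names:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    # Readiness gates apply to IP (direct pod) targets
    alb.ingress.kubernetes.io/target-type: ip
    # Shorter drain means old targets are removed from the ALB sooner
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  rules:
  - http:
      paths:
      - path: /*
        backend:
          serviceName: my-dp
          servicePort: 80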

@billyshambrook

Thanks @M00nF1sh for the fast response. Does Kubernetes provide a hook like the pod readiness gate, but for termination, so a pod waits for an external process (the ALB in this case) to finish before it starts terminating, without having to rely on a sleep? Though you would then need a way of knowing when the ALB has finished propagating the drain request...

Just wondering what the next steps would be to harden this even more.

@M00nF1sh
Collaborator

M00nF1sh commented Mar 25, 2020

@billyshambrook
Currently the only mechanism that can hold back pod container deletion is a lifecycle hook (a finalizer won't work).
Ideally the lifecycle hook could query an in-cluster service hosted by this controller for the removal status (more complicated than a 40-second sleep, TBH :D).
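
Just to illustrate that idea (purely hypothetical: the controller does not expose such an endpoint today, and the URL below is made up), the preStop hook would poll for deregistration instead of sleeping a fixed 40 seconds:

lifecycle:
  preStop:
    exec:
      # Hypothetical: wait until the controller reports this pod's target
      # as deregistered and drained, instead of a blind fixed-length sleep
      command:
      - /bin/sh
      - -c
      - |
        until curl -sf "http://alb-ingress-controller.kube-system/deregistered?pod=${HOSTNAME}"; do
          sleep 2
        done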

@wadefelix

> @wadefelix I believe this is fixed with the pod readiness gate. The caveat is that, for now, you need to add the readiness gate to the pod spec manually, together with a 40-second preStop sleep. […]

If the svc object is updated or recreated, EKS changes the nodePort, so the ALB can no longer route requests to the old pods.
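
If you are on instance-mode targets and really have to recreate the Service, one way to avoid the NodePort changing (a suggestion, not something discussed above) is to pin it explicitly; names are placeholders matching the sample spec:

apiVersion: v1
kind: Service
metadata:
  name: my-dp
spec:
  type: NodePort
  selector:
    app: my-dp
  ports:
  - port: 80
    targetPort: 8080
    # Pin the NodePort so a delete/recreate keeps the same port; it must be
    # unused and within the cluster's node-port range (default 30000-32767)
    nodePort: 30080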

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2020
@pusherman

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2020
@idanya
Author

idanya commented Jul 1, 2020

Fixed in v1.1.6 with readiness gates #955

@idanya idanya closed this as completed Jul 1, 2020
@varuzam

varuzam commented Apr 16, 2021

The problem still exists.
I am using v2.1.3, and only a lifecycle preStop hook helped to avoid 50x errors during deploys.
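
For the v2.x controller, if I remember the docs correctly, the readiness gate no longer has to be written into every pod spec: labeling the namespace makes the controller inject it automatically (the preStop sleep is still needed to cover draining). A sketch:

apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    # aws-load-balancer-controller v2.x injects the target-health
    # readiness gate into new pods created in this namespace
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled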

@simmessa

simmessa commented Jul 6, 2021

I can confirm what @varuzam said: this issue is still here, and having a workaround like that as our only option for zero-downtime deploys on EKS + the ALB controller is frankly unacceptable for something that aims to be "enterprise-grade". Hopefully something will be done about this.

@ajaymdesai

We are running into this as well; every time we deploy we get 502 alarms from our upstream services.

@dansd

dansd commented Sep 1, 2022

I can confirm this is still an issue. When upgrading a deployment to a new version, the moment the new pod becomes ready in Kubernetes, the load balancer returns 502 for 5-10 seconds. This is the state of the target groups while the application is unavailable through the load balancer:
[screenshot of the target group state]

@robinmonjo

Facing the exact same issue, would love to have a proper solution to this 😊
