
Getting 502 when running a deployment rolling update #1124

Closed
idanya opened this issue Jan 9, 2020 · 16 comments

Comments

@idanya

idanya commented Jan 9, 2020

When running a rolling update of a deployment, the ALB returns a lot of 502s.
It seems the ALB is not in sync with the actual pod state in Kubernetes.

I can see that when a pod is being replaced, the ALB controller registers the new pod in the target group and removes the old one. The problem is that, in the ALB, the new pod's target is still in the initial state while the old one is already draining, leaving no healthy target and causing the service to be unavailable and return 502.

A single-pod service becomes completely unavailable; for bigger services with multiple pods I see a spike in 502s that resolves itself after a while (once all new pods reach the healthy state in the ALB).

Ideally, the old pod should not terminate before the new one reaches a healthy state in the ALB, but of course Kubernetes is not aware of that.

Is this a known issue? Are there any acceptable workarounds?

@gaganapplatix

You need the readiness gate if you are on flat networking and pods are direct targets for the ALB. There is a PR sitting around which needs to be merged for this:

#955

Without it, the deployment controller will move forward with its rolling update regardless of whether the target groups are up to date with regard to the new pods.
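
For reference, the gate is just an extra entry in the Deployment's pod template; a minimal fragment as a sketch (the ingress, service, and port names are placeholders, and the exact condition-type format is whatever the controller ends up documenting):

spec:
  template:
    spec:
      # Pod only reports Ready once the controller marks this condition
      # True, i.e. once the target is healthy in the ALB target group
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80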

@wadefelix

Don't update the service.

@M00nF1sh
Collaborator

M00nF1sh commented Mar 24, 2020

@wadefelix I believe this is fixed with the pod readiness gate.
The caveat is that, for now, you need to add the readiness gate to the pod spec manually, together with a 40-second preStop sleep.

If it doesn't work, please send me a repro case.

Sample pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dp
spec:
  replicas: 200
  selector:
    matchLabels:
      app: my-dp
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-dp
    spec:
      # Pod only reports Ready once the controller marks the target
      # healthy in the ALB (condition format: <ingress>_<service>_<port>)
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-dp_80
      containers:
      - name: server
        image: xx
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        env:
        - name: MESSAGE
          value: The Doctor
        lifecycle:
          preStop:
            exec:
              # Keep the old pod serving while the ALB deregisters and drains it
              command: ["/bin/sh", "-c", "sleep 40"]
      # Must exceed the preStop sleep plus the app's shutdown time
      terminationGracePeriodSeconds: 70

@billyshambrook

@M00nF1sh what's the purpose of the lifecycle hook if we have the pod readiness gate?

@M00nF1sh
Collaborator

@billyshambrook The pod readiness gate only guarantees that you always have pods available for new connections during a deployment (e.g. when the controller experiences timeouts from the AWS APIs, the rollout is paused).
But it does not protect existing connections, or new connections that still reach the old pods: the controller's target deregistration takes time, and the ALB also takes time to propagate target changes to its nodes.
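
A related knob, for illustration only (treat the annotation values as an assumption, not something confirmed in this thread): the ingress annotations let you shorten the target group's deregistration delay, so old targets spend less time draining and the preStop sleep has a smaller window to cover. A rough sketch with placeholder names:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    # Readiness gates apply to IP (direct pod) targets
    alb.ingress.kubernetes.io/target-type: ip
    # Shorter drain means old targets are removed from the ALB sooner
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  rules:
  - http:
      paths:
      - path: /*
        backend:
          serviceName: my-dp
          servicePort: 80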

@billyshambrook

Thanks @M00nF1sh for the fast response. Does Kubernetes provide a hook like the pod readiness gate, but for termination, so a pod waits for an external process (the ALB in this case) to finish before it starts terminating, without having to rely on a sleep? Though you would then need a way of knowing when the ALB has finished propagating the drain request...

Just wondering what the next steps would be to harden this even more.

@M00nF1sh
Collaborator

M00nF1sh commented Mar 25, 2020

@billyshambrook
Currently the only mechanism that can hold back pod container deletion is a lifecycle hook (a finalizer won't work).
Ideally the lifecycle hook could query an in-cluster service hosted by this controller for the removal status (more complicated than a 40-second sleep, TBH :D).
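
Just to illustrate that idea (purely hypothetical: the controller does not expose such an endpoint today, and the URL below is made up), the preStop hook would poll for deregistration instead of sleeping a fixed 40 seconds:

lifecycle:
  preStop:
    exec:
      # Hypothetical: wait until the controller reports this pod's target
      # as deregistered and drained, instead of a blind fixed-length sleep
      command:
      - /bin/sh
      - -c
      - |
        until curl -sf "http://alb-ingress-controller.kube-system/deregistered?pod=${HOSTNAME}"; do
          sleep 2
        done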

@wadefelix

> @wadefelix I believe this is fixed with the pod readiness gate. The caveat is that, for now, you need to add the readiness gate to the pod spec manually, together with a 40-second preStop sleep. […]

If the svc object is updated or recreated, EKS changes the nodePort, so the ALB can no longer route requests to the old pods.
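
If you are on instance-mode targets and really have to recreate the Service, one way to avoid the NodePort changing (a suggestion, not something discussed above) is to pin it explicitly; names are placeholders matching the sample spec:

apiVersion: v1
kind: Service
metadata:
  name: my-dp
spec:
  type: NodePort
  selector:
    app: my-dp
  ports:
  - port: 80
    targetPort: 8080
    # Pin the NodePort so a delete/recreate keeps the same port; it must be
    # unused and within the cluster's node-port range (default 30000-32767)
    nodePort: 30080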

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2020
@pusherman

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2020
@idanya
Author

idanya commented Jul 1, 2020

Fixed in v1.1.6 with readiness gates #955

@idanya idanya closed this as completed Jul 1, 2020
@varuzam

varuzam commented Apr 16, 2021

The problem still exists.
I am using v2.1.3, and only a lifecycle preStop hook helped to avoid 50x errors during deploys.
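
For the v2.x controller, if I remember the docs correctly, the readiness gate no longer has to be written into every pod spec: labeling the namespace makes the controller inject it automatically (the preStop sleep is still needed to cover draining). A sketch:

apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    # aws-load-balancer-controller v2.x injects the target-health
    # readiness gate into new pods created in this namespace
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled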

@simmessa

simmessa commented Jul 6, 2021

I can confirm what @varuzam said: this issue is still here, and having a workaround like that as our only option for zero-downtime deploys on EKS + the ALB controller is frankly unacceptable for something that aims to be "enterprise-grade". Hopefully something will be done about this.

@ajaymdesai

We are running into this as well; every time we deploy we get 502 alarms from our upstream services.

@dansd

dansd commented Sep 1, 2022

I can confirm this is still an issue. When upgrading a deployment to a new version, the moment the new pod becomes ready in Kubernetes, the load balancer returns 502 for 5-10 seconds. This is the state of the target groups while the application is unavailable through the load balancer:
[screenshot of the target group state]

@robinmonjo

Facing the exact same issue, would love to have a proper solution to this 😊
