
K8s ingress: 502s from ALB when using multiple hosts #964

Closed
herojan opened this issue Feb 22, 2019 · 8 comments · Fixed by #988
Labels
bugfix Bug fixes and patches

Comments

herojan (Contributor) commented Feb 22, 2019

Hi, I've had a look and I don't see any similar issues, so I'll just provide the information here.

Using the K8s ingress controller with three different hosts, we see regular bursts of 502 errors, often several times per day.

These 502 errors happen at the ALB level; they never reach skipper. We know this because skipper does not log anything about them, and because the 502 responses return HTML in the format AWS uses rather than the format skipper uses.

Sample log statement from the application:

2019-02-22 05:19:52,883 FlowId(d34586fd-5155-719a-e039-0471f73193d3) WARN http-apr-37020-exec-19 de.zalando.catalog.service.spp.common.client.Retryable: (tx: d99cbb3b136f453aad987ac2c5e1efac) Operation failed. Retry 1 out of 3. Cause: [502]: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

Sample of the ELB 502 errors over the past week:
[screenshot: ELB 502 error counts]

Sample ingress file with hostnames and paths etc stripped out:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    zalando.org/skipper-filter: filters
    zalando.org/skipper-predicate: predicates && Method("GET")
  creationTimestamp: 2019-01-11T09:07:35Z
  generation: 1
  labels:
    controller-uid: e9913af0-fd39-11e8-b6ef-02e0019638c0
  name: my_ingress
  namespace: default
spec:
  rules:
  - host: host1
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
  - host: host2
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
  - host: host3
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
status:
  loadBalancer:
    ingress:
    - hostname: elb_host

It reminds me of another issue where skipper was not receiving information about new pod IPs quickly enough when nodes were rotated, and so routed traffic to the old pod IPs. I wonder if this is similar but one level up, with the ELB not getting information about skipper pod rotations quickly enough and directing traffic to old skipper IPs.

Let me know if more information is needed.

Edit: Updated the ELB image since I had linked the wrong one originally.

seancurran157 commented Feb 27, 2019

We're also experiencing this issue, but we only have one host in our ingress manifest. Talking with AWS, they pulled logs from the ALB associated with the issue: "upstream prematurely closed connection while reading response header from upstream".

skipper version v0.10.150
platform: EKS

szuecs (Member) commented Feb 28, 2019

@seancurran157 thanks for commenting.

From my understanding, the message AWS wrote means that the ALB timed out on its backend call to skipper.
This can also happen if the proxy call from skipper to a backend times out and the ALB has a lower timeout for that call than skipper does.
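
For illustration, a rough sketch of the timeout ordering being described; the request path and the 60s ALB default are factual, but the skipper-to-backend value is a placeholder, not a setting taken from this cluster:

# request path: client -> ALB -> skipper -> backend pod
# if skipper's proxy call to the backend runs longer than the ALB is
# willing to wait for skipper, the ALB gives up first and the error is
# produced by the ALB instead of by skipper
per_request_timeouts:
  skipper_to_backend: 20s   # placeholder; should be shorter than the ALB's timeout
  alb_to_skipper: 60s       # AWS default idle timeout for ALBs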

seancurran157 commented

We currently have the skipper timeout set lower than the ALB's. It looks like skipper may not be responding with a 504 when the backend times out, and so the ALB responds with a 502.

szuecs added the bugfix (Bug fixes and patches) label on Mar 7, 2019
szuecs (Member) commented Mar 7, 2019

TODO:

olevchyk commented May 7, 2019

We're also experiencing a similar issue. @herojan @seancurran157, could you please confirm that #998 has fixed the issue, or do you still see 502s appearing from time to time?

szuecs (Member) commented May 7, 2019

  • check that the node TCP keep-alive values are greater than the ALB/TG idle timeout
  • check that the idle timeout in skipper (-idle-timeout-server=) is greater than the idle timeout in the ALB/TG (a sketch of both checks follows below)
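
A minimal sketch of how these two checks could translate into concrete values, assuming standard Linux worker nodes and the default ALB/target-group idle timeout of 60s; the keep-alive number is an illustrative assumption, not taken from this cluster:

# ALB / target group idle timeout (AWS default): 60s
node:
  # first check: kernel TCP keep-alive on the worker nodes should be
  # greater than the ALB/TG idle timeout
  net.ipv4.tcp_keepalive_time: 300   # seconds; illustrative value
skipper:
  # second check: skipper's server idle timeout must likewise exceed the
  # ALB/TG idle timeout (62s is the value reported to work later in this thread)
  idle-timeout-server: 62s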

herojan (Contributor, Author) commented May 8, 2019

@olevchyk I asked the team who reported it and they said they haven't seen the issue in the past 15 days' worth of logs, so it seems to be gone.

szuecs (Member) commented May 17, 2019

For the record: skipper -idle-timeout-server=62s with an ALB/TG idle timeout of 60s works and removes the 502s from the ALB target groups to skipper. The 502s were caused by a race condition in which skipper could close an idle connection faster than the ALB, since both load balancers default to a 60s idle timeout on the connection between them. The 2nd-layer LB should have a longer idle connection timeout so that it does not disrupt a connection that is still open from the 1st-layer LB's point of view.
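
For reference, a minimal sketch of how that setting could look in a skipper deployment; the manifest shape, names, port, and image tag are illustrative assumptions (the tag matches the version mentioned earlier in this thread), not the exact manifest used here:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skipper-ingress
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      application: skipper-ingress
  template:
    metadata:
      labels:
        application: skipper-ingress
    spec:
      containers:
      - name: skipper-ingress
        image: registry.opensource.zalan.do/pathfinder/skipper:v0.10.150
        ports:
        - containerPort: 9090
        args:
        - skipper
        - -kubernetes
        - -kubernetes-in-cluster
        # keep idle server-side connections open longer than the ALB/TG
        # idle timeout (60s by default), so the ALB never reuses a
        # connection that skipper has already closed
        - -idle-timeout-server=62s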
