
K8s ingress: 502s from ALB when using multiple hosts #964

Closed
herojan opened this issue Feb 22, 2019 · 8 comments · Fixed by #988
Labels
bugfix Bug fixes and patches

Comments

herojan (Contributor) commented Feb 22, 2019

Hi, I've had a look and I don't see any similar issues, so I'll just provide the information here.

Using the K8s ingress controller with three different hosts, we see regular bursts of 502 errors, often several times per day.

These 502 errors happen at the ALB level; they never reach skipper. We know this because skipper does not log anything about them, and because the 502 responses return HTML in the format AWS uses rather than the format skipper uses.

Sample log statement from the application:

2019-02-22 05:19:52,883 FlowId(d34586fd-5155-719a-e039-0471f73193d3) WARN http-apr-37020-exec-19 de.zalando.catalog.service.spp.common.client.Retryable: (tx: d99cbb3b136f453aad987ac2c5e1efac) Operation failed. Retry 1 out of 3. Cause: [502]: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>

Sample of the ELB 502 errors over the past week:
[screenshot: ELB 502 error counts]

Sample ingress file with hostnames and paths etc stripped out:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    zalando.org/skipper-filter: filters
    zalando.org/skipper-predicate: predicates && Method("GET")
  creationTimestamp: 2019-01-11T09:07:35Z
  generation: 1
  labels:
    controller-uid: e9913af0-fd39-11e8-b6ef-02e0019638c0
  name: my_ingress
  namespace: default
spec:
  rules:
  - host: host1
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
  - host: host2
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
  - host: host3
    http:
      paths:
      - backend:
          serviceName: service
          servicePort: http
status:
  loadBalancer:
    ingress:
    - hostname: elb_host

It reminds me of another issue where skipper was not receiving information about new pod IPs quickly enough when nodes were rotated, and so routed traffic to the old pod IPs. I wonder if this is similar but one level up, with the ELB not getting information about skipper pod rotations quickly enough and directing traffic to old skipper IPs.

Let me know if more information is needed.

Edit: Updated the ELB image since I had linked the wrong one originally.

seancurran157 commented Feb 27, 2019

We're also experiencing this issue, but we only have one host in our ingress manifest. Talking with AWS, they pulled logs from the ALB associated with the issue: "upstream prematurely closed connection while reading response header from upstream".

skipper version v0.10.150
platform: EKS

szuecs (Member) commented Feb 28, 2019

@seancurran157 thanks for commenting.

From my understanding, the message AWS wrote means that the ALB timed out on its backend call to skipper.
This can also happen if the proxy call from skipper to a backend times out and the ALB has a lower timeout for that call than skipper does.
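
For illustration, a rough sketch of the timeout ordering being described; the request path and the 60s ALB default are factual, but the skipper-to-backend value is a placeholder, not a setting taken from this cluster:

# request path: client -> ALB -> skipper -> backend pod
# if skipper's proxy call to the backend runs longer than the ALB is
# willing to wait for skipper, the ALB gives up first and the error is
# produced by the ALB instead of by skipper
per_request_timeouts:
  skipper_to_backend: 20s   # placeholder; should be shorter than the ALB's timeout
  alb_to_skipper: 60s       # AWS default idle timeout for ALBs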

seancurran157 commented

We currently have the skipper timeout set lower than the ALB's. It looks like skipper may not be responding with a 504 when the backend times out, and so the ALB responds with a 502.

szuecs added the bugfix (Bug fixes and patches) label on Mar 7, 2019
szuecs (Member) commented Mar 7, 2019

TODO:

olevchyk commented May 7, 2019

We're also experiencing a similar issue. @herojan @seancurran157, could you please confirm that #998 has fixed the issue, or do you still see 502s appearing from time to time?

szuecs (Member) commented May 7, 2019

  • check that the node TCP keep-alive values are greater than the ALB/TG idle timeout
  • check that the idle timeout in skipper (-idle-timeout-server=) is greater than the idle timeout in the ALB/TG (a sketch of both checks follows below)
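
A minimal sketch of how these two checks could translate into concrete values, assuming standard Linux worker nodes and the default ALB/target-group idle timeout of 60s; the keep-alive number is an illustrative assumption, not taken from this cluster:

# ALB / target group idle timeout (AWS default): 60s
node:
  # first check: kernel TCP keep-alive on the worker nodes should be
  # greater than the ALB/TG idle timeout
  net.ipv4.tcp_keepalive_time: 300   # seconds; illustrative value
skipper:
  # second check: skipper's server idle timeout must likewise exceed the
  # ALB/TG idle timeout (62s is the value reported to work later in this thread)
  idle-timeout-server: 62s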

herojan (Contributor, Author) commented May 8, 2019

@olevchyk I asked the team who reported it and they said they haven't seen the issue in the past 15 days' worth of logs, so it seems to be gone.

szuecs (Member) commented May 17, 2019

For the record: skipper -idle-timeout-server=62s with an ALB/TG idle timeout of 60s works and removes the 502s from the ALB target groups to skipper. The 502s were caused by a race condition in which skipper could close an idle connection faster than the ALB, since both load balancers default to a 60s idle timeout on the connection between them. The 2nd-layer LB should have a longer idle connection timeout so that it does not disrupt a connection that is still open from the 1st-layer LB's point of view.
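
For reference, a minimal sketch of how that setting could look in a skipper deployment; the manifest shape, names, port, and image tag are illustrative assumptions (the tag matches the version mentioned earlier in this thread), not the exact manifest used here:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skipper-ingress
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      application: skipper-ingress
  template:
    metadata:
      labels:
        application: skipper-ingress
    spec:
      containers:
      - name: skipper-ingress
        image: registry.opensource.zalan.do/pathfinder/skipper:v0.10.150
        ports:
        - containerPort: 9090
        args:
        - skipper
        - -kubernetes
        - -kubernetes-in-cluster
        # keep idle server-side connections open longer than the ALB/TG
        # idle timeout (60s by default), so the ALB never reuses a
        # connection that skipper has already closed
        - -idle-timeout-server=62s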
