Describe the bug
Clients experience downtime when scaling NGF.
While scaling NGF up to 25 replicas one at a time, I noticed the following issues starting around replica 5:
- The new NGF Pod takes a while to become ready.
- Some NGF Pods did not stay ready; they oscillated between ready and not ready.
- Lots of `warn` and `error` logs in all of the NGINX containers.
Notes:
- There are no error logs in the NGF containers.
- Statuses are updated fine.
- 5/7 Pods have the event `Readiness probe failed: Get "http://10.4.0.17:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)`
- Lots of NGINX error/warn logs that look like the following:

```
2023/10/23 19:03:23 [warn] 69#69: *265109 upstream server temporarily disabled while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
2023/10/23 19:03:23 [error] 69#69: *265109 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
2023/10/23 19:04:41 [error] 68#68: *134453 no live upstreams while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /tea HTTP/1.1", upstream: "http://default_tea_80/tea", host: "cafe.example.com"
```
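The readiness failures can be checked manually by hitting the same endpoint the kubelet probes, from inside the cluster. The `/readyz` path, port 8081, and Pod IP come from the event above; the 1-second timeout is an assumption chosen to mirror a typical probe timeout:

```shell
# Probe an NGF Pod's readiness endpoint directly (run from a Pod inside
# the cluster, or via `kubectl exec`). Replace POD_IP with a real Pod IP
# from `kubectl get pods -o wide`.
POD_IP=10.4.0.17
curl --max-time 1 -sS -o /dev/null -w '%{http_code}\n' "http://${POD_IP}:8081/readyz"
```

During the oscillation, this should intermittently hang past the timeout rather than return `200`, matching the `context deadline exceeded` event.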
Wrk Output:

HTTPS:

```
Running 10m test @ https://cafe.example.com/tea
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.24ms  100.72ms   1.13s    97.43%
    Req/Sec     4.25k     2.29k   12.27k    60.91%
  Latency Distribution
     50%    2.37ms
     75%    6.76ms
     90%   16.52ms
     99%  680.50ms
  2095586 requests in 6.03m, 731.21MB read
  Socket errors: connect 0, read 162, write 0, timeout 765
  Non-2xx or 3xx responses: 4782
Requests/sec:   5796.45
Transfer/sec:      2.02MB
```

HTTP:

```
Running 10m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.96ms  124.05ms   1.14s    96.22%
    Req/Sec     3.72k     2.05k   10.59k    64.18%
  Latency Distribution
     50%    3.11ms
     75%    8.89ms
     90%   20.17ms
     99%  808.80ms
  1552569 requests in 6.06m, 552.16MB read
  Socket errors: connect 0, read 143, write 1, timeout 899
  Non-2xx or 3xx responses: 2035
Requests/sec:   4272.38
Transfer/sec:      1.52MB
```
To Reproduce
Steps to reproduce the behavior:
- See the scale-up case of https://github.com/nginxinc/nginx-gateway-fabric/blob/29c9ec015bf0929d752571ada9cc9d373ae7acdc/tests/zero-downtime-scaling/zero-downtime-scaling.md#scale-gradually
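The scale-up case in that document amounts to increasing the Deployment's replica count one at a time, waiting for each new replica to become Ready, while wrk runs against the Gateway in another terminal. A sketch, assuming the Deployment and namespace are both named `nginx-gateway` (names may differ in your install):

```shell
# Scale NGF from 2 to 25 replicas one at a time, waiting for each new
# replica to become Ready before adding the next. Run wrk against the
# Gateway's address concurrently to observe client-facing downtime.
for i in $(seq 2 25); do
  kubectl -n nginx-gateway scale deployment/nginx-gateway --replicas="$i"
  kubectl -n nginx-gateway rollout status deployment/nginx-gateway --timeout=3m
done
```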
Expected behavior
Clients experience no downtime when scaling NGF.
Your environment
- Version of the NGINX Gateway Fabric: `{"level":"info","ts":"2023-10-23T19:00:31Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"73f8b3a1643c2e9b8ff129aeae1ae48447c7b2d2","date":"2023-10-23T17:08:41Z"}`
- Version of Kubernetes: 1.27
- Kubernetes platform (e.g. Minikube or GCP): GKE
- Details on how you expose the NGINX Gateway Fabric Pod (e.g. Service of type LoadBalancer or port-forward): internal LoadBalancer