Clients experience downtime when scaling NGF #1185

Open
@kate-osborn

Description

Describe the bug
Clients experience downtime when scaling NGF.

When attempting to scale NGF to 25 replicas one by one, I noticed the following issues around replica 5:

  1. The new NGF Pod takes a while to become ready.
  2. Some NGF Pods were not staying ready. They oscillated between ready and not ready.
  3. Lots of warn and error logs in all of the NGINX containers.

Notes:

  • There are no error logs in the NGF containers.

  • Resource statuses are still updated correctly.

  • 5/7 Pods have the event Readiness probe failed: Get "http://10.4.0.17:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers). See the probe check sketched after these notes.

  • Lots of NGINX error/warn logs that look like the following:

    "2023/10/23 19:03:23 [warn] 69#69: *265109 upstream server temporarily disabled while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
    
    2023/10/23 19:03:23 [error] 69#69: *265109 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
    
    2023/10/23 19:04:41 [error] 68#68: *134453 no live upstreams while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /tea HTTP/1.1", upstream: "http://default_tea_80/tea", host: "cafe.example.com"
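
One way to confirm the readiness failures above is to query the NGF readiness endpoint reported in the probe event (port 8081, path /readyz) directly from inside the cluster. A minimal sketch, assuming the default nginx-gateway namespace (the pod IP below is the one from the event and will differ per pod):

    # Show the probe failure events for the NGF pods
    kubectl -n nginx-gateway get events --field-selector reason=Unhealthy

    # Hit the readiness endpoint directly; 10.4.0.17:8081/readyz is the
    # address taken from the probe event above
    kubectl run readyz-check --rm -it --image=curlimages/curl --restart=Never -- \
      curl -m 5 -v http://10.4.0.17:8081/readyz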
    

Wrk Output:
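
The exact wrk invocations aren't captured here; judging from the output headers below (10 minute test, 2 threads, 100 connections, latency distribution reported), they were presumably along these lines:

    # Assumed invocations; duration, threads, and connections inferred from the output
    wrk -t2 -c100 -d10m --latency https://cafe.example.com/tea
    wrk -t2 -c100 -d10m --latency http://cafe.example.com/coffee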

HTTPS:

Running 10m test @ https://cafe.example.com/tea
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.24ms  100.72ms   1.13s    97.43%
    Req/Sec     4.25k     2.29k   12.27k    60.91%
  Latency Distribution
     50%    2.37ms
     75%    6.76ms
     90%   16.52ms
     99%  680.50ms
  2095586 requests in 6.03m, 731.21MB read
  Socket errors: connect 0, read 162, write 0, timeout 765
  Non-2xx or 3xx responses: 4782
Requests/sec:   5796.45
Transfer/sec:      2.02MB

HTTP:

Running 10m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.96ms  124.05ms   1.14s    96.22%
    Req/Sec     3.72k     2.05k   10.59k    64.18%
  Latency Distribution
     50%    3.11ms
     75%    8.89ms
     90%   20.17ms
     99%  808.80ms
  1552569 requests in 6.06m, 552.16MB read
  Socket errors: connect 0, read 143, write 1, timeout 899
  Non-2xx or 3xx responses: 2035
Requests/sec:   4272.38
Transfer/sec:      1.52MB

To Reproduce
Steps to reproduce the behavior:

  1. See the scale-up case of https://github.com/nginxinc/nginx-gateway-fabric/blob/29c9ec015bf0929d752571ada9cc9d373ae7acdc/tests/zero-downtime-scaling/zero-downtime-scaling.md#scale-gradually (a condensed sketch of the scaling loop follows).
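
With wrk traffic running against the Gateway, the scale-up itself boils down to increasing the NGF deployment's replica count one at a time and waiting for each new Pod to roll out. A rough sketch, assuming the deployment is named nginx-gateway in the nginx-gateway namespace (adjust both to match your install):

    # Scale from 2 to 25 replicas one at a time, waiting for each rollout to settle
    for replicas in $(seq 2 25); do
      kubectl -n nginx-gateway scale deployment/nginx-gateway --replicas="$replicas"
      kubectl -n nginx-gateway rollout status deployment/nginx-gateway --timeout=5m
    done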

Expected behavior
Clients experience no downtime when scaling NGF.

Your environment

  • Version of the NGINX Gateway Fabric - {"level":"info","ts":"2023-10-23T19:00:31Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"73f8b3a1643c2e9b8ff129aeae1ae48447c7b2d2","date":"2023-10-23T17:08:41Z"}
  • Version of Kubernetes - 1.27
  • Kubernetes platform (e.g. Minikube or GCP) - GKE
  • Details on how you expose the NGINX Gateway Fabric Pod (e.g. Service of type LoadBalancer or port-forward) - internal LoadBalancer

Labels: backlog (Currently unprioritized work. May change with user feedback or as the product progresses.), tests (Pull requests that update tests)
