Clients experience downtime when scaling NGF #1185

Open
@kate-osborn

Description

Describe the bug
Clients experience downtime when scaling NGF.

When attempting to scale NGF to 25 replicas one by one, I noticed the following issues around replica 5:

  1. The new NGF Pod takes a while to become ready.
  2. Some NGF Pods were not staying ready. They oscillated between ready and not ready.
  3. Lots of warn and error logs in all of the NGINX containers.

Notes:

  • There are no error logs in the NGF containers.

  • Resource statuses are still updated correctly.

  • 5/7 Pods have the event Readiness probe failed: Get "http://10.4.0.17:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers). See the probe check sketched after these notes.

  • Lots of NGINX error/warn logs that look like the following:

    "2023/10/23 19:03:23 [warn] 69#69: *265109 upstream server temporarily disabled while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
    
    2023/10/23 19:03:23 [error] 69#69: *265109 upstream timed out (110: Operation timed out) while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /coffee HTTP/1.1", upstream: "http://X.X.X.X:8080/coffee", host: "cafe.example.com"
    
    2023/10/23 19:04:41 [error] 68#68: *134453 no live upstreams while connecting to upstream, client: X.X.X.X, server: cafe.example.com, request: "GET /tea HTTP/1.1", upstream: "http://default_tea_80/tea", host: "cafe.example.com"
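
One way to confirm the readiness failures above is to query the NGF readiness endpoint reported in the probe event (port 8081, path /readyz) directly from inside the cluster. A minimal sketch, assuming the default nginx-gateway namespace (the pod IP below is the one from the event and will differ per pod):

    # Show the probe failure events for the NGF pods
    kubectl -n nginx-gateway get events --field-selector reason=Unhealthy

    # Hit the readiness endpoint directly; 10.4.0.17:8081/readyz is the
    # address taken from the probe event above
    kubectl run readyz-check --rm -it --image=curlimages/curl --restart=Never -- \
      curl -m 5 -v http://10.4.0.17:8081/readyz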
    

Wrk Output:
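
The exact wrk invocations aren't captured here; judging from the output headers below (10 minute test, 2 threads, 100 connections, latency distribution reported), they were presumably along these lines:

    # Assumed invocations; duration, threads, and connections inferred from the output
    wrk -t2 -c100 -d10m --latency https://cafe.example.com/tea
    wrk -t2 -c100 -d10m --latency http://cafe.example.com/coffee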

HTTPS:

Running 10m test @ https://cafe.example.com/tea
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.24ms  100.72ms   1.13s    97.43%
    Req/Sec     4.25k     2.29k   12.27k    60.91%
  Latency Distribution
     50%    2.37ms
     75%    6.76ms
     90%   16.52ms
     99%  680.50ms
  2095586 requests in 6.03m, 731.21MB read
  Socket errors: connect 0, read 162, write 0, timeout 765
  Non-2xx or 3xx responses: 4782
Requests/sec:   5796.45
Transfer/sec:      2.02MB

HTTP:

Running 10m test @ http://cafe.example.com/coffee
  2 threads and 100 connections
^C  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    28.96ms  124.05ms   1.14s    96.22%
    Req/Sec     3.72k     2.05k   10.59k    64.18%
  Latency Distribution
     50%    3.11ms
     75%    8.89ms
     90%   20.17ms
     99%  808.80ms
  1552569 requests in 6.06m, 552.16MB read
  Socket errors: connect 0, read 143, write 1, timeout 899
  Non-2xx or 3xx responses: 2035
Requests/sec:   4272.38
Transfer/sec:      1.52MB

To Reproduce
Steps to reproduce the behavior:

  1. See the scale-up case of https://github.com/nginxinc/nginx-gateway-fabric/blob/29c9ec015bf0929d752571ada9cc9d373ae7acdc/tests/zero-downtime-scaling/zero-downtime-scaling.md#scale-gradually (a condensed sketch of the scaling loop follows).
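
With wrk traffic running against the Gateway, the scale-up itself boils down to increasing the NGF deployment's replica count one at a time and waiting for each new Pod to roll out. A rough sketch, assuming the deployment is named nginx-gateway in the nginx-gateway namespace (adjust both to match your install):

    # Scale from 2 to 25 replicas one at a time, waiting for each rollout to settle
    for replicas in $(seq 2 25); do
      kubectl -n nginx-gateway scale deployment/nginx-gateway --replicas="$replicas"
      kubectl -n nginx-gateway rollout status deployment/nginx-gateway --timeout=5m
    done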

Expected behavior
Clients experience no downtime when scaling NGF.

Your environment

  • Version of the NGINX Gateway Fabric - {"level":"info","ts":"2023-10-23T19:00:31Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"73f8b3a1643c2e9b8ff129aeae1ae48447c7b2d2","date":"2023-10-23T17:08:41Z"}
  • Version of Kubernetes - 1.27
  • Kubernetes platform (e.g. Minikube or GCP) - GKE
  • Details on how you expose the NGINX Gateway Fabric Pod (e.g. Service of type LoadBalancer or port-forward) - internal LoadBalancer

Labels: backlog (Currently unprioritized work. May change with user feedback or as the product progresses.), tests (Pull requests that update tests)
