
LDS update draining is graceful, but not gradual #35085

Open
bmcalary-atlassian opened this issue Jul 8, 2024 · 3 comments
Labels
area/http, area/xds, bug, no stalebot (Disables stalebot from closing an issue)

Comments

@bmcalary-atlassian

bmcalary-atlassian commented Jul 8, 2024

Description

Short Description

When an LDS update occurs, 100% of subsequent H1 responses immediately have Connection: Close added, and H2 requests result in a GOAWAY. Connections without active requests are allowed to persist and in-flight transactions are allowed to complete, but the HTTP codec signals that encourage a client to end its session are not introduced gradually, resulting in a large wave of reconnections.

Versions

All versions of Envoy

Long Description

We have noticed a difference between Listener draining behavior on shutdown / curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly' and on LDS update with the following configuration:

    --parent-shutdown-time-s 12000 \
    --drain-time-s 10000 \
    --drain-strategy gradual
Event / Experience comparison:

shutdown or curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'
* Idle connections persist until drain-time-s
* New connections are allowed
* Existing connections' responses gradually have H1 Connection: Close and H2 GOAWAY added, reaching 100% at drain-time-s

LDS update
* Idle connections persist until drain-time-s
* New connections are allowed (on the subsequent listener)
* 100% of subsequent H1 responses immediately have Connection: Close added, and H2 requests result in a GOAWAY

Looking at the code, it appears that a gradual drain manager only exists for server level draining:

bool DrainManagerImpl::drainClose() const {
  // If we are actively health check failed and the drain type is default, always drain close.
  //
  // TODO(mattklein123): In relation to x-envoy-immediate-health-check-fail, it would be better
  // if even in the case of server health check failure we had some period of drain ramp up. This
  // would allow the other side to fail health check for the host which will require some thread
  // jumps versus immediately start GOAWAY/connection thrashing.
  if (drain_type_ == envoy::config::listener::v3::Listener::DEFAULT &&
      server_.healthCheckFailed()) {
    return true;
  }

  if (!draining_) {
    return false;
  }

  if (server_.options().drainStrategy() == Server::DrainStrategy::Immediate) {
    return true;
  }
  ASSERT(server_.options().drainStrategy() == Server::DrainStrategy::Gradual);

  // P(return true) = elapsed time / drain timeout
  // If the drain deadline is exceeded, skip the probability calculation.
  const MonotonicTime current_time = dispatcher_.timeSource().monotonicTime();
  if (current_time >= drain_deadline_) {
    return true;
  }

  const auto remaining_time =
      std::chrono::duration_cast<std::chrono::seconds>(drain_deadline_ - current_time);
  const auto drain_time = server_.options().drainTime();
  ASSERT(server_.options().drainTime() >= remaining_time);
  const auto drain_time_count = drain_time.count();

  // If the user hasn't specified a drain timeout it will be zero, so we'll
  // confirm the drainClose immediately. Otherwise we'll use the drain timeout
  // as a modulus to a random number to salt the drain timing.
  if (drain_time_count == 0) {
    return true;
  }

  const auto elapsed_time = drain_time - remaining_time;
  return static_cast<uint64_t>(elapsed_time.count()) >
         (server_.api().randomGenerator().random() % drain_time_count);
}
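
To make the Gradual strategy concrete, here is a minimal Python sketch (an illustration of the probability logic above, not Envoy code): for each response during a server-level drain, the chance of drain-closing ramps roughly linearly from ~0% at the start of the drain to 100% at drain-time-s.

import random

def drain_close(elapsed_s: int, drain_time_s: int) -> bool:
    """Mimics DrainManagerImpl::drainClose() for the Gradual strategy:
    P(close) = elapsed time / drain time, reaching 100% at the deadline."""
    if drain_time_s <= 0 or elapsed_s >= drain_time_s:
        return True
    # Envoy compares the elapsed time against a random value modulo the drain time.
    return elapsed_s > random.randrange(drain_time_s)

# With --drain-time-s 10000, roughly 10% of responses drain-close 1000s in,
# ~50% at 5000s, and 100% once the full window has elapsed.
for elapsed in (1000, 5000, 10000):
    samples = sum(drain_close(elapsed, 10000) for _ in range(100_000))
    print(f"elapsed={elapsed}s -> ~{samples / 1000:.0f}% of responses drain-close")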

While LDS update draining is graceful in the sense that it does not suddenly close all connections, it is not gradual: every subsequent response for every session/client has Connection: Close added (H1), and every subsequent H2 request results in a GOAWAY.

Chart of listener_manager.lds.update_success (green, dotted) vs http.downstream_cx_drain_close (red, solid)
Note how, even though --drain-time-s 10000 suggests a ~3 hour drain window, all clients close their connections immediately (Envoy logs show "drain closing connection" on all requests, and all response headers contain Connection: Close).

Why this is a problem

This can be a problem in scenarios with a high number of persistent connections per Envoy node (>25k, up to 100k-500k) combined with a high transaction rate. Clients close their connections on receiving H1 Connection: Close / H2 GOAWAY, but immediately form a new connection in order to continue making requests. This results in a massive spike in legitimate connection attempts, which can overwhelm the box in terms of rate limits (iptables, Envoy), TLS handshakes, CPU, etc.
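
The scale of the spike can be illustrated with a rough back-of-the-envelope calculation (the connection count and request rate below are assumptions for illustration, not measurements):

# Illustrative arithmetic only; the numbers are assumptions, not measurements.
connections = 100_000          # assumed persistent connections per Envoy node
requests_per_conn_per_s = 1.0  # assumed request rate per connection
drain_time_s = 10_000          # --drain-time-s

# Immediate drain signal: every connection is told to close on its next request,
# so nearly all of them reconnect (and re-handshake TLS) within about a second.
immediate_burst = connections * requests_per_conn_per_s   # ~100,000 reconnects in ~1s
# Gradual drain: the same closes are spread across the whole drain window.
gradual_average = connections / drain_time_s               # ~10 reconnects/s on average

print(f"immediate: ~{immediate_burst:,.0f} reconnects in the first second")
print(f"gradual:   ~{gradual_average:,.0f} reconnects/s averaged over the drain window")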

What we think should be done

Envoy's LDS draining behavior on update should either be made consistent with server draining (/drain_listeners?graceful&inboundonly), i.e. H1 Connection: Close and H2 GOAWAY should be introduced to responses gradually, or the listener should gain a configurable drain_strategy: gradual which enables a gradual probability of H1 Connection: Close and H2 GOAWAY per transaction.

Perhaps this could also be tackled at the same time as #34500, so that server and LDS draining are not only gradual but can also have differently configured lengths.

Repro steps

  1. Start Envoy with the following options:
    --parent-shutdown-time-s 12000 \
    --drain-time-s 10000 \
    --drain-strategy gradual
  2. Start a simple LDS server and serve an HTTP listener, e.g. one which can either serve a response directly (health check filter, Lua filter) or proxy to a cluster such as httpbin:
resources:
  - name: listener
    "@type": type.googleapis.com/envoy.config.listener.v3.Listener
    traffic_direction: INBOUND
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
    filter_chains:
      - filters:
          - name: envoy.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              common_http_protocol_options: {idle_timeout: 300.000s}
              http2_protocol_options: {max_concurrent_streams: 100}
              route_config:
                name: route
                virtual_hosts:
                  - name: route
                    domains:
                      - "*"
                    routes:
                      - match:
                          prefix: /
                        route:
                          cluster: httpbin
              http_filters:
                - name: Healthcheck178
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                    pass_through_mode: false
                    headers:
                      - {name: x-envoy-provide-health, present_match: true}
                - name: route
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  3. Start sending one or more persistent streams of H1 or H2 requests (H1 and H2 client scripts below). I usually instantiate ~100-1000 clients; a multi-client sketch follows the scripts.
# H1 client: keep one persistent connection open and issue a request every second.
import requests
import time

session = requests.Session()
while True:
    response = session.get(
        "https://127.0.0.1/anything",
        verify=False,
        headers={
            "Host": "localreply.example.com",
        },
    )
    if response.headers.get("Connection") == "close":
        print("Envoy asked for connection to be closed")
    time.sleep(1)
# H2 client: httpx with http2=True keeps a single persistent H2 connection open.
import time

import httpx

with httpx.Client(http2=True, verify=False) as client:
    while True:
        response = client.get(
            "https://127.0.0.1/anything",
            headers={
                "Host": "localreply.example.com",
            },
        )
        time.sleep(1)
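
To approximate the ~100-1000 clients mentioned in step 3 without launching that many processes, a simple thread-per-client wrapper can be used (a sketch only; the client count and URL are assumptions matching the H1 script above):

import threading
import time

import requests

NUM_CLIENTS = 200  # assumed; the issue describes ~100-1000 clients

def client_loop() -> None:
    # One persistent H1 session per thread, one request per second.
    session = requests.Session()
    while True:
        response = session.get(
            "https://127.0.0.1/anything",
            verify=False,
            headers={"Host": "localreply.example.com"},
        )
        if response.headers.get("Connection") == "close":
            print("Envoy asked for connection to be closed")
        time.sleep(1)

for _ in range(NUM_CLIENTS):
    threading.Thread(target=client_loop, daemon=True).start()

# Keep the main thread alive while the client threads run.
while True:
    time.sleep(60)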
  4. Mutate the listener, e.g. change name: Healthcheck178 to name: Healthcheck179. I did this by simply incrementing the name of the filter on each LDS request (a helper for this is sketched below).
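
If the LDS server reads the listener resource from a file, a small helper like the following can perform the mutation (a sketch only; the lds.yaml file name and the path-based LDS setup are assumptions):

import re
from pathlib import Path

# Bump "Healthcheck<N>" to "Healthcheck<N+1>" in the LDS resource file so that
# the next LDS push carries a modified listener and triggers a drain.
lds_path = Path("lds.yaml")  # assumed file name
config = lds_path.read_text()

def bump(match: re.Match) -> str:
    return f"Healthcheck{int(match.group(1)) + 1}"

lds_path.write_text(re.sub(r"Healthcheck(\d+)", bump, config))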
  5. Observe in Python or Wireshark that 100% of sessions/clients immediately have Connection: Close added to their H1 responses, H2 requests result in a GOAWAY on the next request, and the following is printed for all clients by the Python script:
Envoy asked for connection to be closed

If you print the headers, all requests from all clients/sessions immediately receive Connection: Close in the response. Note: in the screenshot, each line represents a request sent by a different client session (not a single client ignoring Connection: Close).
  6. Run the experiment again without modifying the Listener, but this time call curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'
  7. Notice that responses have a random chance of H1 Connection: Close / H2 GOAWAY per request, increasing to 100% likelihood by the end of --drain-time-s 10000, i.e. clients are able to send one to many requests before being asked to close the connection.

Admin and Stats Output

listener_manager.total_filter_chains_draining: 294 <- increments every time an LDS update occurs, as expected.
downstream_cx_drain_close: XX <- immediately spikes to match the number of active requests.

Logs

100% of requests after the LDS update have this log line:

{"time": "2024-07-05 10:52:07.347", "level": "debug", "type": "http", "message": "[Tags: \"ConnectionId\":\"3116\",\"StreamId\":\"11248567122019182204\"] drain closing connection"}
bmcalary-atlassian added the bug and triage (Issue requires triage) labels Jul 8, 2024
nezdolik added the area/http and area/xds labels and removed the triage (Issue requires triage) label Jul 8, 2024
@nezdolik
Member

nezdolik commented Jul 8, 2024

cc @adisuissa


github-actions bot commented Aug 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label Aug 7, 2024
@bmcalary-atlassian
Author

Can we tag this as no stalebot? I think this work is still relevant and worth doing.

github-actions bot removed the stale (stalebot believes this issue/PR has not been touched recently) label Aug 8, 2024
nezdolik added the no stalebot (Disables stalebot from closing an issue) label Aug 8, 2024