
LDS update draining is graceful, but not gradual #35085

Open
bmcalary-atlassian opened this issue Jul 8, 2024 · 3 comments
Labels
area/http, area/xds, bug, no stalebot (Disables stalebot from closing an issue)

Comments

@bmcalary-atlassian

bmcalary-atlassian commented Jul 8, 2024

Description

Short Description

When an LDS update occurs, 100% of subsequent H1 responses immediately have Connection: Close added, and H2 requests result in a GOAWAY. Connections without active requests are allowed to persist and in-flight transactions are allowed to complete, but the HTTP codec signals that encourage a client to end its session are not introduced gradually, resulting in a large wave of reconnections.

Versions

All versions of Envoy

Long Description

We have noticed a difference between Listener draining behavior on shutdown / curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly' and on LDS update with the following configuration:

    --parent-shutdown-time-s 12000 \
    --drain-time-s 10000 \
    --drain-strategy gradual
Event / Experience comparison:

shutdown or curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'
* Idle connections persist until drain-time-s
* New connections are allowed
* Existing connections' responses gradually have H1 Connection: Close and H2 GOAWAY added, reaching 100% at drain-time-s

LDS update
* Idle connections persist until drain-time-s
* New connections are allowed (on the subsequent listener)
* 100% of subsequent H1 responses immediately have Connection: Close added, and H2 requests result in a GOAWAY

Looking at the code, it appears that a gradual drain manager only exists for server level draining:

bool DrainManagerImpl::drainClose() const {
  // If we are actively health check failed and the drain type is default, always drain close.
  //
  // TODO(mattklein123): In relation to x-envoy-immediate-health-check-fail, it would be better
  // if even in the case of server health check failure we had some period of drain ramp up. This
  // would allow the other side to fail health check for the host which will require some thread
  // jumps versus immediately start GOAWAY/connection thrashing.
  if (drain_type_ == envoy::config::listener::v3::Listener::DEFAULT &&
      server_.healthCheckFailed()) {
    return true;
  }

  if (!draining_) {
    return false;
  }

  if (server_.options().drainStrategy() == Server::DrainStrategy::Immediate) {
    return true;
  }
  ASSERT(server_.options().drainStrategy() == Server::DrainStrategy::Gradual);

  // P(return true) = elapsed time / drain timeout
  // If the drain deadline is exceeded, skip the probability calculation.
  const MonotonicTime current_time = dispatcher_.timeSource().monotonicTime();
  if (current_time >= drain_deadline_) {
    return true;
  }

  const auto remaining_time =
      std::chrono::duration_cast<std::chrono::seconds>(drain_deadline_ - current_time);
  const auto drain_time = server_.options().drainTime();
  ASSERT(server_.options().drainTime() >= remaining_time);
  const auto drain_time_count = drain_time.count();

  // If the user hasn't specified a drain timeout it will be zero, so we'll
  // confirm the drainClose immediately. Otherwise we'll use the drain timeout
  // as a modulus to a random number to salt the drain timing.
  if (drain_time_count == 0) {
    return true;
  }

  const auto elapsed_time = drain_time - remaining_time;
  return static_cast<uint64_t>(elapsed_time.count()) >
         (server_.api().randomGenerator().random() % drain_time_count);
}
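
To make the Gradual strategy concrete, here is a minimal Python sketch (an illustration of the probability logic above, not Envoy code): for each response during a server-level drain, the chance of drain-closing ramps roughly linearly from ~0% at the start of the drain to 100% at drain-time-s.

import random

def drain_close(elapsed_s: int, drain_time_s: int) -> bool:
    """Mimics DrainManagerImpl::drainClose() for the Gradual strategy:
    P(close) = elapsed time / drain time, reaching 100% at the deadline."""
    if drain_time_s <= 0 or elapsed_s >= drain_time_s:
        return True
    # Envoy compares the elapsed time against a random value modulo the drain time.
    return elapsed_s > random.randrange(drain_time_s)

# With --drain-time-s 10000, roughly 10% of responses drain-close 1000s in,
# ~50% at 5000s, and 100% once the full window has elapsed.
for elapsed in (1000, 5000, 10000):
    samples = sum(drain_close(elapsed, 10000) for _ in range(100_000))
    print(f"elapsed={elapsed}s -> ~{samples / 1000:.0f}% of responses drain-close")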

While LDS update draining is graceful in the sense that it does not suddenly close all connections, it is not gradual: every subsequent response for every session/client has Connection: Close added (H1), and every subsequent H2 request results in a GOAWAY.

Chart of listener_manager.lds.update_success (green, dotted) vs http.downstream_cx_drain_close (red, solid)
Note how, even though --drain-time-s 10000 suggests a ~3 hour drain window, all clients close their connections immediately (Envoy logs show "drain closing connection" on all requests, and all response headers contain Connection: Close).

Why this is a problem

This can be a problem in scenarios with a high number of persistent connections per Envoy node (>25k, up to 100k-500k) combined with a high transaction rate. Clients close their connections on receiving H1 Connection: Close / H2 GOAWAY, but immediately form a new connection in order to continue making requests. This results in a massive spike in legitimate connection attempts, which can overwhelm the box in terms of rate limits (iptables, Envoy), TLS handshakes, CPU, etc.
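
The scale of the spike can be illustrated with a rough back-of-the-envelope calculation (the connection count and request rate below are assumptions for illustration, not measurements):

# Illustrative arithmetic only; the numbers are assumptions, not measurements.
connections = 100_000          # assumed persistent connections per Envoy node
requests_per_conn_per_s = 1.0  # assumed request rate per connection
drain_time_s = 10_000          # --drain-time-s

# Immediate drain signal: every connection is told to close on its next request,
# so nearly all of them reconnect (and re-handshake TLS) within about a second.
immediate_burst = connections * requests_per_conn_per_s   # ~100,000 reconnects in ~1s
# Gradual drain: the same closes are spread across the whole drain window.
gradual_average = connections / drain_time_s               # ~10 reconnects/s on average

print(f"immediate: ~{immediate_burst:,.0f} reconnects in the first second")
print(f"gradual:   ~{gradual_average:,.0f} reconnects/s averaged over the drain window")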

What we think should be done

Envoy's LDS draining behavior on update should either be made consistent with server draining (/drain_listeners?graceful&inboundonly), i.e. H1 Connection: Close and H2 GOAWAY should be introduced to responses gradually, or the listener should gain a configurable drain_strategy: gradual which enables a gradual probability of H1 Connection: Close and H2 GOAWAY per transaction.

Perhaps this could also be tackled at the same time as #34500, so that server and LDS draining are not only gradual but can also have differently configured lengths.

Repro steps

  1. Start Envoy with the following options:
    --parent-shutdown-time-s 12000 \
    --drain-time-s 10000 \
    --drain-strategy gradual
  2. Start a simple LDS server and serve an HTTP listener, e.g. one which can either serve a response directly (health check filter, Lua filter) or proxy to a cluster such as httpbin:
resources:
  - name: listener
    "@type": type.googleapis.com/envoy.config.listener.v3.Listener
    traffic_direction: INBOUND
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
    filter_chains:
      - filters:
          - name: envoy.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: ingress_http
              common_http_protocol_options: {idle_timeout: 300.000s}
              http2_protocol_options: {max_concurrent_streams: 100}
              route_config:
                name: route
                virtual_hosts:
                  - name: route
                    domains:
                      - "*"
                    routes:
                      - match:
                          prefix: /
                        route:
                          cluster: httpbin
              http_filters:
                - name: Healthcheck178
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                    pass_through_mode: false
                    headers:
                      - {name: x-envoy-provide-health, present_match: true}
                - name: route
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  3. Start sending one or more persistent streams of H1 or H2 requests (H1 and H2 client scripts below). I usually instantiate ~100-1000 clients; a multi-client sketch follows the scripts.
# H1 client: keep one persistent connection open and issue a request every second.
import requests
import time

session = requests.Session()
while True:
    response = session.get(
        "https://127.0.0.1/anything",
        verify=False,
        headers={
            "Host": "localreply.example.com",
        },
    )
    if response.headers.get("Connection") == "close":
        print("Envoy asked for connection to be closed")
    time.sleep(1)
# H2 client: httpx with http2=True keeps a single persistent H2 connection open.
import time

import httpx

with httpx.Client(http2=True, verify=False) as client:
    while True:
        response = client.get(
            "https://127.0.0.1/anything",
            headers={
                "Host": "localreply.example.com",
            },
        )
        time.sleep(1)
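
To approximate the ~100-1000 clients mentioned in step 3 without launching that many processes, a simple thread-per-client wrapper can be used (a sketch only; the client count and URL are assumptions matching the H1 script above):

import threading
import time

import requests

NUM_CLIENTS = 200  # assumed; the issue describes ~100-1000 clients

def client_loop() -> None:
    # One persistent H1 session per thread, one request per second.
    session = requests.Session()
    while True:
        response = session.get(
            "https://127.0.0.1/anything",
            verify=False,
            headers={"Host": "localreply.example.com"},
        )
        if response.headers.get("Connection") == "close":
            print("Envoy asked for connection to be closed")
        time.sleep(1)

for _ in range(NUM_CLIENTS):
    threading.Thread(target=client_loop, daemon=True).start()

# Keep the main thread alive while the client threads run.
while True:
    time.sleep(60)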
  4. Mutate the listener, e.g. change name: Healthcheck178 to name: Healthcheck179. I did this by simply incrementing the name of the filter on each LDS request (a helper for this is sketched below).
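
If the LDS server reads the listener resource from a file, a small helper like the following can perform the mutation (a sketch only; the lds.yaml file name and the path-based LDS setup are assumptions):

import re
from pathlib import Path

# Bump "Healthcheck<N>" to "Healthcheck<N+1>" in the LDS resource file so that
# the next LDS push carries a modified listener and triggers a drain.
lds_path = Path("lds.yaml")  # assumed file name
config = lds_path.read_text()

def bump(match: re.Match) -> str:
    return f"Healthcheck{int(match.group(1)) + 1}"

lds_path.write_text(re.sub(r"Healthcheck(\d+)", bump, config))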
  5. Observe in Python or Wireshark that 100% of sessions/clients immediately have Connection: Close added to their H1 responses, H2 requests result in a GOAWAY on the next request, and the following is printed for all clients by the Python script:
Envoy asked for connection to be closed

If you print the headers, all requests from all clients/sessions immediately receive Connection: Close in the response. Note: in the screenshot, each line represents a request sent by a different client session (not a single client ignoring Connection: Close).
  6. Run the experiment again without modifying the Listener, but this time call curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'
  7. Notice that responses have a random chance of H1 Connection: Close / H2 GOAWAY per request, increasing to 100% likelihood by the end of --drain-time-s 10000, i.e. clients are able to send one to many requests before being asked to close the connection.

Admin and Stats Output

listener_manager.total_filter_chains_draining: 294 <- increments every time an LDS update occurs, as expected.
downstream_cx_drain_close: XX <- immediately spikes to match the number of active requests.

Logs

100% of requests after the LDS update have this log line:

{"time": "2024-07-05 10:52:07.347", "level": "debug", "type": "http", "message": "[Tags: \"ConnectionId\":\"3116\",\"StreamId\":\"11248567122019182204\"] drain closing connection"}
bmcalary-atlassian added the bug and triage (Issue requires triage) labels Jul 8, 2024
nezdolik added the area/http and area/xds labels and removed the triage (Issue requires triage) label Jul 8, 2024
@nezdolik
Member

nezdolik commented Jul 8, 2024

cc @adisuissa


github-actions bot commented Aug 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label Aug 7, 2024
@bmcalary-atlassian
Author

Can we tag this as no stalebot? I think this work is still relevant and worth doing.

github-actions bot removed the stale (stalebot believes this issue/PR has not been touched recently) label Aug 8, 2024
nezdolik added the no stalebot (Disables stalebot from closing an issue) label Aug 8, 2024