Short Description
When an LDS update occurs, 100% of subsequent H1 responses immediately have `Connection: Close` added, and H2 requests result in a GOAWAY. Connections without active requests are allowed to persist, and in-flight transactions are allowed to complete, but the HTTP codec messages that encourage a client to end its session are not introduced gradually, resulting in a large wave of reconnections.
Versions
All versions of Envoy
Long Description
We have noticed a difference between listener draining behavior on shutdown / `curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'` and on LDS update, with the following behavior in each case:
shutdown or `curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'`
* Idle connections persist until `drain-time-s`
* New connections are allowed
* Responses on existing connections gradually have H1 `Connection: Close` and H2 GOAWAY added, reaching 100% at `drain-time-s`
LDS update
* Idle connections persist until `drain-time-s`
* New connections are allowed (on the subsequent listener)
* 100% of subsequent H1 responses immediately have `Connection: Close` added, and H2 requests immediately result in a GOAWAY
Looking at the code, it appears that a gradual drain manager only exists for server-level draining: see `envoy/source/server/drain_manager_impl.cc`, lines 43 to 86 at `ad15deb`.
While LDS update draining is graceful in the sense that it does not suddenly close all connections, it is not gradual: all subsequent responses for all sessions/clients have `Connection: Close` added, and H2 requests result in a GOAWAY.
Chart of `listener_manager.lds.update_success` (green, dotted) vs `http.downstream_cx_drain_close` (red, solid):
Note how even though `--drain-time-s 10000` suggests a ~3 hour drain time, all clients close their connections (Envoy logs show `drain closing connection` on all requests, and all response headers contain `Connection: Close`).
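For illustration, here is a minimal Python sketch (not Envoy's actual implementation) of the kind of time-scaled, per-transaction decision the server-level drain manager referenced above makes: the probability that a given response carries `Connection: Close` / GOAWAY ramps from 0% to 100% over the drain window.

```python
import random
import time


# Minimal sketch (not Envoy source code): a per-transaction drain decision
# whose close probability ramps linearly from 0% to 100% over drain_time_s.
class GradualDrainDecision:
    def __init__(self, drain_time_s: float):
        self.drain_time_s = drain_time_s
        self.drain_start = time.monotonic()

    def drain_close(self) -> bool:
        """True if this transaction should get Connection: Close / GOAWAY."""
        elapsed = time.monotonic() - self.drain_start
        close_probability = min(elapsed / self.drain_time_s, 1.0)
        return random.random() < close_probability


# With drain_time_s=10000, early transactions mostly keep their connection;
# by the end of the window every transaction is asked to close.
drain = GradualDrainDecision(drain_time_s=10000)
print(drain.drain_close())
```

LDS update draining skips this ramp entirely: every transaction behaves as if the close probability were already 1.0.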
Why this is a problem
This can be a problem in scenarios with a high number (25k to 500k+) of persistent connections per Envoy node and a high transaction rate. Clients close their connections on H1 `Connection: Close` / H2 GOAWAY, but immediately form a new connection in order to continue making requests. This results in a massive spike in legitimate connection attempts, which can overwhelm the box in terms of rate limits (iptables, Envoy), TLS handshakes, CPU, etc.
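To make the scale of the spike concrete, here is a toy simulation (illustration only; the client count and drain window are arbitrary assumptions, and clients are assumed to reconnect the moment they are closed) comparing peak reconnects per second under an all-at-once drain versus a gradual ramp:

```python
import random

# Toy simulation: persistent clients, each reconnecting immediately after
# being asked to close. Numbers are illustrative assumptions.
NUM_CLIENTS = 100_000
DRAIN_TIME_S = 3600

# All-at-once drain (LDS update today): every client is told to close on its
# next request, so all reconnects land in roughly the same instant.
peak_immediate = NUM_CLIENTS

# Gradual drain: each second, every remaining client sends a request and is
# asked to close with probability elapsed/DRAIN_TIME_S (the linear ramp).
closes_per_second = [0] * DRAIN_TIME_S
for _ in range(NUM_CLIENTS):
    for t in range(DRAIN_TIME_S):
        if random.random() < t / DRAIN_TIME_S:
            closes_per_second[t] += 1
            break

print("peak reconnects/s, all-at-once drain:", peak_immediate)
print("peak reconnects/s, gradual drain:    ", max(closes_per_second))
```

In this toy model the gradual ramp spreads the same total number of reconnects over minutes instead of a single instant, which is exactly the property lost on LDS update.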
What we think should be done
Envoy's LDS draining behavior on update should either be made consistent with server draining (`/drain_listeners?graceful&inboundonly`), i.e. H1 `Connection: Close` and H2 GOAWAY should be introduced to responses gradually, or the listener should have a configurable `drain_strategy: gradual` which enables a gradually increasing probability of H1 `Connection: Close` and H2 GOAWAY per transaction.
Perhaps this could also be tackled at the same time as #34500, so that server and LDS draining are not only gradual, but can have differently configured lengths.
Repro steps
1. Start a simple LDS server and serve an HTTP listener, e.g. one which can either serve a response directly (healthcheck filter, Lua filter) or bounce off a cluster, e.g. httpbin.
2. Start sending one or more persistent streams of H1 or H2 requests. I usually instantiate ~100-1000 clients. For example, H1 with `requests`:
```python
import time

import requests

session = requests.Session()
while True:
    response = session.get(
        "https://127.0.0.1/anything",
        verify=False,
        headers={
            "Host": "localreply.example.com",
        },
    )
    if response.headers.get("Connection") == "close":
        print("Envoy asked for connection to be closed")
    time.sleep(1)
```
And H2 with `httpx`:

```python
import time

import httpx

with httpx.Client(http2=True, verify=False) as client:
    while True:
        response = client.get(
            "https://127.0.0.1/anything",
            headers={
                "Host": "localreply.example.com",
            },
        )
        time.sleep(1)
```
3. Mutate the listener, e.g. change `name: Healthcheck178` to `name: Healthcheck179`. I did this by simply incrementing the name of the filter on each LDS request, as sketched below.
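For example, a minimal sketch of that mutation for a file-based LDS setup (the file path is hypothetical, and `Healthcheck` is the name pattern from our config; Envoy's filesystem subscription expects the file to be replaced atomically):

```python
import os
import re

# Sketch (assumes Envoy is configured with a file-based LDS source that it
# watches): bump the numeric suffix, e.g. Healthcheck178 -> Healthcheck179,
# so Envoy sees a modified listener and drains the old one.
LDS_PATH = "/etc/envoy/lds.yaml"  # hypothetical path

with open(LDS_PATH) as f:
    config = f.read()

config = re.sub(
    r"Healthcheck(\d+)",
    lambda m: f"Healthcheck{int(m.group(1)) + 1}",
    config,
)

# Write to a temp file and move it into place; Envoy's file watch reacts
# to the atomic rename.
tmp_path = LDS_PATH + ".tmp"
with open(tmp_path, "w") as f:
    f.write(config)
os.replace(tmp_path, LDS_PATH)
```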
4. Observe in Python or Wireshark that 100% of sessions/clients immediately have H1 `Connection: Close` added and H2 requests result in a GOAWAY on their next request, and that the following is printed for all clients by the Python script:
```
Envoy asked for connection to be closed
```
5. If you print the headers, all responses to all clients/sessions will immediately contain `Connection: Close`. (Note: each line of that output represents a request sent by a different client session, not a single client ignoring `Connection: Close`.)
6. Run the experiment again without modifying the listener, but this time call `curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful&inboundonly'`.
7. Notice that requests have a random chance of H1 `Connection: Close` / H2 GOAWAY, increasing to 100% likelihood at the end of `--drain-time-s 10000`, i.e. clients are able to send one to many requests before being asked to close the connection.
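A simple way to see the difference between steps 4 and 7 from the client side is to track the fraction of responses per time bucket that ask the client to close (a sketch, reusing the endpoint and headers from the repro scripts above):

```python
import time

import requests

# Sketch: print what fraction of responses in each 10s bucket carried
# Connection: close. Under /drain_listeners?graceful this ramps towards
# 100%; after an LDS update it jumps to 100% immediately.
BUCKET_S = 10
session = requests.Session()
closes = total = 0
bucket_end = time.monotonic() + BUCKET_S

while True:
    response = session.get(
        "https://127.0.0.1/anything",
        verify=False,
        headers={"Host": "localreply.example.com"},
    )
    total += 1
    if response.headers.get("Connection") == "close":
        closes += 1
        session = requests.Session()  # reconnect, as a real client would
    if time.monotonic() >= bucket_end:
        print(f"{closes}/{total} responses asked us to close in this bucket")
        closes = total = 0
        bucket_end += BUCKET_S
    time.sleep(0.1)
```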
Admin and Stats Output
* `listener_manager.total_filter_chains_draining: 294` (increments every time an LDS update occurs)
* `http.downstream_cx_drain_close: XX` (as expected, immediately spikes to the same number as the number of active requests)
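These counters can be watched during the experiment via the admin interface; a small polling sketch (assuming the admin listener on 127.0.0.1:9901 as in the repro steps):

```python
import time

import requests

# Sketch: poll the Envoy admin /stats endpoint and print the drain-related
# counters while running the experiment.
WATCHED = (
    "listener_manager.total_filter_chains_draining",
    "listener_manager.lds.update_success",
)

while True:
    stats = requests.get("http://127.0.0.1:9901/stats").text
    for line in stats.splitlines():
        if line.startswith(WATCHED) or "downstream_cx_drain_close" in line:
            print(line)
    print("---")
    time.sleep(5)
```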
Logs
100% of requests after the LDS update have this log line:
```
drain closing connection
```