Description
Our service uses HttpClient to send requests to downstream services, and we observed the following:
- [expected] During a network outage on the infrastructure that hosts the destination of the requests, many requests failed or timed out.
- [unexpected] After the network outage was resolved, the sender kept experiencing those timeouts/failures. This only stopped after the machine hosting the request sender was restarted.
We took a dump and, based on what we found, formed a hypothesis that explains the above. We would like the .NET team to check whether the hypothesis is reasonable.
Observations from dump
- The HttpConnectionPool that serves the destination has 88 associated connections and all of them are pending, which implies that connection establishment is hanging.
- Counting the AsyncTaskMethodBuilder state machine boxes on the heap by method suggests that SSL connection establishment is the culprit, not TCP connection establishment:
Count | Total Size | Class Name |
---|---|---|
95 | 19,760 | System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.ValueTuple<System.IO.Stream, System.Net.TransportContext, System.Net.IPEndPoint>>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectAsync>d__103> |
7 | 1,344 | System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.IO.Stream>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__104> |
95 | 14,440 | System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Net.Security.SslStream>+AsyncStateMachineBox<System.Net.Http.ConnectHelper+<EstablishSslConnectionAsync>d__2> |
95 | 19,760 | System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<System.Net.Security.SslStream+<ForceAuthenticationAsync>d__150<System.Net.Security.AsyncReadWriteAdapter>> |
95 | 14,440 | System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Int32>+AsyncStateMachineBox<System.Net.Security.SslStream+<ReceiveHandshakeFrameAsync>d__151<System.Net.Security.AsyncReadWriteAdapter>> |
Analysis
From reading the code in HttpConnectionPool, it seems that, under default settings, there is no cancellation for ConnectToTcpHostAsync or EstablishSslConnectionAsync (a cancellation token is passed, but its timeout is InfiniteTimeSpan). That is understandable for the TCP connection, since the OS enforces its own timeout, but I am not aware of any OS-level timeout for SSL connection establishment. With no OS-level or application-level timeout, SSL connection establishment can hang indefinitely.
Hypothesis
- During the network outage, the HttpConnectionPool started to fill up with connections that hang in the SSL connection phase.
- With PooledConnectionLifetime set in our application, healthy connections die off when their lifetime is up, so there are fewer and fewer healthy connections in the pool. Pending connections do not seem to honor PooledConnectionLifetime.
- Even after the network outage is resolved, the pending connections are still hanging and still count towards _pendingHttp11ConnectionCount in the connection pool. A high _pendingHttp11ConnectionCount makes it harder to start new connections (the connection pool only starts a new connection if the request queue length is larger than _pendingHttp11ConnectionCount).
- The connection pool ended up with no working connections (as the dump showed) and a lot of pending connections, which explains the failures and timeouts we saw.
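
To make the third point concrete, here is a toy model of the condition as described above. It is a simplified sketch, not the actual HttpConnectionPool source: the class and method names are hypothetical, and it ignores any other checks the real pool may perform (such as a maximum connection limit).

```csharp
using System;

public static class PoolModelSketch
{
    // Simplified model of the check described in the hypothesis:
    // a new connection attempt is started only when more requests are
    // queued than there are connection attempts already pending.
    // Hypothetical helper; not the real HttpConnectionPool logic.
    public static bool ShouldStartNewConnection(int queuedRequests, int pendingConnections)
    {
        return queuedRequests > pendingConnections;
    }
}
```

With the 88 pending connections seen in the dump, `ShouldStartNewConnection(10, 88)` is false and `ShouldStartNewConnection(89, 88)` is true, so under this model at least 89 requests would have to be queued before the pool tried a fresh connection.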
Asks to .NET team
- Is the above hypothesis reasonable? (e.g. is there indeed no OS-level timeout for SSL connection establishment, so that it could theoretically hang indefinitely?)
- We are planning to set ConnectTimeout to a concrete value (e.g. 30 seconds). Are there any concerns or things to consider regarding that?
- Why does this value default to an infinite time span? What considerations did the .NET team have?
- I cannot share the dump because of our security protocol, but I am happy to run diagnostics on it if necessary.
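
For reference, the ConnectTimeout change we are considering would look roughly like the sketch below. It assumes the client is built directly on SocketsHttpHandler; the helper names and the 5-minute PooledConnectionLifetime are illustrative placeholders, not our actual configuration, and whether the timeout fully covers the SSL handshake is part of what we are asking.

```csharp
using System;
using System.Net.Http;

public static class HttpClientSetupSketch
{
    // Hypothetical helper showing the settings discussed in this issue.
    public static SocketsHttpHandler CreateHandler()
    {
        return new SocketsHttpHandler
        {
            // Defaults to Timeout.InfiniteTimeSpan; 30 seconds is the
            // example value mentioned in the asks above.
            ConnectTimeout = TimeSpan.FromSeconds(30),

            // Illustrative value only; the real lifetime is application-specific.
            PooledConnectionLifetime = TimeSpan.FromMinutes(5),
        };
    }

    // The handler should be shared by a long-lived HttpClient instance.
    public static HttpClient CreateClient()
    {
        return new HttpClient(CreateHandler());
    }
}
```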
Reproduction Steps
No manual repro. As described above, our service sees timeouts/failures when sending HTTP requests, even after the network outage on the infrastructure that hosts the destination of the requests is resolved.
Expected behavior
HttpClient should be able to send requests to the destination after the network outage impacting the destination is resolved, without restarting the sender application/machine.
Actual behavior
HttpClient reports failures and timeouts when sending requests to the destination, even after the network outage impacting the destination is resolved. The issue is only fixed by restarting the application/machine.
Regression?
n/a
Known Workarounds
No response
Configuration
n/a
Other information
n/a