[HTTP Connection Pool] Lack of timeout on SSL connection establishment caused high number of pending connections in pool #110598

Closed
@hongrui-zhang

Description

Our service uses HttpClient to send requests to downstream services, and we observed the following:

  1. [expected] During a network outage on the infrastructure that hosts the destination of the requests, many requests failed or timed out.
  2. [unexpected] After the network outage was resolved, the sender kept experiencing the same timeouts/failures. They stopped only after the machine hosting the request sender was restarted.

We took a memory dump and, based on what we found, formed a hypothesis that explains the behavior above. We would like the .NET team to check whether the hypothesis is reasonable.

Observations from dump

  1. The HttpConnectionPool that serves the destination has 88 associated connections and all of them are pending, which implies that the connection establishments are hanging
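  2. Counting the AsyncTaskMethodBuilder state machine boxes on the heap by method suggests that SSL connection establishment is the culprit, not TCP connection establishment: 95 instances are waiting in the SSL handshake path, versus 7 in ConnectToTcpHostAsync.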
    Count  Total Size  Class Name
       95      19,760  System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.ValueTuple<System.IO.Stream, System.Net.TransportContext, System.Net.IPEndPoint>>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectAsync>d__103>
        7       1,344  System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.IO.Stream>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__104>
       95      14,440  System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Net.Security.SslStream>+AsyncStateMachineBox<System.Net.Http.ConnectHelper+<EstablishSslConnectionAsync>d__2>
       95      19,760  System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<System.Net.Security.SslStream+<ForceAuthenticationAsync>d__150<System.Net.Security.AsyncReadWriteAdapter>>
       95      14,440  System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Int32>+AsyncStateMachineBox<System.Net.Security.SslStream+<ReceiveHandshakeFrameAsync>d__151<System.Net.Security.AsyncReadWriteAdapter>>

Analysis
From reading the code in HttpConnectionPool, it seems that under the default settings nothing ever cancels ConnectToTcpHostAsync or EstablishSslConnectionAsync: a cancellation token is passed, but its timeout is InfiniteTimeSpan. That is understandable for the TCP connection, because the OS enforces its own connect timeout, but I am not aware of any OS-level timeout for SSL connection establishment. With neither an OS-level nor an application-level timeout, an SSL handshake can hang indefinitely.
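To illustrate the kind of hang described in the analysis, here is a minimal sketch, not the HttpConnectionPool code path. It assumes a local TcpListener that accepts the TCP connection and then stays silent, standing in for a peer that stopped responding mid-outage. The 2-second budget and the names `silentPeer` and `handshakeCancelled` are illustrative. Without the CancellationTokenSource timeout, the `AuthenticateAsClientAsync` call below would be expected to wait until the peer sends data or the socket is closed.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Security;
using System.Net.Sockets;
using System.Threading;

// Local listener that completes the TCP handshake but never sends TLS bytes
// (assumption: this stands in for a destination that went silent).
var listener = new TcpListener(IPAddress.Loopback, 0);
listener.Start();
int port = ((IPEndPoint)listener.LocalEndpoint).Port;

using var tcp = new TcpClient();
await tcp.ConnectAsync(IPAddress.Loopback, port);             // TCP connect succeeds
using var silentPeer = await listener.AcceptTcpClientAsync(); // accepted, then left idle

using var ssl = new SslStream(tcp.GetStream());
// Application-level bound on the handshake; 2 seconds is an illustrative value.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
var options = new SslClientAuthenticationOptions { TargetHost = "localhost" };

bool handshakeCancelled = false;
var sw = Stopwatch.StartNew();
try
{
    // The client sends its ClientHello, then waits for a ServerHello that never arrives.
    await ssl.AuthenticateAsClientAsync(options, cts.Token);
}
catch (Exception ex) when (ex is OperationCanceledException || cts.IsCancellationRequested)
{
    // Reached only because the token fired while the handshake was waiting.
    handshakeCancelled = true;
}
sw.Stop();
listener.Stop();

Console.WriteLine($"handshake cancelled: {handshakeCancelled}, waited {sw.ElapsedMilliseconds} ms");
```

The TCP connect finishes immediately, so a timeout that covered only the TCP phase would not help here. The wait happens entirely inside the handshake.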

Hypothesis

  1. During the network outage, the HttpConnectionPool became contaminated with connections that hang in the SSL connection phase.
  2. Because our application sets PooledConnectionLifetime, healthy connections die off when their lifetime expires, so the pool holds fewer and fewer healthy connections. Pending connections do not seem to honor PooledConnectionLifetime.
  3. Even after the network outage is resolved, the pending connections keep hanging and keep counting towards _pendingHttp11ConnectionCount in the connection pool. A high _pendingHttp11ConnectionCount makes it harder to start new connections (the connection pool only starts a new connection if the request queue length is larger than _pendingHttp11ConnectionCount).
  4. The connection pool ended up with no working connections (as the dump showed) and many pending connections, which explains the failures and timeouts we saw.
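Step 3 of the hypothesis can be sketched as a toy model. This is not the actual HttpConnectionPool implementation; `ShouldStartNewConnection` is a hypothetical helper that encodes only the rule as described above (start a new connection when the request queue is longer than the pending connection count), and 88 is the pending count seen in the dump.

```csharp
using System;

// Pending (hung) connection count, taken from the dump observation above.
int pendingConnectionCount = 88;

foreach (int queuedRequests in new[] { 1, 50, 88, 89 })
{
    bool start = ShouldStartNewConnection(queuedRequests, pendingConnectionCount);
    Console.WriteLine($"queued={queuedRequests}, pending={pendingConnectionCount}, start new connection: {start}");
}

// Hypothetical helper modelling the rule described in the hypothesis:
// a new connection attempt starts only when the number of queued requests
// exceeds the number of connection attempts already pending.
static bool ShouldStartNewConnection(int queuedRequests, int pendingConnections)
    => queuedRequests > pendingConnections;
```

Under this model, hung attempts that never complete or fail keep the pending count high, so a moderate request queue would not trigger any new connection attempts.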

Asks to .NET team

  1. Is the above hypothesis reasonable? (For example, is there indeed no OS-level timeout for SSL connection establishment, so that it could theoretically hang indefinitely?)
  2. We are planning to set ConnectTimeout to a concrete value (e.g. 30 seconds). Are there any concerns or things to consider with that?
  3. Why does this value default to an infinite time span? What considerations did the .NET team have?
  4. I cannot share the dump due to our security protocol, but I am happy to run diagnostics on it if necessary.
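For reference, the mitigation in ask 2 might look like the sketch below. It assumes the application can construct its own SocketsHttpHandler. The 30-second value is the one mentioned above, and the 5-minute PooledConnectionLifetime is an illustrative placeholder, not our actual setting.

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Bound connection establishment so a hung attempt fails instead of
    // staying pending forever (the default is Timeout.InfiniteTimeSpan).
    ConnectTimeout = TimeSpan.FromSeconds(30),

    // Illustrative value only; the application already sets a lifetime.
    PooledConnectionLifetime = TimeSpan.FromMinutes(5),
};

using var client = new HttpClient(handler);
Console.WriteLine($"ConnectTimeout = {handler.ConnectTimeout}");
```

Whether 30 seconds is appropriate depends on how slow a legitimate connection to the destination can be, which is part of what ask 2 is trying to find out.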

Reproduction Steps

No manual repro. As described above, our service sees timeouts/failures when sending HTTP requests, even after the network outage on the infrastructure that hosts the destination has been resolved.

Expected behavior

HttpClient should be able to send requests to the destination once the network outage impacting the destination is resolved, without restarting the sender application/machine.

Actual behavior

HttpClient reports failures and timeouts when sending requests to the destination, even after the network outage impacting the destination is resolved. The issue is resolved only by restarting the application/machine.

Regression?

n/a

Known Workarounds

No response

Configuration

n/a

Other information

n/a

Metadata

Labels

area-System.Net.Http, bug, in-pr (There is an active PR which will close this issue when it is merged)