[release/9.0-staging] Fix race condition when cancelling pending HTTP connection attempts #110764
Backport of #110744 to release/9.0-staging
Fixes #110598
/cc @MihaZupan
Customer Impact
HttpClient is often used for server-to-server communication. If a service experiences an outage, requests to that service will fail as expected, but HttpClient is expected to eventually recover once the service becomes available again.
Due to a race condition, HttpClient's connection pool may become stuck, preventing new connections to the other server from being established.
To recover, users must restart the process in the service that didn't have an outage in the first place.
We've heard from two internal services (one of them reported in #110598) where this issue led to prolonged recovery times after an outage in Azure.
Regression
Yes - introduced in .NET 6 (where we rewrote the HTTP connection pool).
Testing
Added a targeted test (which also runs in CI) that reliably reproduces the stuck connection pool state.
Further manual validation was performed to make sure the problem is fully addressed.
Risk
Low.
There are logically 3 places where we use the affected state, and 2 of them were already doing so under a lock. This change is limited in scope: it updates the 3rd place to take the same lock.
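For readers unfamiliar with this kind of fix, the shape of the change can be sketched roughly as follows. This is an illustrative sketch only, not the actual connection pool code: the `PendingAttempt` class, its members, and the choice of a `CancellationTokenSource` as the shared state are all hypothetical. It shows three access points to one piece of state, where the third (cancellation) is brought under the same lock the other two already use.

```csharp
using System;
using System.Threading;

// Hypothetical illustration; names and structure are NOT taken from the real pool.
sealed class PendingAttempt
{
    private readonly object _syncObj = new object();
    private CancellationTokenSource? _cts; // the shared state

    // Place 1: publishes the state (assumed to already run under the lock).
    public CancellationToken Start()
    {
        lock (_syncObj)
        {
            _cts = new CancellationTokenSource();
            return _cts.Token;
        }
    }

    // Place 2: clears the state (assumed to already run under the lock).
    public void Complete()
    {
        lock (_syncObj)
        {
            _cts?.Dispose();
            _cts = null;
        }
    }

    // Place 3: cancels the attempt. Without the lock, this could observe the
    // state while place 1 or 2 is changing it; taking the same lock removes
    // that window. Note that in this sketch cancellation callbacks run while
    // the lock is held.
    public bool Cancel()
    {
        lock (_syncObj)
        {
            if (_cts is null)
            {
                return false; // nothing pending, nothing to cancel
            }

            _cts.Cancel();
            return true;
        }
    }
}
```

With all three paths serialized on one lock, a cancellation either sees a fully published attempt or none at all, which is the property an unsynchronized third access cannot guarantee.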