[release/7.0] [QUIC] Root listener/connection while waiting on new connection/stream event #74740
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport of the first commit of #74450 to release/7.0
Fixes #73377.
Root cause
When awaiting reading from
Channel
(whether it's channel of connections in listener or channel of stream in connection) we're technically awaiting a native event from msquic. During this awaiting we were hitting another instance of the problem stated in https://devblogs.microsoft.com/pfxteam/keeping-async-methods-alive/. In other words, GC would collect listener/connection while we were waiting for NEW_CONNECTION/NEW_STREAM events.Fix
Added rooting of listener/connection in
AcceptIncoming...
methods. Also added tests directly targeted at this problem (they were consistently failing before the fix, always on the first run).Customer Impact
The error manifests as a hang. A hang about which a user is not able to do anything. And it might lead to a false suspicion that a dead-lock is happening.
Note that we did have the same problem before (in 6.0), even with more places where GC had an opportunity to collect Listener/Connection/Stream. My suspicion is timing changes due to API changes and/or other changes elsewhere just uncovered this.
Testing
Re-enable failing tests as part of fix. Also added 2 more targeted tests that were failing before the fix and now are passing.
Risk
Low since Quic is in preview mode. Should not have impact on HTTP/3.