-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
Consider some code that is trying to establish 1000s of TCP connections to the same URL (e.g. a load test tool trying to simulate 8000 users, each with its own connection).
The internet is not perfect, so some (or even many) of these connections may fail and need to be retried. For example, the OS may fail to resolve a DNS query for whatever reason (maybe it got lost in the network, maybe the OS itself just has a 16-entry buffer and drops requests if too many are concurrently executing).
Socket.ConnectAsync(EndPoint, CancellationToken) takes a CancellationToken:
| public ValueTask ConnectAsync(EndPoint remoteEP, CancellationToken cancellationToken) |
One would expect to be able to use this to cancel connection attempts on timeout. Indeed, this is where the CancellationToken constructed from SocketHttpHandler.ConnectTimeout ends up in when we try to establish HTTP connections.
But let's zoom in on the DNS part of the connection attempt:
> System.Net.NameResolution.dll!System.Net.Dns.GetHostEntryOrAddressesCoreAsync(string hostName, bool justReturnParsedIp, bool throwOnIIPAny, bool justAddresses, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 501 C#
System.Net.NameResolution.dll!System.Net.Dns.GetHostAddressesAsync(string hostNameOrAddress, System.Net.Sockets.AddressFamily family, System.Threading.CancellationToken cancellationToken) Line 241 C#
System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.DnsConnectAsync(System.Net.DnsEndPoint endPoint, System.Net.Sockets.SocketType socketType, System.Net.Sockets.ProtocolType protocolType) Line 667 C#
System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.Sockets.SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable) Line 2697 C#
System.Net.Sockets.dll!System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ConnectAsync(System.Net.Sockets.Socket socket) Line 1182 C#
System.Net.Sockets.dll!System.Net.Sockets.Socket.ConnectAsync(System.Net.EndPoint remoteEP, System.Threading.CancellationToken cancellationToken) Line 101 C#
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(string host, int port, System.Net.Http.HttpRequestMessage initialRequest, bool async, System.Threading.CancellationToken cancellationToken) Line 1600 C#
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.ConnectAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1526 C#
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(System.Net.Http.HttpRequestMessage request, bool async, System.Threading.CancellationToken cancellationToken) Line 1631 C#
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) Line 480 C#
System.Net.Http.dll!System.Net.Http.HttpConnectionPool.CheckForHttp11ConnectionInjection.AnonymousMethod__77_0((System.Net.Http.HttpConnectionPool thisRef, System.Net.Http.HttpConnectionPool.RequestQueue<System.Net.Http.HttpConnection>.QueueItem queueItem) state) Line 541 C#
One of the first things Socket.ConnectAsync() does it call AwaitableSocketAsyncEventArgs.ConnectAsync() to start the process of connecting (and after that waits for it to complete, with proper cancellation handling for the wait):
| ValueTask connectTask = saea.ConnectAsync(this); |
This just sets up some bookeeping and takes us to an internal overload of Socket.ConnectAsync():
| internal bool ConnectAsync(SocketAsyncEventArgs e, bool userSocket, bool saeaCancelable) |
If we are dealing with a DnsEndPoint (i.e. the URL had a hostname instead of IP address), this now calls DnsConnectAsync():
| pending = e.DnsConnectAsync(dnsEP, default, default); |
DnsConnectAsync() will try to grab a CancellationToken for the DNS query from a variable called _multipleConnectCancellation. I do not understand what that is but in my experiments the value is simply null, so we end up not passing any CancellationToken to Dns.GetHostAddressesAsync(). The actual CancellationToken accompanying our Socket.ConnectAsync() we left behind several method calls ago.
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs
Lines 649 to 659 in 62e50b6
| internal bool DnsConnectAsync(DnsEndPoint endPoint, SocketType socketType, ProtocolType protocolType) | |
| { | |
| Debug.Assert(endPoint.AddressFamily == AddressFamily.Unspecified || | |
| endPoint.AddressFamily == AddressFamily.InterNetwork || | |
| endPoint.AddressFamily == AddressFamily.InterNetworkV6); | |
| CancellationToken cancellationToken = _multipleConnectCancellation?.Token ?? default; | |
| // In .NET 5 and earlier, the APM implementation allowed for synchronous exceptions from this to propagate | |
| // synchronously. This call is made here rather than in the Core async method below to preserve that behavior. | |
| Task<IPAddress[]> addressesTask = Dns.GetHostAddressesAsync(endPoint.Host, endPoint.AddressFamily, cancellationToken); |
In other words, even if a socket connection attempt is cancelled (e.g. due to timeout), the DNS query it triggered remains active.
On its own, it just sounds inefficienct. However, When we combine this with #49171, we get in trouble. We can now get into the following situation:
- A large queue of pending connection attempts has built up in this one-at-a-time DNS queue (e.g. because there was a delay and OS resolver needed 5-10 seconds of retries for a single request, while the app at the sime time was continuing in its job of creating 1000s of connections).
- Because the original slow request took a long time, the majority of the queue may already have timed out! The DNS requests are still in the queue and will still be executed because the CancellationToken is not passed to
Dns.GetHostAddressesAsync(). - Some of these DNS queries may also be slow (after all, there are bound to be some failures in thousands of them) and again require 5+ seconds of waiting before OS resolver gets a response after retries (or never?).
- Meanwhile, the HttpConnectionPool keeps retrying and making new connections but they all keep timing out because the DNS queries are all still sitting in the queue by the time the timeout elapses and the CancellationToken (which DNS is not listening to) is signaled.
In a sample scenario with 8000 HttpClients, after the initial slow DNS resolve starts the described chain of events, I am seeing average DNS resolve durations of 30-40 seconds (this includes the time spent in the queue).
Running this on a new D32 VM on Ubuntu 22 with .NET 7, I soon (depends on run, but no more than 30 seconds) start to see output similar to:
azureuser@hack01:~/33/net7.0$ dotnet ConsoleApp33.dll
Successfully completed 0 requests, last one took 0.0 ms and finished 4.6 ms ago. 0 failures seen. 0/8000 tasks successfully completed at least one request.
Successfully completed 62 requests, last one took 902.8 ms and finished 77.6 ms ago. 0 failures seen. 62/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 49.5 ms ago. 0 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 1049.7 ms ago. 6 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 2049.7 ms ago. 12 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 3049.4 ms ago. 15 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 4048.9 ms ago. 39 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 5049.6 ms ago. 184 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 6049.8 ms ago. 330 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 7049.8 ms ago. 468 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 8050.0 ms ago. 611 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 9049.9 ms ago. 763 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 10049.0 ms ago. 904 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 11049.7 ms ago. 1034 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 12049.6 ms ago. 1171 failures seen. 68/8000 tasks successfully completed at least one request.
Successfully completed 68 requests, last one took 1906.8 ms and finished 13049.9 ms ago. 1313 failures seen. 68/8000 tasks successfully completed at least one request.
None of the fresh (not yet connected) tasks at this point will successfully make an HTTP request - all retries will fail because they will all time out while sitting in the DNS queue.
dotnet-counters shows the queue size in this particular case to be over 1 minute!
[System.Net.NameResolution]
Average DNS Lookup Duration (ms) 67,039.097
Current DNS Lookups 7,932
DNS Lookups Requested 22,315
Related to #81023 but different, as in my case I have a forever-loop of failed requests because the app keeps retrying fast enough to maintain critical mass in the DNS request queue.
