Description
Our satellite is recording a significant number of dropped spans from a deployed .NET client, and after digging through the code, I believe I've found a potential bug in the tracing code. I would like feedback/thoughts on whether this is a valid concern.
What I observe is that the tracer has no mechanism to prevent `ServicePoint.ConnectionLimit` being exceeded, and if we end up with, say, 10-20 `Flush`es in progress with only 2 HTTP connections to process them, many of them may time out and then have their spans dropped.
| Requested Info | Value |
|---|---|
| Tested library version | v0.8.0.362 |
| Runtime | .NET Framework 4.8 |
| Assembly target | .NET Standard 2.0 |
A timeout captured on the client via a proxy might look something like this.
Example request:

```
POST https://satellite/api/v2/reports HTTP/1.1
Accept: application/octet-stream
Content-Type: application/octet-stream
Lightstep-Access-Token: token
Host: satellite
Content-Length: 509
Expect: 100-continue
```

(no body sent)
Satellite response recorded by the proxy 30 seconds later:

```
HTTP/1.1 408 Request body incomplete
Date: Thu, 13 Aug 2020 13:12:07 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache, must-revalidate
Timestamp: 14:12:07.486

The request body did not contain the specified number of bytes. Got 0, expected 509
```
This looks to be a problem when sending data to the satellite takes a while and a number of concurrent `Flush`es are in progress; it has only been observed when the clients are very busy with lots of spans. It seems unlikely that the system genuinely takes >30 seconds to send 509 bytes of body when it sent 221 characters of headers just fine moments before. My reasoning is as follows.
Preamble:

- On .NET Framework on Windows, HTTP is delegated via `ServicePoint` (one per base address), which has a `ConnectionLimit` property (defaulting to 2 courtesy of the `ServicePointManager.DefaultConnectionLimit` default; see the sketch after this list).
- In addition, we're forced to use HTTP/1.1 rather than HTTP/2, because .NET Framework and the supported versions of Windows don't have HTTP/2 available.
- The older `WinHttpHandler` sends the request header & body as two separate Win32 instructions.
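To make the first point concrete, here is a minimal sketch (the satellite URL is just a placeholder) showing how the per-host `ServicePoint` and its `ConnectionLimit` can be inspected, and raised as a stop-gap, on .NET Framework:

```csharp
using System;
using System.Net;

class ConnectionLimitCheck
{
    static void Main()
    {
        // Placeholder satellite address, used purely for illustration.
        var satelliteUri = new Uri("https://satellite/api/v2/reports");

        // On .NET Framework each base address gets a ServicePoint whose
        // ConnectionLimit starts from ServicePointManager.DefaultConnectionLimit (2).
        ServicePoint sp = ServicePointManager.FindServicePoint(satelliteUri);
        Console.WriteLine($"DefaultConnectionLimit: {ServicePointManager.DefaultConnectionLimit}");
        Console.WriteLine($"ConnectionLimit for {satelliteUri.Host}: {sp.ConnectionLimit}");

        // One possible stop-gap today: raise the limit for the satellite host only.
        sp.ConnectionLimit = 10;
    }
}
```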
The tracing `Flush` is run via:

```csharp
_reportLoop = new Timer(async e => await Flush(), null, TimeSpan.Zero, _options.ReportPeriod);
```
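Note that a `System.Threading.Timer` does not wait for an `async` callback to finish before firing again. A standalone sketch (with a hypothetical `SlowFlush` standing in for the real `Flush`) shows how the invocations overlap:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TimerOverlapDemo
{
    static int _inFlight;

    // Stand-in for Flush(): pretend each report takes 12 seconds to send.
    static async Task SlowFlush()
    {
        int now = Interlocked.Increment(ref _inFlight);
        Console.WriteLine($"{DateTime.Now:HH:mm:ss} flush started, {now} in flight");
        await Task.Delay(TimeSpan.FromSeconds(12));
        Interlocked.Decrement(ref _inFlight);
    }

    static void Main()
    {
        // Same pattern as the tracer: fire every 5 seconds, never await the previous callback.
        using (var timer = new Timer(async _ => await SlowFlush(), null, TimeSpan.Zero, TimeSpan.FromSeconds(5)))
        {
            Thread.Sleep(TimeSpan.FromSeconds(30)); // the in-flight count climbs well past 2
        }
    }
}
```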
I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts, which get worse the greater the load:
- The above `Timer` means that every report period (default 5 seconds) a new `Flush` is called, i.e. N `Flush` operations can occur concurrently.
- Each `Flush` calls `HttpClient.SendAsync`, which will take some time to complete, so there can be N attempts to make an HTTP call in flight at once, getting worse the more spans there are and the shorter the reporting period chosen.
- If the underlying `ServicePoint` connection isn't made available back to `WinHttpHandler` in time, the C# operation will time out due to the internal cancellation token, which is exactly what I see reported in the associated exception handler: `Timed out sending report to satellite` (a reproduction sketch follows this list).
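For illustration only, here is a sketch of those three steps combined. It is not runnable against a real satellite (the URL is a placeholder), but it shows how requests queued behind the 2 available connections can hit a 30-second cancellation before their body is ever sent, which would explain the 408 above:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ConnectionLimitTimeoutDemo
{
    static async Task Main()
    {
        // Keep the .NET Framework default of 2 connections per host.
        ServicePointManager.DefaultConnectionLimit = 2;

        // Placeholder endpoint; any slow HTTPS endpoint would show the same queueing.
        var reportUri = new Uri("https://satellite/api/v2/reports");
        var client = new HttpClient();

        // Simulate 10 overlapping "flushes", each with a 30-second timeout,
        // mirroring the tracer's internal cancellation token.
        var flushes = Enumerable.Range(0, 10).Select(async i =>
        {
            using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
            {
                try
                {
                    var content = new ByteArrayContent(new byte[509]);
                    await client.PostAsync(reportUri, content, cts.Token);
                    Console.WriteLine($"flush {i} sent");
                }
                catch (OperationCanceledException)
                {
                    // Requests queued behind the 2 available connections can
                    // run out of time before they ever send their body.
                    Console.WriteLine($"flush {i} timed out waiting for a connection");
                }
            }
        });

        await Task.WhenAll(flushes);
    }
}
```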
If you agree, I would propose a `WithConcurrentReports` option that links a `SemaphoreSlim` to the connection limit on the `HttpClientHandler`, so that .NET Framework consumers can control and match outgoing concurrency safely. A rough sketch of what I have in mind follows.
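This is only an illustrative sketch of the idea; the class and member names (`ReportSender`, `SendReportAsync`, `maxConcurrentReports`) are hypothetical stand-ins, not existing tracer API:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the proposed WithConcurrentReports behaviour.
public class ReportSender
{
    private readonly SemaphoreSlim _reportGate;
    private readonly HttpClient _client;

    public ReportSender(int maxConcurrentReports)
    {
        // Gate the number of Flushes allowed to call SendAsync at once...
        _reportGate = new SemaphoreSlim(maxConcurrentReports, maxConcurrentReports);

        // ...and match the connection limit so gated requests aren't starved.
        // Assumption: MaxConnectionsPerServer (available on .NET Framework 4.7.1+)
        // is honoured here; if not, ServicePointManager.FindServicePoint(...).ConnectionLimit
        // could be set for the satellite host instead.
        var handler = new HttpClientHandler { MaxConnectionsPerServer = maxConcurrentReports };
        _client = new HttpClient(handler);
    }

    public async Task SendReportAsync(HttpRequestMessage report, CancellationToken token)
    {
        // A Flush that can't get a slot waits here instead of tying up a
        // ServicePoint queue slot and timing out mid-request.
        await _reportGate.WaitAsync(token);
        try
        {
            await _client.SendAsync(report, token);
        }
        finally
        {
            _reportGate.Release();
        }
    }
}
```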