Intermittent timeouts sending data to satellites when clients are under heavy load #92

@hotchkj

Description

Our satellite is recording a significant number of dropped spans from a deployed .NET client, and after digging through the code, I believe I've found a potential bug in the tracing code. I'd like feedback on whether this is a valid concern.

What I observe is that the tracer has no mechanism to prevent ServicePoint.ConnectionLimit from being exceeded: if we end up with, say, 10-20 Flushes in progress but only 2 HTTP connections to process them, many of them may time out and have their spans dropped.

Requested Info           Value
Tested library version   v0.8.0.362
Runtime                  .NET Framework 4.8
Assembly target          .NET Standard 2.0

A timeout visible on the client via a proxy looks something like the following.
Example request:

POST https://satellite/api/v2/reports HTTP/1.1
Accept: application/octet-stream
Content-Type: application/octet-stream
Lightstep-Access-Token: token
Host: satellite
Content-Length: 509
Expect: 100-continue
(no body sent)

Satellite response recorded by the proxy 30 seconds later:

HTTP/1.1 408 Request body incomplete
Date: Thu, 13 Aug 2020 13:12:07 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache, must-revalidate
Timestamp: 14:12:07.486
The request body did not contain the specified number of bytes. Got 0, expected 509

This looks to be a problem when sending data to the satellite takes a while and a number of concurrent Flushes are in progress; it has only been observed when the clients are very busy with lots of spans. It seems unlikely that sending 509 bytes of body genuinely takes more than 30 seconds when the same system sends the 221 bytes of headers just fine immediately beforehand. My reasoning is as follows.

Preamble:

  • On .NET Framework on Windows, HTTP requests go through a ServicePoint (one per base address), which has a ConnectionLimit property defaulting to 2 via ServicePointManager.DefaultConnectionLimit (see the sketch after this list)
  • In addition, we're forced to use HTTP/1.1 rather than HTTP/2, because .NET Framework and the supported versions of Windows don't have HTTP/2 available
  • The older WinHttpHandler sends the request headers and body as two separate Win32 calls, which matches the proxy capture above: the headers arrive, the body never does
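
For illustration, a minimal standalone sketch of where that limit lives and how a consumer could raise it (the satellite URL here is a placeholder):

using System;
using System.Net;

class ConnectionLimitSketch
{
    static void Main()
    {
        // Placeholder address; ServicePoints are keyed by scheme://host:port.
        var satellite = new Uri("https://satellite/api/v2/reports");

        // Process-wide default applied to new ServicePoints: 2 on .NET Framework.
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);

        // Per-endpoint limit; raising it allows more concurrent sends.
        ServicePoint sp = ServicePointManager.FindServicePoint(satellite);
        Console.WriteLine(sp.ConnectionLimit); // 2 unless changed
        sp.ConnectionLimit = 10;
    }
}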

The tracing Flush is run via

_reportLoop = new Timer(async e => await Flush(), null, TimeSpan.Zero, _options.ReportPeriod);

I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts, one that gets worse the greater the load (a standalone sketch of the pile-up follows this list):

  • The above Timer means that every ReportPeriod (default 5 seconds) a new Flush is started regardless of whether the previous one has finished, i.e. N Flush operations can be in progress concurrently
  • Each Flush calls HttpClient.SendAsync, which takes some time to complete, so there can be N HTTP calls in flight at once; this gets worse with more spans and with shorter reporting periods
  • If the underlying ServicePoint connection isn't handed back to WinHttpHandler in time, the C# operation times out via the internal cancellation token, which is exactly what I see reported in the associated exception handler: Timed out sending report to satellite
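
To make the pile-up concrete, here's a minimal self-contained sketch. Flush and the 5-second period mirror the tracer's shape, but this is not the library's code; the 12-second delay stands in for a slow satellite send:

using System;
using System.Threading;
using System.Threading.Tasks;

class TimerOverlapSketch
{
    static int _inFlight;

    // Stand-in for the tracer's Flush; the delay simulates a send to the
    // satellite that outlasts the report period.
    static async Task Flush()
    {
        int now = Interlocked.Increment(ref _inFlight);
        Console.WriteLine($"Flush started, {now} in flight");
        await Task.Delay(TimeSpan.FromSeconds(12));
        Interlocked.Decrement(ref _inFlight);
    }

    static void Main()
    {
        // Same shape as the tracer's report loop: nothing stops the
        // callbacks from overlapping once Flush takes longer than the period.
        var reportPeriod = TimeSpan.FromSeconds(5);
        using (new Timer(async e => await Flush(), null, TimeSpan.Zero, reportPeriod))
        {
            Thread.Sleep(TimeSpan.FromSeconds(30)); // watch the in-flight count climb
        }
    }
}

Running this, the in-flight count exceeds the default connection limit of 2, at which point real sends would queue inside the handler and eventually hit the internal timeout.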

If you agree, I would propose adding a WithConcurrentReports option that ties a SemaphoreSlim to the connection limit on the HttpClientHandler, so that .NET Framework consumers can control and match outgoing concurrency safely.
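
Roughly what I have in mind, as a sketch only: WithConcurrentReports, ReportSender and SendReportAsync are hypothetical names rather than existing library API, and MaxConnectionsPerServer assumes .NET Framework 4.7.1+:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ReportSender
{
    readonly HttpClient _client;
    readonly SemaphoreSlim _gate;

    // maxConcurrentReports would come from the proposed WithConcurrentReports(n).
    public ReportSender(Uri satellite, int maxConcurrentReports)
    {
        _gate = new SemaphoreSlim(maxConcurrentReports, maxConcurrentReports);

        // Match the handler's connection limit to the gate so a Flush that
        // acquires the semaphore always has a connection available.
        var handler = new HttpClientHandler
        {
            MaxConnectionsPerServer = maxConcurrentReports
        };
        _client = new HttpClient(handler) { BaseAddress = satellite };
    }

    public async Task SendReportAsync(HttpContent report, CancellationToken ct)
    {
        // Excess Flushes queue here instead of opening requests that sit
        // waiting on an exhausted ServicePoint and then time out.
        await _gate.WaitAsync(ct);
        try
        {
            using (var response = await _client.PostAsync("api/v2/reports", report, ct))
            {
                response.EnsureSuccessStatusCode();
            }
        }
        finally
        {
            _gate.Release();
        }
    }
}

The gate could equally be made to drop a report instead of queueing it, if unbounded queueing under sustained load is a concern.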
