Intermittent timeouts sending data to satellites when clients are under heavy load #92

@hotchkj

Description

Our satellite is recording a significant number of dropped spans from a deployed .NET client, and after digging through the code, I believe I've found a potential bug in the tracing code. I'd like feedback on whether this is a valid concern.

What I observe is that the tracer has no mechanism to prevent ServicePoint.ConnectionLimit from being exceeded: if we end up with, say, 10-20 Flushes in progress but only 2 HTTP connections to process them, many of them may time out and have their spans dropped.

Requested Info           Value
Tested library version   v0.8.0.362
Runtime                  .NET Framework 4.8
Assembly target          .NET Standard 2.0

A timeout visible on the client via a proxy looks something like the following.
Example request:

POST https://satellite/api/v2/reports HTTP/1.1
Accept: application/octet-stream
Content-Type: application/octet-stream
Lightstep-Access-Token: token
Host: satellite
Content-Length: 509
Expect: 100-continue
(no body sent)

Satellite response recorded by the proxy 30 seconds later:

HTTP/1.1 408 Request body incomplete
Date: Thu, 13 Aug 2020 13:12:07 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache, must-revalidate
Timestamp: 14:12:07.486
The request body did not contain the specified number of bytes. Got 0, expected 509

This looks to be a problem when sending data to the satellite takes a while and a number of concurrent Flushes are in progress; it has only been observed when the clients are very busy with lots of spans. It seems unlikely that sending 509 bytes of body genuinely takes more than 30 seconds when the same system sends the 221 bytes of headers just fine immediately beforehand. My reasoning is as follows.

Preamble:

  • On .NET Framework on Windows, HTTP requests go through a ServicePoint (one per base address), which has a ConnectionLimit property defaulting to 2 via ServicePointManager.DefaultConnectionLimit (see the sketch after this list)
  • In addition, we're forced to use HTTP/1.1 rather than HTTP/2, because .NET Framework and the supported versions of Windows don't have HTTP/2 available
  • The older WinHttpHandler sends the request headers and body as two separate Win32 calls, which matches the proxy capture above: the headers arrive, the body never does
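
For illustration, a minimal standalone sketch of where that limit lives and how a consumer could raise it (the satellite URL here is a placeholder):

using System;
using System.Net;

class ConnectionLimitSketch
{
    static void Main()
    {
        // Placeholder address; ServicePoints are keyed by scheme://host:port.
        var satellite = new Uri("https://satellite/api/v2/reports");

        // Process-wide default applied to new ServicePoints: 2 on .NET Framework.
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);

        // Per-endpoint limit; raising it allows more concurrent sends.
        ServicePoint sp = ServicePointManager.FindServicePoint(satellite);
        Console.WriteLine(sp.ConnectionLimit); // 2 unless changed
        sp.ConnectionLimit = 10;
    }
}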

The tracing Flush is run via

_reportLoop = new Timer(async e => await Flush(), null, TimeSpan.Zero, _options.ReportPeriod);

I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts, one that gets worse the greater the load (a standalone sketch of the pile-up follows this list):

  • The above Timer means that every ReportPeriod (default 5 seconds) a new Flush is started regardless of whether the previous one has finished, i.e. N Flush operations can be in progress concurrently
  • Each Flush calls HttpClient.SendAsync, which takes some time to complete, so there can be N HTTP calls in flight at once; this gets worse with more spans and with shorter reporting periods
  • If the underlying ServicePoint connection isn't handed back to WinHttpHandler in time, the C# operation times out via the internal cancellation token, which is exactly what I see reported in the associated exception handler: Timed out sending report to satellite
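
To make the pile-up concrete, here's a minimal self-contained sketch. Flush and the 5-second period mirror the tracer's shape, but this is not the library's code; the 12-second delay stands in for a slow satellite send:

using System;
using System.Threading;
using System.Threading.Tasks;

class TimerOverlapSketch
{
    static int _inFlight;

    // Stand-in for the tracer's Flush; the delay simulates a send to the
    // satellite that outlasts the report period.
    static async Task Flush()
    {
        int now = Interlocked.Increment(ref _inFlight);
        Console.WriteLine($"Flush started, {now} in flight");
        await Task.Delay(TimeSpan.FromSeconds(12));
        Interlocked.Decrement(ref _inFlight);
    }

    static void Main()
    {
        // Same shape as the tracer's report loop: nothing stops the
        // callbacks from overlapping once Flush takes longer than the period.
        var reportPeriod = TimeSpan.FromSeconds(5);
        using (new Timer(async e => await Flush(), null, TimeSpan.Zero, reportPeriod))
        {
            Thread.Sleep(TimeSpan.FromSeconds(30)); // watch the in-flight count climb
        }
    }
}

Running this, the in-flight count exceeds the default connection limit of 2, at which point real sends would queue inside the handler and eventually hit the internal timeout.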

If you agree, I would propose adding a WithConcurrentReports option that ties a SemaphoreSlim to the connection limit on the HttpClientHandler, so that .NET Framework consumers can control and match outgoing concurrency safely.
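
Roughly what I have in mind, as a sketch only: WithConcurrentReports, ReportSender and SendReportAsync are hypothetical names rather than existing library API, and MaxConnectionsPerServer assumes .NET Framework 4.7.1+:

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ReportSender
{
    readonly HttpClient _client;
    readonly SemaphoreSlim _gate;

    // maxConcurrentReports would come from the proposed WithConcurrentReports(n).
    public ReportSender(Uri satellite, int maxConcurrentReports)
    {
        _gate = new SemaphoreSlim(maxConcurrentReports, maxConcurrentReports);

        // Match the handler's connection limit to the gate so a Flush that
        // acquires the semaphore always has a connection available.
        var handler = new HttpClientHandler
        {
            MaxConnectionsPerServer = maxConcurrentReports
        };
        _client = new HttpClient(handler) { BaseAddress = satellite };
    }

    public async Task SendReportAsync(HttpContent report, CancellationToken ct)
    {
        // Excess Flushes queue here instead of opening requests that sit
        // waiting on an exhausted ServicePoint and then time out.
        await _gate.WaitAsync(ct);
        try
        {
            using (var response = await _client.PostAsync("api/v2/reports", report, ct))
            {
                response.EnsureSuccessStatusCode();
            }
        }
        finally
        {
            _gate.Release();
        }
    }
}

The gate could equally be made to drop a report instead of queueing it, if unbounded queueing under sustained load is a concern.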
