Description
Hello Everyone,
I work on the OpenTracing C++ specification and the instrumentations for NGINX and Envoy. Recently, I've been working with LightStep to improve the performance of their C++ tracer. Our efforts to reduce instrumentation cost and improve collection throughput led us to a more efficient and streamlined design that I propose we adopt as one of the SDKs for OpenTelemetry. The key components of the design are 1) we remove intermediate storage on span objects and instead serialize eagerly as methods are called and 2) we use a domain-specific load balancing algorithm built upon non-blocking sockets, vectored-io, and io multiplexing.
Note: This only discusses the design as it relates to tracing. I plan to update it to include metrics as the proposal progresses.
Design
At a high level, the design looks like this:
- To each span we associate a chain of buffers that represent the span's protobuf serialization and we build up the serialization as methods on the span are called.
- When a span is finished, its serialization chain gets moved into a lock-free multiple-producer, single-consumer circular buffer (or discarded if the buffer is full).
- A separate recorder thread regularly flushes the serializations in the circular buffer and streams them to multiple endpoints using the HTTP/1.1 chunked encoding (one chunk per span) with domain-specific load balancing that works as follows:
- When the recorder flushes, it picks an available endpoint connection and makes a non-blocking vectored-io write system call to the connection's socket with the entire contents of the circular buffer.
- If the amount of data written is less than the size of the circular buffer, the recorder binds any remaining partially written span to the connection (to be written later when its socket is available for writing again) and repeats the previous step with one of the other available connections.
- If no connections are available for writing, it blocks with epoll (or the target platform's equivalent) until one of the connections is available for writing again.
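As a sketch of (1), the serialization chain can be thought of as a linked sequence of fixed-size blocks that span methods append already-serialized bytes to. The class name, block size, and methods below are illustrative, not LightStep's actual implementation:

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of a serialization chain: a sequence of fixed-size
// buffers that a span appends its serialization to as methods are called,
// so no intermediate span structure is ever materialized.
class SerializationChain {
 public:
  // Append raw serialized bytes, growing the chain one block at a time.
  void Append(const char* data, size_t size) {
    while (size > 0) {
      if (blocks_.empty() || blocks_.back().used == kBlockSize) {
        blocks_.push_back(Block{});
      }
      Block& block = blocks_.back();
      size_t n = std::min(size, kBlockSize - block.used);
      std::memcpy(block.data + block.used, data, n);
      block.used += n;
      data += n;
      size -= n;
    }
  }

  size_t size() const {
    size_t total = 0;
    for (const Block& b : blocks_) total += b.used;
    return total;
  }

  // Flatten for inspection; the real recorder would instead hand the blocks
  // directly to a vectored write, one iovec per block, without copying.
  std::string Flatten() const {
    std::string out;
    for (const Block& b : blocks_) out.append(b.data, b.used);
    return out;
  }

 private:
  static constexpr size_t kBlockSize = 256;
  struct Block {
    char data[kBlockSize];
    size_t used = 0;
  };
  std::vector<Block> blocks_;
};
```

Keeping the serialization in blocks rather than one contiguous buffer avoids reallocation-and-copy as a span grows, and fits naturally with the vectored-io writes in (3).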
Here's a diagram of the architecture.
LightStep's C++ tracer provides an example implementation of this design. These are the main components:
- The serialization chain for a span's protobuf serialization.
- The span class with no intermediate data storage and eager serialization
- The lock-free circular buffer for buffering spans in the recorder.
- The streaming recorder with domain-specific load balancing
Rationale
Serializing eagerly in (1), instead of storing data in intermediary structures that get serialized later when the span is finished, eliminates unnecessary copying and avoids small heap allocations. This leads to much lower instrumentation cost. As part of my work on the LightStep tracer, I developed microbenchmarks that compare the eager serialization approach to a more traditional approach that stores data in protobuf-generated classes and serializes when traces are uploaded. For a span with 10 small key-values attached, I measured the cost of starting and then finishing a span, and the results show much better performance for the eager serialization approach:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanSetTag10/late-serialization 5979 ns 4673 ns 148715
BM_SpanSetTag10/eager-serialization 845 ns 845 ns 833609
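To make the eager approach concrete, here is a minimal sketch of what serializing a tag at SetTag time could look like: hand-rolled protobuf wire-format helpers write the key-value directly into the span's buffer, with no protobuf-generated class in between. The field numbers are illustrative, not the opentelemetry-proto schema:

```cpp
#include <cstdint>
#include <string>

// Base-128 varint, the protobuf wire format's integer encoding.
void AppendVarint(std::string& out, uint64_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
}

// Length-delimited field (wire type 2): tag, length, payload.
void AppendStringField(std::string& out, uint32_t field_number,
                       const std::string& value) {
  AppendVarint(out, (field_number << 3) | 2);
  AppendVarint(out, value.size());
  out.append(value);
}

// Eager SetTag: serialize the key-value pair straight into the span's
// buffer as a nested message. Field numbers 1, 2, and 9 are illustrative.
void SetTag(std::string& span_buffer, const std::string& key,
            const std::string& value) {
  std::string kv;
  AppendStringField(kv, 1, key);
  AppendStringField(kv, 2, value);
  AppendStringField(span_buffer, 9, kv);
}
```

Nothing is retained besides the serialized bytes themselves, which is why the heap allocations and copies of the late-serialization path disappear.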
Using a lock-free circular buffer in (2) allows the tracer to remain performant in high-concurrency scenarios. We've found that mutex-protected span buffering causes significant contention when multiple threads finish spans concurrently. I benchmarked the creation of 4000 spans partitioned evenly across a varying number of threads and got these results for a mutex-protected buffer:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/mutex/1 thread 4714869 ns 57876 ns 1000
BM_SpanCreationThreaded/mutex/2 threads 7351604 ns 104345 ns 1000
BM_SpanCreationThreaded/mutex/4 threads 8019629 ns 185328 ns 1000
BM_SpanCreationThreaded/mutex/8 threads 8582282 ns 328126 ns 1000
The mutex-protected buffer shows slower performance when we use more than a single thread. By comparison, lock-free buffering doesn't exhibit any such degradation:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/stream/1 thread 1120340 ns 55628 ns 11704
BM_SpanCreationThreaded/stream/2 threads 1375883 ns 111743 ns 7542
BM_SpanCreationThreaded/stream/4 threads 1672191 ns 182614 ns 3970
BM_SpanCreationThreaded/stream/8 threads 1007622 ns 265792 ns 2606
By using multiple load-balanced endpoint connections, the transport in (3) allows spans to be uploaded at a high rate without dropping data. The domain-specific load balancer takes advantage of the property that spans can be routed to any collection endpoint, and naturally adapts to back-pressure from the endpoints by routing data to where it's most capable of being received. Consider what happens as a collection endpoint starts to reach its capacity to process spans:
- At the network level, the endpoint ACKs the TCP packets of span data at a slower rate.
- On the recorder side, the vectored-write system calls for that endpoint send less data, and eventually the socket stops accepting writes.
- The recorder then naturally sends spans to other endpoints until epoll reports that the overloaded endpoint's socket is writable again and capable of receiving data.
Dependencies
A goal of the SDK is to have as minimal a set of dependencies as possible. Because the eager serialization approach uses manual serialization code instead of the protobuf-generated classes, we can avoid requiring protobuf as a dependency. LightStep's current implementation uses libevent and c-ares for portable asynchronous networking and DNS resolution -- but those parts of the code could be hidden behind an interface in a way that would allow alternative libraries to be used or platform-specific implementations.
Customization Points
Because this approach never generates a SpanData-like structure with accessors, the customization points are different from the traditional exporter approach. The main customization point would be the serialization functions, where a vendor could provide alternative implementations to write to a different wire format.
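One possible shape for that customization point is an abstract serializer that the span invokes as its methods are called, so a vendor swaps the wire format without the SDK ever materializing a SpanData object. The interface and class names below are illustrative only, not a proposed API:

```cpp
#include <string>

// Hypothetical customization point: vendors implement this interface to
// control how span events are serialized into the span's buffer.
class SpanSerializer {
 public:
  virtual ~SpanSerializer() = default;
  virtual void OnOperationName(std::string& buffer,
                               const std::string& name) = 0;
  virtual void OnStringTag(std::string& buffer, const std::string& key,
                           const std::string& value) = 0;
};

// Example vendor implementation emitting a trivial text format instead of
// the default opentelemetry-proto serialization.
class TextSerializer : public SpanSerializer {
 public:
  void OnOperationName(std::string& buffer,
                       const std::string& name) override {
    buffer += "op=" + name + ";";
  }
  void OnStringTag(std::string& buffer, const std::string& key,
                   const std::string& value) override {
    buffer += key + "=" + value + ";";
  }
};
```

Since the serializer is called eagerly, a vendor's format must be appendable incrementally; formats that require knowing the whole span up front would need buffering on the vendor's side.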
Next steps
The LightStep implementation could be adapted into a default OpenTelemetry tracer and SDK that uses the opentelemetry-proto format but provides a customization point for the serialization. This SDK would prioritize efficiency and high throughput. While it might not be the right choice for all use cases, I would expect the OpenTelemetry C++ API to be flexible enough to support a variety of different implementations, so other use cases could be serviced by either an alternative implementation of the OpenTelemetry API or a different SDK (if we decide to support more than one).