Description
Hello Everyone,
I work on the OpenTracing C++ specification and the instrumentations for NGINX and Envoy. Recently, I've been working with LightStep to improve the performance of their C++ tracer. Our efforts to reduce instrumentation cost and improve collection throughput led us to a more efficient and streamlined design that I propose we adopt as one of the SDKs for OpenTelemetry. The key components of the design are 1) we remove intermediate storage on span objects and instead serialize eagerly as methods are called and 2) we use a domain-specific load balancing algorithm built upon non-blocking sockets, vectored-io, and io multiplexing.
Note: This only discusses the design as it relates to tracing. I plan to update it to include metrics as the proposal progresses.
Design
At a high level, the design looks like this:
- To each span we associate a chain of buffers that represent the span's protobuf serialization and we build up the serialization as methods on the span are called.
- When a span is finished, its serialization chain gets moved into a lock-free multiple-producer, single-consumer circular buffer (or discarded if the buffer is full).
- A separate recorder thread regularly flushes the serializations in the circular buffer and streams them to multiple endpoints using the HTTP/1.1 chunked encoding (one chunk per span) with domain-specific load balancing that works as follows:
- When the recorder flushes, it picks an available endpoint connection and makes a non-blocking vectored-io write system call to the connection's socket with the entire contents of the circular buffer.
- If the amount of data written is less than the size of the circular buffer, the recorder binds any remaining partially written span to the connection (to be written later when its socket is available for writing again) and repeats the previous step with one of the other available connections.
- If no connections are available for writing, it blocks with epoll (or the target platform's equivalent) until one of the connections is available for writing again.
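As a sketch of (1), the serialization chain can be thought of as a linked sequence of fixed-size blocks that span methods append already-serialized bytes to. The class name, block size, and methods below are illustrative, not LightStep's actual implementation:

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch of a serialization chain: a sequence of fixed-size
// buffers that a span appends its serialization to as methods are called,
// so no intermediate span structure is ever materialized.
class SerializationChain {
 public:
  // Append raw serialized bytes, growing the chain one block at a time.
  void Append(const char* data, size_t size) {
    while (size > 0) {
      if (blocks_.empty() || blocks_.back().used == kBlockSize) {
        blocks_.push_back(Block{});
      }
      Block& block = blocks_.back();
      size_t n = std::min(size, kBlockSize - block.used);
      std::memcpy(block.data + block.used, data, n);
      block.used += n;
      data += n;
      size -= n;
    }
  }

  size_t size() const {
    size_t total = 0;
    for (const Block& b : blocks_) total += b.used;
    return total;
  }

  // Flatten for inspection; the real recorder would instead hand the blocks
  // directly to a vectored write, one iovec per block, without copying.
  std::string Flatten() const {
    std::string out;
    for (const Block& b : blocks_) out.append(b.data, b.used);
    return out;
  }

 private:
  static constexpr size_t kBlockSize = 256;
  struct Block {
    char data[kBlockSize];
    size_t used = 0;
  };
  std::vector<Block> blocks_;
};
```

Keeping the serialization in blocks rather than one contiguous buffer avoids reallocation-and-copy as a span grows, and fits naturally with the vectored-io writes in (3).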
Here's a diagram of the architecture.
LightStep's C++ tracer provides an example implementation of this design. These are the main components:
- The serialization chain for a span's protobuf serialization.
- The span class with no intermediate data storage and eager serialization
- The lock-free circular buffer for buffering spans in the recorder.
- The streaming recorder with domain-specific load balancing
Rationale
Serializing eagerly in (1), instead of storing data in intermediary structures that get serialized later when the span is finished, eliminates unnecessary copying and avoids small heap allocations. This leads to much lower instrumentation cost. As part of my work on the LightStep tracer, I developed microbenchmarks that compare the eager serialization approach to a more traditional approach that stores data in protobuf-generated classes and serializes when traces are uploaded. For a span with 10 small key-values attached, I measured the cost of starting and then finishing a span, and the results show much better performance for the eager serialization approach:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanSetTag10/late-serialization 5979 ns 4673 ns 148715
BM_SpanSetTag10/eager-serialization 845 ns 845 ns 833609
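To make the eager approach concrete, here is a minimal sketch of what serializing a tag at SetTag time could look like: hand-rolled protobuf wire-format helpers write the key-value directly into the span's buffer, with no protobuf-generated class in between. The field numbers are illustrative, not the opentelemetry-proto schema:

```cpp
#include <cstdint>
#include <string>

// Base-128 varint, the protobuf wire format's integer encoding.
void AppendVarint(std::string& out, uint64_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
}

// Length-delimited field (wire type 2): tag, length, payload.
void AppendStringField(std::string& out, uint32_t field_number,
                       const std::string& value) {
  AppendVarint(out, (field_number << 3) | 2);
  AppendVarint(out, value.size());
  out.append(value);
}

// Eager SetTag: serialize the key-value pair straight into the span's
// buffer as a nested message. Field numbers 1, 2, and 9 are illustrative.
void SetTag(std::string& span_buffer, const std::string& key,
            const std::string& value) {
  std::string kv;
  AppendStringField(kv, 1, key);
  AppendStringField(kv, 2, value);
  AppendStringField(span_buffer, 9, kv);
}
```

Nothing is retained besides the serialized bytes themselves, which is why the heap allocations and copies of the late-serialization path disappear.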
Using a lock-free circular buffer in (2) allows the tracer to remain performant in high-concurrency scenarios. We've found that mutex-protected span buffering causes significant contention when multiple threads finish spans concurrently. I benchmarked the creation of 4000 spans partitioned evenly across a varying number of threads and got these results for a mutex-protected buffer:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/mutex/1 thread 4714869 ns 57876 ns 1000
BM_SpanCreationThreaded/mutex/2 threads 7351604 ns 104345 ns 1000
BM_SpanCreationThreaded/mutex/4 threads 8019629 ns 185328 ns 1000
BM_SpanCreationThreaded/mutex/8 threads 8582282 ns 328126 ns 1000
The mutex-protected buffer shows slower performance when we use more than a single thread. By comparison, lock-free buffering doesn't exhibit any such degradation:
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/stream/1 thread 1120340 ns 55628 ns 11704
BM_SpanCreationThreaded/stream/2 threads 1375883 ns 111743 ns 7542
BM_SpanCreationThreaded/stream/4 threads 1672191 ns 182614 ns 3970
BM_SpanCreationThreaded/stream/8 threads 1007622 ns 265792 ns 2606
By using multiple load-balanced endpoint connections, the transport in (3) allows spans to be uploaded at a high rate without dropping data. The domain-specific load balancer takes advantage of the property that spans can be routed to any collection endpoint, and naturally adapts to back-pressure from the endpoints by routing data to where it's most capable of being received. Consider what happens as a collection endpoint starts to reach its capacity to process spans:
- At the network level, the endpoint ACKs the TCP packets of span data at a slower rate.
- On the recorder side, the vectored-write system calls for that endpoint send less data, and eventually the socket stops accepting writes.
- The recorder then naturally sends spans to other endpoints until epoll reports that the overloaded endpoint's socket is writable again and capable of receiving data.
Dependencies
A goal of the SDK is to have as minimal a set of dependencies as possible. Because the eager serialization approach uses manual serialization code instead of the protobuf-generated classes, we can avoid requiring protobuf as a dependency. LightStep's current implementation uses libevent and c-ares for portable asynchronous networking and DNS resolution -- but those parts of the code could be hidden behind an interface in a way that would allow alternative libraries to be used or platform-specific implementations.
Customization Points
Because this approach never generates a SpanData-like structure with accessors, the customization points are different from the traditional exporter approach. The main customization point would be the serialization functions, where a vendor could provide alternative implementations to write to a different wire format.
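One possible shape for that customization point is an abstract serializer that the span invokes as its methods are called, so a vendor swaps the wire format without the SDK ever materializing a SpanData object. The interface and class names below are illustrative only, not a proposed API:

```cpp
#include <string>

// Hypothetical customization point: vendors implement this interface to
// control how span events are serialized into the span's buffer.
class SpanSerializer {
 public:
  virtual ~SpanSerializer() = default;
  virtual void OnOperationName(std::string& buffer,
                               const std::string& name) = 0;
  virtual void OnStringTag(std::string& buffer, const std::string& key,
                           const std::string& value) = 0;
};

// Example vendor implementation emitting a trivial text format instead of
// the default opentelemetry-proto serialization.
class TextSerializer : public SpanSerializer {
 public:
  void OnOperationName(std::string& buffer,
                       const std::string& name) override {
    buffer += "op=" + name + ";";
  }
  void OnStringTag(std::string& buffer, const std::string& key,
                   const std::string& value) override {
    buffer += key + "=" + value + ";";
  }
};
```

Since the serializer is called eagerly, a vendor's format must be appendable incrementally; formats that require knowing the whole span up front would need buffering on the vendor's side.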
Next steps
The LightStep implementation could be adapted into a default OpenTelemetry tracer and SDK that uses the opentelemetry-proto format but provides a customization point for the serialization. This SDK would prioritize efficiency and high throughput. While it might not be the right choice for all use cases, I would expect the OpenTelemetry C++ API to be flexible enough to support a variety of different implementations, so other use cases could be serviced by either an alternative implementation of the OpenTelemetry API or a different SDK (if we decide to support more than one).