
Streaming SDK proposal #2

Closed
@rnburn

Description

Hello Everyone,

I work on the OpenTracing C++ specification and the instrumentations for NGINX and Envoy. Recently, I've been working with LightStep to improve the performance of their C++ tracer. Our efforts to reduce instrumentation cost and improve collection throughput led us to a more efficient and streamlined design that I propose we adopt as one of the SDKs for OpenTelemetry. The design has two key components: (1) we remove intermediate storage on span objects and instead serialize eagerly as methods are called, and (2) we use a domain-specific load-balancing algorithm built on non-blocking sockets, vectored I/O, and I/O multiplexing.

Note: This only discusses the design as it relates to tracing. I plan to update it to include metrics as the proposal progresses.

Design

At a high level, the design looks like this:

  1. To each span we associate a chain of buffers representing the span's protobuf serialization, and we build up the serialization as methods on the span are called.
  2. When a span is finished, its serialization chain gets moved into a lock-free multiple-producer, single-consumer circular buffer (or discarded if the buffer is full).
  3. A separate recorder thread regularly flushes the serializations in the circular buffer and streams them to multiple endpoints using HTTP/1.1 chunked encoding (one chunk per span) with domain-specific load balancing that works as follows:
    i. When the recorder flushes, it picks an available endpoint connection and makes a non-blocking vectored I/O write system call to the connection's socket with the entire contents of the circular buffer.
    ii. If the amount of data written is less than the size of the circular buffer, the recorder binds any remaining partially written span to the connection (to be written later when its socket is available for writing again) and repeats (i) with one of the other available connections.
    iii. If no connections are available for writing, it blocks with epoll (or the target platform's equivalent) until one of the connections is available for writing again.

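To make step (i) concrete, here's a minimal sketch of a flush against one connection using a non-blocking vectored write (connection bookkeeping and error handling are omitted; the names are hypothetical, not LightStep's actual API):

```cpp
#include <sys/types.h>
#include <sys/uio.h>  // writev
#include <cerrno>
#include <vector>

// Hypothetical sketch: hand the circular buffer's contents (a list of
// span-serialization fragments) to the kernel with one vectored write.
// Returns the number of bytes the kernel accepted, or -1 if the socket
// can't take any data right now.
ssize_t FlushToConnection(int socket_fd,
                          const std::vector<iovec>& span_fragments) {
  ssize_t rcode = writev(socket_fd, span_fragments.data(),
                         static_cast<int>(span_fragments.size()));
  if (rcode == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    // The connection is saturated: per step (ii), the recorder binds any
    // partially written span to it and moves on to another connection.
    return -1;
  }
  return rcode;
}
```
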
Here's a diagram of the architecture.

LightStep's C++ tracer provides an example implementation of this design. These are the main components:

  1. The serialization chain for a span's protobuf serialization.
  2. The span class with no intermediate data storage and eager serialization.
  3. The lock-free circular buffer for buffering spans in the recorder.
  4. The streaming recorder with domain-specific load balancing.

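As a rough illustration of component (1), a serialization chain could be structured as a linked list of fixed-size blocks that a span appends serialized bytes into. This is a simplified, hypothetical sketch, not LightStep's actual layout:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical sketch of a serialization chain: a singly linked list of
// fixed-size blocks. A span appends its protobuf serialization here as
// methods are called, so there is no per-field heap allocation and no
// intermediate span structure to copy from later.
class SerializationChain {
 public:
  // Appends raw serialized bytes, growing the chain by one block only
  // when the current block is full.
  void Append(const char* data, size_t size) {
    while (size > 0) {
      if (current_->used == BlockSize) {
        current_->next = std::make_unique<Block>();
        current_ = current_->next.get();
      }
      size_t n = std::min(size, BlockSize - current_->used);
      std::memcpy(current_->data.data() + current_->used, data, n);
      current_->used += n;
      data += n;
      size -= n;
    }
  }

 private:
  static constexpr size_t BlockSize = 256;
  struct Block {
    std::array<char, BlockSize> data;
    size_t used = 0;
    std::unique_ptr<Block> next;
  };
  Block head_;
  Block* current_ = &head_;
};
```

When the span finishes, the blocks can be exposed to the recorder as iovec entries, which is what makes the vectored write in step (i) possible without re-copying the data.
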
Rationale

Serializing eagerly in (1), instead of storing data in intermediate structures that get serialized later when the span is finished, eliminates unnecessary copying and allows us to avoid small heap allocations. This leads to much lower instrumentation cost. As part of my work on the LightStep tracer, I developed microbenchmarks that compare the eager serialization approach to a more traditional approach that stores data in protobuf-generated classes and serializes later when traces are uploaded. For a span with 10 small key-values attached, I measured the cost of starting and then finishing a span; the results show much better performance for the eager serialization approach:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BM_SpanSetTag10/late-serialization           5979 ns       4673 ns     148715
BM_SpanSetTag10/eager-serialization           845 ns        845 ns     833609

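To illustrate what serializing eagerly means at the byte level: a tag can be encoded into the chain as a length-delimited protobuf field the moment it's set, using nothing more than varint encoding. A hedged sketch, building on the hypothetical SerializationChain above (these helpers are illustrative, not the tracer's real functions):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Emits a protobuf varint (little-endian base-128) into the chain.
void WriteVarint(SerializationChain& chain, uint64_t x) {
  char buffer[10];
  size_t n = 0;
  do {
    uint8_t byte = x & 0x7f;
    x >>= 7;
    if (x != 0) byte |= 0x80;  // continuation bit
    buffer[n++] = static_cast<char>(byte);
  } while (x != 0);
  chain.Append(buffer, n);
}

// Emits a length-delimited field (wire type 2): tag, length, payload.
// With helpers like this, SetTag can write a key-value into the span's
// chain immediately instead of storing it for later serialization.
void WriteLengthDelimited(SerializationChain& chain, uint32_t field_number,
                          std::string_view payload) {
  WriteVarint(chain, (static_cast<uint64_t>(field_number) << 3) | 2u);
  WriteVarint(chain, payload.size());
  chain.Append(payload.data(), payload.size());
}
```
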
Using a lock-free circular buffer in (2) allows the tracer to remain performant in high-concurrency scenarios. We've found that mutex-protected span buffering causes significant contention when multiple threads finish spans concurrently. I benchmarked the creation of 4000 spans partitioned evenly across a varying number of threads and got these results for a mutex-protected buffer:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/mutex/1 thread   4714869 ns      57876 ns       1000
BM_SpanCreationThreaded/mutex/2 threads  7351604 ns     104345 ns       1000
BM_SpanCreationThreaded/mutex/4 threads  8019629 ns     185328 ns       1000
BM_SpanCreationThreaded/mutex/8 threads  8582282 ns     328126 ns       1000

The mutex-protected buffer shows slower performance as soon as more than a single thread is used. By comparison, lock-free buffering doesn't have any such degradation:

-----------------------------------------------------------------------------
Benchmark                                      Time           CPU Iterations
-----------------------------------------------------------------------------
BM_SpanCreationThreaded/stream/1 thread  1120340 ns      55628 ns      11704
BM_SpanCreationThreaded/stream/2 threads 1375883 ns     111743 ns       7542
BM_SpanCreationThreaded/stream/4 threads 1672191 ns     182614 ns       3970
BM_SpanCreationThreaded/stream/8 threads 1007622 ns     265792 ns       2606

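A minimal sketch of such a lock-free multiple-producer, single-consumer ring is below. Producers reserve a slot with an atomic compare-exchange and drop the span when the buffer is full, so finishing a span never waits on a lock. This is a deliberate simplification, not the tracer's actual buffer:

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

template <class T>
class MpscCircularBuffer {
 public:
  explicit MpscCircularBuffer(size_t capacity) : slots_(capacity) {
    for (auto& slot : slots_) slot.store(nullptr, std::memory_order_relaxed);
  }

  // Called by any producer thread when a span finishes. Returns false
  // (the span is discarded) when the buffer is full.
  bool TryAdd(std::unique_ptr<T>& element) {
    size_t head = head_.load(std::memory_order_relaxed);
    while (true) {
      if (head - tail_.load(std::memory_order_acquire) >= slots_.size()) {
        return false;  // full: drop rather than block the application
      }
      if (head_.compare_exchange_weak(head, head + 1,
                                      std::memory_order_acq_rel)) {
        // Publish into the reserved slot; the consumer skips null slots.
        slots_[head % slots_.size()].store(element.release(),
                                           std::memory_order_release);
        return true;
      }
    }
  }

  // Called only from the single recorder thread.
  std::unique_ptr<T> TryConsume() {
    size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return nullptr;
    T* element = slots_[tail % slots_.size()].exchange(
        nullptr, std::memory_order_acquire);
    if (element == nullptr) return nullptr;  // producer still publishing
    tail_.store(tail + 1, std::memory_order_release);
    return std::unique_ptr<T>(element);
  }

 private:
  std::vector<std::atomic<T*>> slots_;
  std::atomic<size_t> head_{0};
  std::atomic<size_t> tail_{0};
};
```
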
By using multiple load-balanced endpoint connections, the transport in (3) allows spans to be uploaded at a high rate without dropping data. The domain-specific load balancer takes advantage of the property that spans can be routed to any collection endpoint, and it naturally adapts to back-pressure from the endpoints to route data to wherever it's most capable of being received. Consider what happens as a collection endpoint starts to reach its capacity to process spans:

  1. At the network level, the endpoint ACKs the TCP packets of span data at a slower rate.
  2. On the recorder side, the vectored-write system calls for that endpoint send less data, and the socket eventually blocks.
  3. The recorder will then naturally send spans to other endpoints until epoll reports that the overloaded endpoint's socket is no longer blocked and it's capable of receiving data.

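The epoll wait in step (iii) of the flush loop can be as simple as a single epoll_wait call. A hedged sketch (assuming each connection's socket was registered for EPOLLOUT with its fd stored in event.data.fd):

```cpp
#include <sys/epoll.h>

// Hypothetical sketch: when every connection is saturated, block until
// the kernel reports that some socket can accept data again, and return
// that socket's file descriptor.
int WaitForWritableConnection(int epoll_fd) {
  epoll_event event;
  int rcode = epoll_wait(epoll_fd, &event, /*maxevents=*/1, /*timeout=*/-1);
  if (rcode <= 0) {
    return -1;  // error (e.g., EINTR); the caller can retry
  }
  return event.data.fd;  // this connection is writable again
}
```
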
Dependencies

A goal of the SDK is to have as minimal a set of dependencies as possible. Because the eager serialization approach uses manual serialization code instead of the protobuf-generated classes, we can avoid requiring protobuf as a dependency. LightStep's current implementation uses libevent and c-ares for portable asynchronous networking and DNS resolution, but those parts of the code could be hidden behind an interface in a way that would allow alternative libraries or platform-specific implementations to be used.

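Such an interface might look roughly like this (a hypothetical sketch; these names are not from the LightStep code base):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical seam between the recorder and the networking libraries,
// so libevent/c-ares could be swapped for alternatives or for
// platform-specific implementations.
class NetworkBackend {
 public:
  virtual ~NetworkBackend() = default;

  // Resolves a host name asynchronously and invokes the callback with
  // the resolved addresses (textual form, to keep the sketch simple).
  virtual void ResolveAsync(
      const std::string& host,
      std::function<void(std::vector<std::string>)> on_resolved) = 0;

  // Runs the callback when the given socket becomes writable.
  virtual void NotifyWhenWritable(int socket_fd,
                                  std::function<void()> callback) = 0;
};
```
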
Customization Points

Because this approach never generates a SpanData-like structure with accessors, the customization points are different from those of the traditional exporter approach. The main customization point would be the serialization functions, where a vendor could provide alternative implementations to write to a different wire format.

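Concretely, that customization point could take the shape of a small serializer interface that a vendor implements (again a hypothetical sketch, not part of the proposal's actual API):

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical vendor seam: the SDK calls these hooks as span methods
// are invoked, and each implementation writes its own wire format into
// the span's serialization chain.
class SpanSerializer {
 public:
  virtual ~SpanSerializer() = default;
  virtual void SerializeOperationName(SerializationChain& chain,
                                      std::string_view name) = 0;
  virtual void SerializeTag(SerializationChain& chain, std::string_view key,
                            std::string_view value) = 0;
  virtual void SerializeFinish(SerializationChain& chain,
                               uint64_t finish_timestamp_ns) = 0;
};
```
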
Next steps

The LightStep implementation could be adapted into a default OpenTelemetry tracer and SDK that uses the opentelemetry-proto format but provides a customization point for the serialization. This SDK would prioritize efficiency and high throughput. While it might not be the right choice for all use cases, I would expect the OpenTelemetry C++ API to be flexible enough to support a variety of different implementations, so other use cases could be serviced by either an alternative implementation of the OpenTelemetry API or a different SDK (if we decide to support more than one).
