Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design: add tracing design doc #3802

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
142 changes: 142 additions & 0 deletions design/tracing-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
# Design for Supporting Tracing in Contour

Status: Draft

## Abstract
Envoy has rich support for [distributed tracing][1].
However, Contour does not currently offer a way to configure Envoy for tracing.
This design document proposes an implementation of tracing support in Contour.

## Background
Envoy's [documentation on tracing][1] says:
> Distributed tracing allows developers to obtain visualizations of call flows in large service oriented architectures. It can be invaluable in understanding serialization, parallelism, and sources of latency.
Envoy supports:
- generating traces for incoming requests
- joining existing traces when a client sets the `x-client-trace-id` header
- exporting trace data to various third-party providers (Zipkin, Jaeger, Datadog, etc.)

The tracing ecosystem is complex right now.
However, a few observations can be made:
- [OpenTelemetry][2] is a CNCF project which is working to become a standard in the space. It was formed as a merger of the OpenTracing and OpenCensus projects. It supports ingesting trace data in a variety of formats, transforming it, and exporting it to a variety of backends.
- [Zipkin][3] and [Jaeger][4] are two other common providers. Jaeger supports collecting data in the Zipkin format as well as in its own format.

## Goals
- **Support a single global trace configuration** that applies to all requests.
- **Support OpenCensus and Zipkin trace formats**, since together they enable a large set of common backends to be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenCensus has folded into OpenTelemetry now. The opentelemetry collector can export to Zipkin using an exporter so I do not think supporting either should be needed.


## Non Goals
- **Per-HTTPProxy trace configuration.** There is a technical limitation to enabling per-proxy configuration. Trace settings are configured on the HTTP connection manager (HCM). Each TLS-enabled virtual host uses its own HCM, but *all* non-TLS virtual hosts share a single HCM. As such, while it's possible to have unique trace settings for each TLS virtual host, it's not possible to do the same for non-TLS virtual hosts. Note that this same limitation has come up when designing external authorization and global rate limiting as well.
- **Trace formats other than OpenCensus and Zipkin.** While Contour *may* choose to add direct support for other trace formats in the future, our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy.

+1


## High-Level Design
The Contour configuration file will be extended with a new optional `tracing` section.
This configuration block, if present, will enable tracing and will define the trace format and other properties needed to generate and export trace data.
Initially, the `opencensus` and `zipkin` trace formats will be supported.
Copy link

@pavelnikolov pavelnikolov Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you try to implement OpenTelemetry interface and leave it backend agnostic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavelnikolov as far as I know Envoy does not yet support OpenTelemetry: envoyproxy/envoy#9958

Otherwise, yes,, OTel is the direction we'd prefer. One option here is to just wait on implementing support in Contour for tracing until Envoy does support OTel directly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skriss it's already implemented in Envoy. Are you going to prioritise it now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to, we'll see what we can do.


The parameters defined in the configuration file will be used to configure each [HTTP connection manager's `tracing` section][5].

Since the tracing backend is pluggable, Contour will not package any particular backend.
The user is responsible for deploying and operating the tracing backend of their choice, and configuring Contour to make use of it.

## Detailed Design
The `tracing` section of the Contour config file will look like:

(WIP, this is the rough idea)
```yaml
tracing:
# Trace format to use. Valid options are opencensus and zipkin.
format: [opencensus, zipkin]
Copy link

@pavelnikolov pavelnikolov Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't care about the format if you use OpenTelemetry interface(s). Then the user can choose whatever exporter (backend) they see fit. What is more Zipkin is one of the supported backends for OpenTelemetry. So instead of format you might want to choose backend exporter here.

config:
# If tracing.format == opencensus, then this block is expected
# to be populated.
opencensus:
agent-address: otel-collector.default:55678
# If tracing.format == zipkin, then this block is expected
# to be populated.
zipkin:
collector-cluster: <Envoy cluster name>
collector-endpoint: /api/v2/spans
collector-endpoint-version: ...
```
In the presence of the `tracing` block above, all HTTP connection managers (HCM) will be configured for tracing.

A Zipkin-enabled HCM will include the following as an example:
```go
...
Tracing: &http.HttpConnectionManager_Tracing{
Provider: &envoy_config_trace_v3.Tracing_Http{
Name: "envoy.tracers.zipkin",
ConfigType: &envoy_config_trace_v3.Tracing_Http_TypedConfig{
TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v3.ZipkinConfig{
CollectorCluster: "default/zipkin/9411/da39a3ee5e",
CollectorEndpoint: "/api/v2/spans",
CollectorEndpointVersion: envoy_config_trace_v3.ZipkinConfig_HTTP_JSON,
}),
},
},
},
...
```

An OpenCensus-enabled HCM will include the following as an example:
```go
...
Tracing: &http.HttpConnectionManager_Tracing{
Provider: &envoy_config_trace_v3.Tracing_Http{
Name: "envoy.tracers.opencensus",
ConfigType: &envoy_config_trace_v3.Tracing_Http_TypedConfig{
TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v3.OpenCensusConfig{
OcagentAddress: "otel-collector.default:55678",
OcagentExporterEnabled: true,
}),
},
},
...
```

For Zipkin, an Envoy cluster corresponding to the Zipkin collector must exist. There are a few different options for creating this cluster:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this exactly what ExtensionService is for? I'd prefer to see us reuse that for the second and third options.

Copy link
Member Author

@skriss skriss Jun 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple challenges with using ExtensionService here:

  • ExtensionService is tailored to gRPC services that are part of xDS, i.e. the services listed here, which these are not. Specifically, it includes the ProtocolVersion field and also requires either h2 or h2c.
  • Zipkin doesn't necessarily use gRPC (though it can). Right now, ExtensionService pretty much assumes the service is gRPC.
  • for OpenCensus, the API is gRPC, but Envoy's OC tracer does not allow using the Envoy gRPC client, only the Google gRPC client. As such, it can't be directly targeted at an Envoy cluster; it needs a gRPC address. I'm not sure off the top of my head how we'd make this work.

If we're committed to using ExtensionService here, then I can look more into what changes would be needed to make this work. TBH, I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService, RateLimitService, TracingService, etc., that maybe shared some common configuration but could also vary independently to support the variations among the different services. I know that's a bigger redesign though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService, RateLimitService, TracingService, etc

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last time we discussed this, @skriss was going to work on an updated ExtensionService design (which may include separate *Service objects for each use case). Is that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I've been thinking about it and starting to write up a doc.

Copy link
Member Author

@skriss skriss Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we moved to a CRD per service type (ExtAuthService, RateLimitService, TracingService), we could make this work since each type could be significantly different - and tracing would need to be significantly different than the others given that it's not an xDS gRPC service, and, depending on the provider, does not make use of an Envoy cluster at all (which is effectively what an ExtensionService is right now). However, I think it's going to be difficult to move from the generic ExtensionService to a CRD per service type since it'll require what I believe is a breaking API change for ExtAuth (HTTPProxy currently has an implicitly-typed reference to an ExtensionService), and disruptive/breaking config file changes for rate limiting. If we wanted to do that, it might need to wait for a v2 API/v2 Contour.

Another option would be to add type-specific subfields to ExtensionService; however, because tracing is so different from the existing ExtensionService types, many of the existing fields in ExtensionService don't apply to tracing, so we end up with a messy API where many fields just don't apply in certain circumstances. I'm not a huge fan of this option given where we're starting from -- even though ExtensionService is "alpha", we don't want to break users if we can avoid it, which makes it harder to make in-place changes.

We could introduce just a TracingService CRD in the v1alpha1 group which would give us a chance to experiment with having a CRD per service type, and then look to migrate the other existing service types to specific CRDs down the road. I like this approach in theory, though there are some challenges specifically around OpenCensus since it can't be configured more than once per Envoy lifetime (that would apply even if we used the config file).

Or, we could use the config file to define everything.

Separately, since a main driver for defining an ExtensionService CRD to begin with was that it gave a way to report status, I think we should do that work for the existing service types. We could start simple by reporting whether or not a given ExtensionService has healthy Kubernetes Endpoints for its services, and then decide if we wanted to report more information from there.

Before going much further down any one of these paths, I'd appreciate some input from others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that summary @skriss. I kind of expected that if we were going to look at separate objects, we would start with TracingService and then come back for the others later, I think that is fine. That's why these are alpha, after all. If the way tracing needs to work is that the config file references the TracingService rather than configuring it at an object level, I thin that's fine. Having the TracingService then makes it easier to operate tracing sinks, since you can flip to a new one, check everything is working, then flip back if necessary (although obviously it's not that easy for OpenCensus if it's per-Envoy-lifetime).

I don't like the idea of adding extra fields to ExtensionService because of what you said - although the structs may live on in the new CRDs, I think the use cases are distinct enough to need separate things. We just need to keep an eye on how many we end up with.

I agree that we need to do something with Status for ExtensionService, and that "Are there a nonzero number of Endpoints available" is a good start on a Ready conditrion.

For moving the existing things like ExternalAuth references, we could do something like:

  • add GVK fields to the references, default to ExtensionService in one release.
  • Later, change default to reference AuthService instead. This is a breaking change, but an easy fix for users, just specify the kind.
  • Later, officially announce ExtensionService as deprecated and ask people to move to AuthorizationService.
    Total timeframe for this, I would guess nine to twelve months, however many releases that is.

- user explicitly defines the cluster as part of a custom bootstrap file
- Contour config file takes a Kubernetes service name, and a DAG processor adds a cluster for it to the DAG as a root.
- Contour config file takes an address, and a DAG processor adds a cluster for it to the DAG as a root.

For OpenCensus, Envoy can only be configured once per lifetime (see [this Envoy docs section][6] for information). Attempts to modify the OpenCensus configuration will cause Envoy to NACK config updates which, because of https://github.com/projectcontour/contour/issues/1176, will cause all future config updates to be ignored until Envoy is restarted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I was wrong about #1176 - at some point the behavior has changed and now the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover.


## Alternatives Considered
If there are alternative high level or detailed designs that were not pursued they should be called out here with a brief explanation of why they were not pursued.

- additional providers (could be added down the road if needed, but desire is to standardize on OpenTelemetry eventually)

## Security Considerations
If this proposal has an impact to the security of the product, its users, or data stored or transmitted via the product, they must be addressed here.

## Compatibility
A discussion of any compatibility issues that need to be considered

- Migration from OpenCensus to OpenTelemery in the future

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start with OpenTelemetry in the first place?


## Implementation
A description of the implementation, timelines, and any resources that have agreed to contribute.

## Open Issues
- is there any reason to use ExtensionService for this? (I think not; it's designed for the gRPC extension services and doesn't fit well here)
- OpenCensus can only be configured once per Envoy lifetime - so if a user changes the Contour config for OpenCensus and restarts Contour, Envoy will barf - how to deal with this?
- Zipkin requires an Envoy cluster to exist for the collector - how to get this into the DAG?

## Survey of Other Ingress Controllers

### Ambassador
https://www.getambassador.io/docs/edge-stack/latest/topics/running/services/tracing-service/
- zipkin, lightstep, datadog

### Gloo
https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/
- anything Envoy supports

[1]: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/tracing
[2]: https://opentelemetry.io/
[3]: https://zipkin.io/
[4]: https://www.jaegertracing.io/
[5]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#envoy-v3-api-msg-extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing
[6]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing