Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design: add tracing design doc #3802

Closed
wants to merge 1 commit into from

Conversation

skriss
Copy link
Member

@skriss skriss commented Jun 11, 2021

Putting this up as a draft in case anyone wants to preview. I"m still actively refining and working out some details.

Closes #3792.
xref #399.

Signed-off-by: Steve Kriss krisss@vmware.com

Closes projectcontour#3792.

Signed-off-by: Steve Kriss <krisss@vmware.com>
@skriss skriss added the kind/design Categorizes issue or PR as related to design. label Jun 11, 2021
@codecov
Copy link

codecov bot commented Jun 11, 2021

Codecov Report

Merging #3802 (a0e0e90) into main (caedf10) will not change coverage.
The diff coverage is n/a.

❗ Current head a0e0e90 differs from pull request most recent head 8a1fed2. Consider uploading reports for the commit 8a1fed2 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3802   +/-   ##
=======================================
  Coverage   75.83%   75.83%           
=======================================
  Files         107      107           
  Lines        7338     7338           
=======================================
  Hits         5565     5565           
  Misses       1652     1652           
  Partials      121      121           


## Non Goals
- **Per-HTTPProxy trace configuration.** There is a technical limitation to enabling per-proxy configuration. Trace settings are configured on the HTTP connection manager (HCM). Each TLS-enabled virtual host uses its own HCM, but *all* non-TLS virtual hosts share a single HCM. As such, while it's possible to have unique trace settings for each TLS virtual host, it's not possible to do the same for non-TLS virtual hosts. Note that this same limitation has come up when designing external authorization and global rate limiting as well.
- **Trace formats other than OpenCensus and Zipkin.** While Contour *may* choose to add direct support for other trace formats in the future, our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy.

+1

Copy link
Member

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, @skriss! Do you think there'll be anything for the alternatives or security concerns sections (fine if no)?

...
```

For Zipkin, an Envoy cluster corresponding to the Zipkin collector must exist. There are a few different options for creating this cluster:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this exactly what ExtensionService is for? I'd prefer to see us reuse that for the second and third options.

Copy link
Member Author

@skriss skriss Jun 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple challenges with using ExtensionService here:

  • ExtensionService is tailored to gRPC services that are part of xDS, i.e. the services listed here, which these are not. Specifically, it includes the ProtocolVersion field and also requires either h2 or h2c.
  • Zipkin doesn't necessarily use gRPC (though it can). Right now, ExtensionService pretty much assumes the service is gRPC.
  • for OpenCensus, the API is gRPC, but Envoy's OC tracer does not allow using the Envoy gRPC client, only the Google gRPC client. As such, it can't be directly targeted at an Envoy cluster; it needs a gRPC address. I'm not sure off the top of my head how we'd make this work.

If we're committed to using ExtensionService here, then I can look more into what changes would be needed to make this work. TBH, I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService, RateLimitService, TracingService, etc., that maybe shared some common configuration but could also vary independently to support the variations among the different services. I know that's a bigger redesign though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService, RateLimitService, TracingService, etc

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last time we discussed this, @skriss was going to work on an updated ExtensionService design (which may include separate *Service objects for each use case). Is that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I've been thinking about it and starting to write up a doc.

Copy link
Member Author

@skriss skriss Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we moved to a CRD per service type (ExtAuthService, RateLimitService, TracingService), we could make this work since each type could be significantly different - and tracing would need to be significantly different than the others given that it's not an xDS gRPC service, and, depending on the provider, does not make use of an Envoy cluster at all (which is effectively what an ExtensionService is right now). However, I think it's going to be difficult to move from the generic ExtensionService to a CRD per service type since it'll require what I believe is a breaking API change for ExtAuth (HTTPProxy currently has an implicitly-typed reference to an ExtensionService), and disruptive/breaking config file changes for rate limiting. If we wanted to do that, it might need to wait for a v2 API/v2 Contour.

Another option would be to add type-specific subfields to ExtensionService; however, because tracing is so different from the existing ExtensionService types, many of the existing fields in ExtensionService don't apply to tracing, so we end up with a messy API where many fields just don't apply in certain circumstances. I'm not a huge fan of this option given where we're starting from -- even though ExtensionService is "alpha", we don't want to break users if we can avoid it, which makes it harder to make in-place changes.

We could introduce just a TracingService CRD in the v1alpha1 group which would give us a chance to experiment with having a CRD per service type, and then look to migrate the other existing service types to specific CRDs down the road. I like this approach in theory, though there are some challenges specifically around OpenCensus since it can't be configured more than once per Envoy lifetime (that would apply even if we used the config file).

Or, we could use the config file to define everything.

Separately, since a main driver for defining an ExtensionService CRD to begin with was that it gave a way to report status, I think we should do that work for the existing service types. We could start simple by reporting whether or not a given ExtensionService has healthy Kubernetes Endpoints for its services, and then decide if we wanted to report more information from there.

Before going much further down any one of these paths, I'd appreciate some input from others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that summary @skriss. I kind of expected that if we were going to look at separate objects, we would start with TracingService and then come back for the others later, I think that is fine. That's why these are alpha, after all. If the way tracing needs to work is that the config file references the TracingService rather than configuring it at an object level, I thin that's fine. Having the TracingService then makes it easier to operate tracing sinks, since you can flip to a new one, check everything is working, then flip back if necessary (although obviously it's not that easy for OpenCensus if it's per-Envoy-lifetime).

I don't like the idea of adding extra fields to ExtensionService because of what you said - although the structs may live on in the new CRDs, I think the use cases are distinct enough to need separate things. We just need to keep an eye on how many we end up with.

I agree that we need to do something with Status for ExtensionService, and that "Are there a nonzero number of Endpoints available" is a good start on a Ready conditrion.

For moving the existing things like ExternalAuth references, we could do something like:

  • add GVK fields to the references, default to ExtensionService in one release.
  • Later, change default to reference AuthService instead. This is a breaking change, but an easy fix for users, just specify the kind.
  • Later, officially announce ExtensionService as deprecated and ask people to move to AuthorizationService.
    Total timeframe for this, I would guess nine to twelve months, however many releases that is.

- Contour config file takes a Kubernetes service name, and a DAG processor adds a cluster for it to the DAG as a root.
- Contour config file takes an address, and a DAG processor adds a cluster for it to the DAG as a root.

For OpenCensus, Envoy can only be configured once per lifetime (see [this Envoy docs section][6] for information). Attempts to modify the OpenCensus configuration will cause Envoy to NACK config updates which, because of https://github.com/projectcontour/contour/issues/1176, will cause all future config updates to be ignored until Envoy is restarted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I was wrong about #1176 - at some point the behavior has changed and now the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover.

@github-actions
Copy link

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2021
## High-Level Design
The Contour configuration file will be extended with a new optional `tracing` section.
This configuration block, if present, will enable tracing and will define the trace format and other properties needed to generate and export trace data.
Initially, the `opencensus` and `zipkin` trace formats will be supported.
Copy link

@pavelnikolov pavelnikolov Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you try to implement OpenTelemetry interface and leave it backend agnostic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavelnikolov as far as I know Envoy does not yet support OpenTelemetry: envoyproxy/envoy#9958

Otherwise, yes,, OTel is the direction we'd prefer. One option here is to just wait on implementing support in Contour for tracing until Envoy does support OTel directly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skriss it's already implemented in Envoy. Are you going to prioritise it now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd like to, we'll see what we can do.

```yaml
tracing:
# Trace format to use. Valid options are opencensus and zipkin.
format: [opencensus, zipkin]
Copy link

@pavelnikolov pavelnikolov Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't care about the format if you use OpenTelemetry interface(s). Then the user can choose whatever exporter (backend) they see fit. What is more Zipkin is one of the supported backends for OpenTelemetry. So instead of format you might want to choose backend exporter here.

## Compatibility
A discussion of any compatibility issues that need to be considered

- Migration from OpenCensus to OpenTelemery in the future

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start with OpenTelemetry in the first place?

@github-actions github-actions bot closed this Aug 28, 2021
@yoitsro
Copy link
Contributor

yoitsro commented Nov 3, 2021

Should this be closed?

@youngnick
Copy link
Member

This is blocked behind a redesign of ExtensionService at the moment, which would move from the base ExtensionService (currently used for both external auth and rate limiting), to having separate CRDs, for AuthenticationService, RateLimitingService, and then add TracingService. The original ExtensionService idea ends up too overloaded with all the stuff that needs to be configured for tracing, so it's better to have a single object type per use-case.

Once we've done that work, it should be reasonably straightforward to add the tracing support. We just have to pay down the tech debt first.

@rajatvig
Copy link
Member

rajatvig commented Feb 3, 2022

@youngnick Is there an issue tracking the ExtensionService refactoring? Could I help?


## Goals
- **Support a single global trace configuration** that applies to all requests.
- **Support OpenCensus and Zipkin trace formats**, since together they enable a large set of common backends to be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenCensus has folded into OpenTelemetry now. The opentelemetry collector can export to Zipkin using an exporter so I do not think supporting either should be needed.

@yangyy93
Copy link
Member

yangyy93 commented Dec 8, 2022

Envoy already supports OpenTelemetry should the design document be updated to only support exporting tracing to OpenTelemetry?
@skriss

@sunjayBhatia
Copy link
Member

@yangyy93 yep I believe so, which should also make our work here a lot easier since it requires a gRPC cluster and not a static cluster or http cluster etc. that other export formats needed I believe

@sunjayBhatia
Copy link
Member

If anyone involved wants to take this over feel free to start a new design PR, can use this as a reference!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/design Categorizes issue or PR as related to design. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

write design doc for tracing support
8 participants