Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-647: Update apiserver tracing to GA for 1.30 #4333

Merged
merged 5 commits into from
Nov 21, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-instrumentation/647.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,5 @@ alpha:
approver: "@wojtek-t"
beta:
approver: "@wojtek-t"
stable:
approver: "@wojtek-t"
32 changes: 20 additions & 12 deletions keps/sig-instrumentation/647-apiserver-tracing/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@
- [Exporting Spans](#exporting-spans)
- [Running the OpenTelemetry Collector](#running-the-opentelemetry-collector)
- [APIServer Configuration and EgressSelectors](#apiserver-configuration-and-egressselectors)
- [Controlling use of the OpenTelemetry library](#controlling-use-of-the-opentelemetry-library)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
Expand Down Expand Up @@ -167,10 +166,6 @@ type TracingConfiguration struct {

If `--opentelemetry-config-file` is not specified, the API Server will not send any spans, even if incoming requests ask for sampling.

### Controlling use of the OpenTelemetry library

As the community found in the [Metrics Stability Framework KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/1209-metrics-stability/kubernetes-control-plane-metrics-stability.md#kubernetes-control-plane-metrics-stability), having control over how the client libraries are used in kubernetes can enable maintainers to enforce policy and make broad improvements to the quality of telemetry. To enable future improvements to tracing, we will restrict the direct use of the OpenTelemetry library within the kubernetes code base, and provide wrapped versions of functions we wish to expose in a utility library.

### Test Plan

[X] I/we understand the owners of the involved components may require updates to
Expand All @@ -187,12 +182,16 @@ None.

##### Unit tests

- `staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go`: `10/10/2021`
- `staging/src/k8s.io/component-base/tracing/api/v1/config_test.go`: `10/10/21`
- `staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go`: `10/10/2021` 42.6%
- `staging/src/k8s.io/component-base/tracing/api/v1/config_test.go`: `10/10/2021` 59.0%

##### Integration tests

- ``test/integration/apiserver/tracing/tracing_test.go`
- `test/integration/apiserver/tracing/tracing_test.go`
- TestAPIServerTracingWithKMSv2: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracingWithKMSv2
- TestAPIServerTracingWithEgressSelector: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracingWithEgressSelector
- TestAPIServerTracing: https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=integration&test=TestAPIServerTracing


##### e2e tests

Expand All @@ -207,15 +206,20 @@ Alpha

Beta

- [] Tracing 100% of requests does not break scalability tests (this does not necessarily mean trace backends can handle all the data).
- [X] Tracing 100% of requests does not break scalability tests (this does not necessarily mean trace backends can handle all the data).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple smaller comments:

  1. For testing section:
  • can you please add the coverage numbers (as in the template)
  • for integration tests - can you please link to k8s-triage (as described in the template)
  1. for beta criteria:

Tracing 100% of requests does not break scalability tests (this does not necessarily mean trace backends can handle all the data).

For my own education - how did you do that? I think we didn't modify the periodic tests - should we?

  1. SLIs question - are there any plans to have it addressed on OpenTelemetry?

  2. Scalability section - there is a new question to answer:
    https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc

Copy link
Contributor Author

@dashpole dashpole Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a one-time test with tracing set at 100%: kubernetes/kubernetes#113695 (comment). It isn't continuous. I don't think we should modify the periodic tests. If we did, it should be at a reasonably low sampling rate to make traces available for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SLIs, some in OTel are working on proposals: open-telemetry/oteps#238. I don't know when it will land, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed points 1 and 4 in 4500ad8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Then just one more follow-up on the scalability tests:

  1. Can you maybe link the comment about the passing thread to the KEP itself too
  2. Co you add a sentence that we're consciously not modifying the periodic tests and we will consider sampling at low rate in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 47c37e4

- Verified in a manual run: https://github.com/kubernetes/kubernetes/pull/113695#issuecomment-1307665358. This is not part of periodic tests, although it may be useful for debugging with a low sampling rate in the future.
- [X] OpenTelemetry reaches GA
- [] Publish examples of how to use the OT Collector with kubernetes
- [X] Publish examples of how to use the OT Collector with kubernetes
- [X] Allow time for feedback
- [] Revisit the format used to export spans.
- [] Parity with the old text-based Traces
- [X] Revisit the format used to export spans.
- [X] Parity with the old text-based Traces

GA

- [ ] Publish guidelines for kubernetes components on when and how to add tracing to a component.
- [ ] Graduate the TracingConfiguration component config to v1.
- [ ] Define and document stability guarantees for trace instrumentation.
- [ ] Add support for On-Demand trace collection as described above.

### Upgrade / Downgrade Strategy

Expand Down Expand Up @@ -333,6 +337,10 @@ previous answers based on experience in the field._
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 Mb of memory.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. Collecting and exporter spans does not use additional node resources even when it is failing to connect to the backend.

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
Expand Down
7 changes: 4 additions & 3 deletions keps/sig-instrumentation/647-apiserver-tracing/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -19,12 +19,13 @@ approvers:
- "@lavalamp"
see-also:
replaces:
stage: beta
last-updated: 2022-09-29
latest-milestone: "v1.27"
stage: stable
last-updated: 2023-11-10
latest-milestone: "v1.30"
milestone:
alpha: "v1.22"
beta: "v1.27"
stable: "v1.30"
feature-gates:
- name: APIServerTracing
disable-supported: true
Expand Down