
Trace context propagation #2312

Closed
wants to merge 13 commits into from

Conversation


@zhijianli88 zhijianli88 commented Jan 22, 2021

This KEP is a rewrite of Propose log tracking KEP based on this suggestion.

In brief, the previous KEP covered two topics: propagating trace context and adding context information to logs. That scope led to a lot of discussion and made it hard to move forward, so this KEP is rewritten to focus only on a generic trace context propagation proposal, as suggested.

Old PR #1961
Issue: #1668

zhijianli88 and others added 6 commits January 21, 2021 14:27
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Signed-off-by: Guangwen Feng <fenggw-fnst@cn.fujitsu.com>
Signed-off-by: Guangwen Feng <fenggw-fnst@cn.fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 22, 2021
@k8s-ci-robot
Contributor

Welcome @zhijianli88!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @zhijianli88. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 22, 2021
@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 22, 2021
@kikisdeliveryservice
Member

Related to: #1668

@ehashman
Member

/retitle Trace context propagation

@k8s-ci-robot k8s-ci-robot changed the title Trace context popagating Trace context propagation Jan 26, 2021
Member

@ehashman ehashman left a comment


I'll try to take a look at this later this week.

@ehashman
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2021
Signed-off-by: Guangwen Feng <fenggw-fnst@cn.fujitsu.com>

### Out-of-tree changes

#### Mutating webhook
@dashpole
Contributor


This also needs to address the problem of status updates. Status updates are writes, but don't indicate a change in desired state. For example, if we have the following sequence of events

  1. Create Deployment
  2. Update Deployment.Replicas from 1 -> 2
  3. Status update after creation

Since the status update (3) is done with the trace context from (1), it will overwrite the trace context from (2), which I don't think we want.

@zhijianli88
Author


Status updates are writes, but don't indicate a change in desired state

Sorry, I cannot understand the problem you mentioned. I'm not very clear about "status update"; may I know which components make the "status update" request? I used to think (I hope) that only the two events below happen.

1. kubectl create deployment
deployment traceID: AAAA

replicaset traceID: AAAA

pod traceID: AAAA

2. kubectl edit replicas: 1 -> 2
deployment traceID: BBBB

replicaset traceID: BBBB

pod traceID: AAAA
(remains the same because the old pod is not recreated)

pod2 traceID: BBBB

I don't know when and how (3) happens. Could you explain in more detail?

Contributor

@dashpole dashpole Feb 4, 2021


I'll try to be more specific: the kube-controller-manager does the following (very simplified):

  1. Observe that a deployment is created
  2. Create a replicaset based on the deployment
  3. Observe the status of the replicaset, when it changes... update the deployment's status. For example, it might update the availableReplicas field after a replicaset replica becomes available.

There are a few approaches I can think of that we could take to solve this problem:

  1. Don't propagate context (using WithObject) to status update requests. This means status updates are not linked back to the initial kubectl call.
  2. Make the webhook NOT write the annotation if the update is a status update. For example, we could exclude all calls to subresources using the resources rule in the admission registration config.
  3. Propagate baggage in addition to the trace context in annotations. Add a key in baggage, e.g. k8s.io/nonmutating: true to indicate that an API call, like a status update, isn't changing the desired state of an object. Since baggage is propagated along with the trace context to the webhook, the webhook can skip writing the annotation when baggage contains our key above.
    We could provide a helper function to do this, similar to WithObject():
ctx = WithObject(ctx, obj)
ctx = WithMutating(ctx, false)
client.UpdateFooStatus(ctx, objstatus)

We could go even further, and move the WithMutating function inside UpdateFooStatus...

It's probably best to start with option 1 or 2, as they are simpler.
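
For illustration only (not part of this PR), here is a minimal Go sketch of what a WithMutating helper along the lines of option 3 could look like, assuming OpenTelemetry baggage is used for propagation. The package name, the IsMutating helper, and the baggage key are assumptions; the suggested "k8s.io/nonmutating" key is not a valid W3C baggage token, so a dash-separated variant is used here:

```go
package apitrace

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// Hypothetical baggage key; the discussion above suggests something like
// "k8s.io/nonmutating", but W3C baggage keys must be tokens, so "/" is avoided.
const nonMutatingKey = "k8s.io-nonmutating"

// WithMutating marks the outgoing request context as non-mutating by adding a
// baggage member, so a webhook receiving the propagated context could skip
// persisting the trace context for calls (such as status updates) that do not
// change desired state.
func WithMutating(ctx context.Context, mutating bool) context.Context {
	if mutating {
		return ctx // only non-mutating calls need the marker
	}
	member, err := baggage.NewMember(nonMutatingKey, "true")
	if err != nil {
		return ctx
	}
	bag, err := baggage.FromContext(ctx).SetMember(member)
	if err != nil {
		return ctx
	}
	return baggage.ContextWithBaggage(ctx, bag)
}

// IsMutating reports whether the context is missing the non-mutating marker.
func IsMutating(ctx context.Context) bool {
	return baggage.FromContext(ctx).Member(nonMutatingKey).Value() != "true"
}
```

The caller side would then look like the pseudocode above: ctx = WithMutating(ctx, false) before client.UpdateFooStatus(ctx, objstatus).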


Understood. Thank you so much for the detailed explanation!
I prefer option 2, as it will bring fewer in-tree changes.

@ehashman
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jan 28, 2021
@fenggw-fnst

Here are two examples showing how trace contexts change if we use the above method.

Trace context propagation example for creation:

  1. Assume user A creates a Deployment, and user B concurrently patches it before the controller reconciles:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2      
spec:
status:
  observedGeneration: 0
  2. The controller creates a ReplicaSet according to the Deployment and propagates the traceContexts whose traceGeneration > observedGeneration, so the reconcile represents the result of both kubectl-create and kubectl-patch:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 0
  3. When the controller updates the Deployment's status, it similarly propagates the traceContexts whose traceGeneration > observedGeneration, until the observedGeneration is updated (observedGeneration == generation == the greatest traceGeneration):
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
  4. The controller creates Pods according to the ReplicaSet and propagates the traceContexts whose traceGeneration > observedGeneration; kubelet then updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
spec:
status:
  5. When the controller updates the ReplicaSet's status, it is similar to the Deployment: it propagates the traceContexts whose traceGeneration > observedGeneration, until the observedGeneration is updated (observedGeneration == generation == the greatest traceGeneration):
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1
  6. After the desired state has been reconciled (observedGeneration == generation == the greatest traceGeneration), the controller updates the status of the Deployment and ReplicaSet with the latest traceContext:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1

Trace context propagation example for update:

  1. Assume user A scales the Deployment's replicas from 4 to 8, and user B concurrently edits it to an invalid image before the controller reconciles:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 2
  2. The controller creates a new ReplicaSet with the Deployment's traceContexts whose traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 0
  3. Since replicas has been set to 8, the controller will also scale up the old ReplicaSet to 6 (assuming default maxSurge and maxUnavailable) with the Deployment's traceContexts whose traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 1
  4. After the controller has reconciled the intents for the Deployment, it will also update its status with the traceContexts whose traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 4
  5. For the new ReplicaSet, the controller will try to create 4 Pods at first (assuming default maxSurge and maxUnavailable); because the image is invalid, Pod creation will actually get stuck. The controller propagates the new ReplicaSet's traceContexts whose traceGeneration > observedGeneration to these Pods, and kubelet then updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
spec:
status:
  6. The controller will then also update the ReplicaSet's status with the traceContexts whose traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1
  7. For the old ReplicaSet, the controller propagates the traceContexts whose traceGeneration > observedGeneration to the Pods scaled up, and kubelet then updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
spec:
status:
  8. The controller will also update the old ReplicaSet's status with the Deployment's traceContexts whose traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
  9. After observedGeneration == generation == the greatest traceGeneration, any further status update for the Deployment and ReplicaSet will propagate the traceContext with the latest timestamp.
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 4

The new ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1

The old ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2

@dashpole
Contributor

For 1, we can use generation and observedGeneration to find out the correct trace contexts for propagation.

In some objects, such as Deployment or ReplicaSet, the generation is incremented by 1 for every spec update. We can make use of this to add a similar field, "traceGeneration", in managedFieldsEntry to track the created/updated trace contexts. When the controller reconciles multiple actors' intents, we propagate the trace contexts whose traceGeneration > observedGeneration.

IIUC, this is essentially the generation of that managedFields entry, and isn't necessarily specific to tracing. It would allow you to answer the question: Which fields have been changed, and are not reflected in the status?

The only potential problems I can see are:

  • observedGeneration is a convention, not a requirement for objects. So it may work for the built-in objects, but we can't use it generically.
  • If there are multiple controllers that operate on an object and update status, only one of them generally updates observedGeneration. One example is something like a pod readiness gate, where a controller can perform "setup" for a pod, such as configuring network, and then updates a pod condition to indicate that portion of setup is complete.
  • You say we will "propagate the trace contexts". The w3c spec for trace context only allows for a single one to be propagated. Would we be defining our own standard for this?

@fenggw-fnst

IIUC, this is essentially the generation of that managedFields entry, and isn't necessarily specific to tracing. It would allow you to answer the question: Which fields have been changed, and are not reflected in the status?

Yes, this way we can find out the proper managedFields entries, where the trace contexts are what we need to propagate for the controller reconcile.

observedGeneration is a convention, not a requirement for objects. So it may work for the built-in objects, but we can't use it generically.

Yes, this method doesn't work for those custom resources or operators that don't implement observedGeneration, but I'm not sure whether this is an obstacle for advancing this proposal. Without observedGeneration I don't know how to reliably figure out the intents that haven't been reflected...

If there are multiple controllers that operate on an object and update status, only one of them generally updates observedGeneration. One example is something like a pod readiness gate, where a controller can perform "setup" for a pod, such as configuring network, and then updates a pod condition to indicate that portion of setup is complete.

Sorry, maybe I haven’t understood your concern correctly, could you please explain it in more detail?

  • In this idea, before observedGeneration is updated to generation, we use the trace contexts whose traceGeneration > observedGeneration.
  • For the reconcile that updates observedGeneration to generation, we use the trace contexts whose traceGeneration > observedGeneration.
  • After observedGeneration == generation, we use the trace context with the newest timestamp.

Do you mean that there are other situations that conflict with the above?

You say we will "propagate the trace contexts". The w3c spec for trace context only allows for a single one to be propagated. Would we be defining our own standard for this?

No, actually we just use trace IDs as a kind of global request ID for logging. By "propagate the trace contexts" I mean that we transfer the trace contexts as a set via baggage to the apiserver, implement our own inject and extract methods, and write them to the target objects via a Go context. Is it acceptable to use OpenTelemetry's framework without really doing tracing?
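
To make that concrete, here is a minimal sketch (not from the PR) of how a set of trace IDs could be carried via OpenTelemetry baggage alongside the regular W3C trace context header. The baggage key, the "_" separator, and the Inject/Extract helper names are assumptions for illustration:

```go
package tracecontexts

import (
	"context"
	"strings"

	"go.opentelemetry.io/otel/baggage"
)

// Hypothetical baggage key carrying the set of trace IDs. "_" is used as the
// separator because it is a valid W3C baggage value character, unlike ",".
const traceContextsKey = "k8s-trace-contexts"

// Inject stores the set of trace IDs in the context's baggage so that they are
// propagated to the apiserver along with the regular trace context header.
func Inject(ctx context.Context, traceIDs []string) context.Context {
	member, err := baggage.NewMember(traceContextsKey, strings.Join(traceIDs, "_"))
	if err != nil {
		return ctx
	}
	bag, err := baggage.FromContext(ctx).SetMember(member)
	if err != nil {
		return ctx
	}
	return baggage.ContextWithBaggage(ctx, bag)
}

// Extract reads the set of trace IDs back out of the context's baggage.
func Extract(ctx context.Context) []string {
	v := baggage.FromContext(ctx).Member(traceContextsKey).Value()
	if v == "" {
		return nil
	}
	return strings.Split(v, "_")
}
```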

@dashpole
Contributor

Sorry, maybe I haven’t understood your concern correctly, could you please explain it in more detail?

Many objects have observedGeneration. But there isn't guaranteed to be a single controller operating on an object. If there are two controllers, A and B, and both look at the object's spec, and update different fields in the object's status, then observedGeneration will only be accurate for the controller that writes to observedGeneration. If A is the controller that updates observedGeneration, B might attach a different set of trace contexts depending on whether B operates before or after A updates observedGeneration.

by "propagate the trace contexts" I mean that we transfer the trace contexts as a set via baggage to apiserver, implement our own inject and extract methods, and write to the target objects via golang context.

This is just context propagation, and isn't necessarily about tracing. Are all trace contexts sent via baggage, or is one still sent via the trace context header?

@fenggw-fnst

Many objects have observedGeneration. But there isn't guaranteed to be a single controller operating on an object. If there are two controllers, A and B, and both look at the object's spec, and update different fields in the object's status, then observedGeneration will only be accurate for the controller that writes to observedGeneration. If A is the controller that updates observedGeneration, B might attach a different set of trace contexts depending on whether B operates before or after A updates observedGeneration.

Ah, thanks, I see.
IMHO, although it depends on whether B operates before or after A updates observedGeneration, the trace contexts are actually the same.

  1. If B operates before A updates observedGeneration, then we choose the trace contexts whose traceGeneration > observedGeneration.
  2. If B operates after A updates observedGeneration, then we choose the one with the newest timestamp, which is a set consisting of the trace contexts from 1.

This is just context propagation, and isn't necessarily about tracing. Are all trace contexts sent via baggage, or is one still sent via the trace context header?

Sorry, I may have been confused before; now I understand.
Yes, one should still be sent via the trace context header. We use both the W3C trace context and the baggage. Since we have attached the trace contexts we need as a set, for the W3C trace context we just use any one of them at random, or the latest one.

@fenggw-fnst

There are possible concerns about the traceGeneration idea:

  1. The cost could be high if an extra write is required for each reconcile loop.
  2. Previously, we could just search logs for a single traceID string, but in order to solve the multi-writer problem we decided to use managedFields, which allows multiple traceIDs in an object. Things have become more complex, so we should figure out how one would reconstruct the "what happened for this request" flow.

For 1, I heard that there is another idea, which is to create an object to record the linked contexts. It works for the case where multiple traceIDs caused the reconcile, but it may also require an extra write for each reconcile loop. In comparison, the traceGeneration idea, which aims at logging, simply propagates the trace contexts as a data set along with the managedFields mechanism, so I don't think this will cause too much cost.

For 2, in fact, things will not become more complex than before. Although there will be many traceIDs in an object, "which ones should be output to the log" depends on "which root request caused this log". I think the traceGeneration idea also works for this, i.e. before reconcile we output the traceIDs whose traceGeneration > observedGeneration; after reconcile (meaning there is no traceGeneration greater than observedGeneration), we just output the traceIDs with the newest timestamp. That way we can figure out "what happened for a request" by grepping its traceID, and we can also find the root request when errors occur.

@fenggw-fnst

Here is an additional introduction on how the traceGeneration idea can be used for logging:

[Diagrams: logEnhancementChartP1, logEnhancementChartP2, logEnhancementChartP3]

@dashpole
Contributor

A note on the traceGeneration idea:

When controller reconciles multiple actors' intents, we propagate the trace contexts that are traceGeneration > observedGeneration.

I'm not sure this always works. For example, a controller may fail to do something, but it still updates the observedGeneration in the status. The next reconcile will still be operating on the previous intent, so it wouldn't be correct to drop it.

Thanks for the diagrams. A few questions:

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.
  2. Deployment -> replicaset -> pod forms a tree, with a single deployment making multiple RSs, each of which makes multiple pods. That means each action (e.g. create replicaset) is always the result of one object (e.g. the deployment). That makes your case simpler, because you only have to care about contexts from a single parent object. In cases where objects do not form a tree, it seems like we could easily get an explosion of contexts. E.g. kube-proxy, when it updates iptables rules on a node, does so based on all cluster-IP services. In the worst case, you could have an inverse tree, where object A is based on multiple object Bs, each of which is based on multiple object Cs.

@fenggw-fnst

Thanks for the review.

I'm not sure this always works. For example, a controller may fail to do something, but it still updates the observedGeneration in the status. The next reconcile will still be operating on the previous intent, so it wouldn't be correct to drop it.

The traceGeneration idea:

  1. If there are trace contexts with traceGeneration > observedGeneration, then we propagate the trace contexts that meet this condition.
  2. If 1 is not satisfied, then we propagate the trace contexts with the latest timestamp.

In the above example, rule 2 still works. If the controller fails to do something but still updates the observedGeneration, then this "status update" will propagate the trace contexts to the managedFieldsEntry that contains observedGeneration, which will then have the latest timestamp. In the next reconcile, those trace contexts can still be propagated correctly by rule 2.
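
A small Go sketch of the selection rule described above (illustrative only; traceContext and traceGeneration are proposed fields, not part of the real ManagedFieldsEntry type, so a hypothetical struct is used):

```go
package tracecontexts

import "time"

// traceEntry is a hypothetical, simplified view of a managedFields entry
// carrying the proposed traceContext/traceGeneration data.
type traceEntry struct {
	TraceContexts   []string
	TraceGeneration int64
	Time            time.Time
}

// selectTraceContexts returns the trace contexts to propagate: all entries with
// traceGeneration > observedGeneration, or, if there are none, the contexts of
// the entry with the latest timestamp.
func selectTraceContexts(entries []traceEntry, observedGeneration int64) []string {
	var selected []string
	var latest *traceEntry
	for i := range entries {
		e := &entries[i]
		if e.TraceGeneration > observedGeneration {
			selected = append(selected, e.TraceContexts...)
		}
		if latest == nil || e.Time.After(latest.Time) {
			latest = e
		}
	}
	if len(selected) > 0 {
		return selected
	}
	if latest == nil {
		return nil
	}
	return latest.TraceContexts
}
```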

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.

Yes, we can improve our method like this:
When the controller reconcile propagates trace contexts between different managedFieldsEntry (the managedFields containing meta/spec or status) within a single object, do not copy them but move them (clean up the old ones).

This will ensure that:

  • Before the controller reconcile, the newest trace contexts are attached to the managedFieldsEntry that contain metadata or spec; each managedFieldsEntry only contains one trace context representing one request, so the number depends on how many requests act on an object at one time.
  • After the controller reconcile, there is only one traceContext slice containing the above trace contexts in the managedFieldsEntry that contains status (the old traceContext will be overwritten if it exists).

So the number only depends on how many concurrent requests act on different parts of an object. Sorry for my lack of production-level experience, but I guess in most cases there is only one request at a time. Even if the multiple-writers case happens, the number of requests should not be that many. Even in the worst case, the upper limit should be the number of fields that can be modified in this object.
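
A rough sketch of the "move, don't copy" behaviour described above (illustrative only; the fieldEntry struct and the helper are hypothetical stand-ins for the proposed managedFields extension):

```go
package tracecontexts

// fieldEntry is a hypothetical, simplified managedFields entry.
type fieldEntry struct {
	Manager       string
	IsStatus      bool // true for the status managedFieldsEntry
	TraceContexts []string
}

// moveTraceContexts moves trace contexts from the meta/spec entries into the
// status entry on a status update: the meta/spec entries are cleaned up and the
// status entry's slice is overwritten (not appended to), de-duplicating on the way.
func moveTraceContexts(entries []fieldEntry) []fieldEntry {
	seen := map[string]bool{}
	var collected []string
	for i := range entries {
		if entries[i].IsStatus {
			continue
		}
		for _, tc := range entries[i].TraceContexts {
			if !seen[tc] {
				seen[tc] = true
				collected = append(collected, tc)
			}
		}
		entries[i].TraceContexts = nil // clean up the old ones
	}
	for i := range entries {
		if entries[i].IsStatus {
			entries[i].TraceContexts = collected // overwrite any stale slice
		}
	}
	return entries
}
```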

  1. Deployment -> replicaset -> pod forms a tree, with a single deployment makes multiple RS, which each make multiple pods. That means each action (e.g. create replicaset) is always the result of one object (e.g. the deployment). That makes your case simpler, because you only have to care about contexts from a single parent object. In cases where objects do not form a tree, it seems like we could easily get an explosion of contexts. E.g. kube-proxy, when it updates iptable rules on a node, does so based on all cluster ip services. In the worst case, you could have an inverse tree, where object A is based on multiple object Bs, each of which is based on multiple object Cs.

Thanks for pointing this out. I think the key point you raised is:

  • When propagating trace contexts across different resources, how do we deal with the case where a controller reconcile is based on multiple objects?

We haven't considered this before, but I feel that this is out of the scope of this proposal. Since our main purpose is to track the root request made by users in order to ease investigation in logs, we mainly focus on the workload-related resources that users usually act on.

@fenggw-fnst

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.

Yes, we can improve our method like this:

When coding the PoC, we found that if we can drop the old trace contexts while persisting the new ones into an object, then we don't even need to use traceGeneration or timestamps to determine the trace contexts that need to be propagated, because for every reconcile all the trace contexts in the object need to be propagated.

@ehashman
Member

ehashman commented Sep 1, 2021

/unassign

@dashpole
Contributor

dashpole commented Sep 9, 2021

The traceGeneration idea:

  1. If there are trace contexts with traceGeneration > observedGeneration, then we propagate the trace contexts that meet this condition.
  2. If 1 is not satisfied, then we propagate the trace contexts with the latest timestamp.

In the above example, rule 2 still works. If the controller fails to do something but still updates the observedGeneration, then this "status update" will propagate the trace contexts to the managedFieldsEntry that contains observedGeneration, which will then have the latest timestamp. In the next reconcile, those trace contexts can still be propagated correctly by rule 2.

This essentially "falls back" to only storing a single trace context when something fails. E.g. if two concurrent updates, A and B, are done and fail, any subsequent retries will only be attributed to the last update. When things are failing is probably when it would be most important for me to have these links.

The other thing is that when something is failing to reconcile, it may fail for a while. E.g. if a service has an outage for 10 minutes, it will fail a few times. This increases the probability that something will happen "concurrently", because the time window over which something is reconciled is expanded.

When controller reconcile propagates trace contexts between different managedFieldsEntry (the managedFields containing meta/spec or status) within a single object, do not copy them but move them. (clean up old ones)

IIUC, this essentially amounts to de-duplicating trace contexts (i.e. not having a single trace context present in multiple managedFields entries), which is useful, but I'm primarily concerned about cases where we have a large number of unique trace contexts. This would be the case where a controller action takes multiple unique objects into account.

Even the worst case, the upper limit should be the number of the fields that can be modified in this object.

That is my conclusion as well. But I don't think that's an acceptable upper bound...

we haven't considered this before, but I feel that this is out of scope of this proposal, since our main purpose is to track root request made by users in order to ease the investigation in log, so we mainly focus on the workload related resources that user usually act on.

What is the scope then? Just a few objects (e.g. deployment, replicaset, pod)? A fairly substantial change like this is hard to justify for a small number of objects.

@fenggw-fnst

This essentially "falls back" to only storing a single trace context when something fails. E.g. if two concurrent updates, A and B, are done and fail, any subsequent retries will only be attributed to the last update. When things are failing is probably when it would be most important for me to have these links.

Sorry, it seems that I didn't explain it clearly before; it does not "fall back" to only storing a single trace context.

For example, a controller may fail to do something, but it still updates the observedGeneration in the status.
E.g. if two concurrent updates, A and B, are done and fail

After "updates the observedGeneration in the status", A and B should have already been stored as a slice "A, B" in another managedFieldsEntry (a status managedFieldsEntry), so "any subsequent retries" will propagate "A, B" rather than just B.

The other thing is that when something is failing to reconcile, it may fail for a while. E.g. if a service has an outage for 10 minutes, it will fail a few times. This increases the probability that something will happen "concurrently", because the time window over which something is reconciled is expanded.

Thanks for pointing this out; yes, the possibility of concurrent requests is high in some cases. When something happens "concurrently", we are able to propagate all of them correctly. Whether or not A and B are done and fail, there will be only 2 cases:

  1. A and B are still in the meta&spec managedFieldsEntry.
  2. A and B have been stored in the status managedFieldsEntry as "A, B" (observedGeneration updated).

For 1, A and B will be propagated.
For 2, "A, B" will be propagated.

IIUC, this essentially amounts to de-duplicating trace contexts (i.e. not having a single trace context present in multiple managedfields entries), which is useful,

Yes, in addition to the dedupe, we also remove the stale ones (the trace contexts that have been moved to the status managedFieldsEntry by a status update) when new intents come in (i.e. new trace contexts are attached to the meta & spec managedFieldsEntry); therefore a long-running object will tend to have only 1 trace context.

but i'm primarily concerned about cases where we have a large number of unique trace contexts. This would be the case where a controller action takes multiple unique objects into account.
That is my conclusion as well. But I don't think thats an acceptable upper bound...

Although the number may reach the upper bound at some point, new requests will flush the old trace contexts, so there will not always be so many trace contexts in an object. This is not a solution for the upper bound, but it can mitigate the large number of unique trace contexts. IMHO the upper-bound concern is inevitable as long as we attach to managedFields...

@dashpole
Contributor

After "updates the observedGeneration in the status", A and B should have already been stored as a slice "A, B" in another managedFields (a status managedFieldsEntry), the "any subsequent retries" will propagate "A, B" rather than B.

new requests will flush the old trace contexts, so there will not always be so many trace contexts in an object

These two seem contradictory to me. Either B will "flush" A (which may or may not be in-progress), or it won't. How do we determine if a new incoming trace ID will be added to the list of trace ids, or if it will replace the current list of trace ids on the object?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2021
@KobayashiD27

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
