EventedPLEG: Pass event created timestamp correctly to cache #124297

hshiina · 2024-04-12T18:20:04Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

CreatedAt timestamp of ContainerEventResponse should be passed as nanoseconds to time.Unix().

This fix may mitigate #123087. I guess this issue happens as following:
Usually, a container created event is delivered while a pod worker is creating a pod at SyncPod() and the created status is cached by the PLEG. After SyncPod(), the pod worker waits at GetNewerThan() until a next event (container started) happens because the cached container status is older than lastSyncTime. However, due to the bug in this PR, the modified time of the cached status is so much newer because a nanoseconds value is set as seconds. Then, the pod worker gets a created container status from the cache and goes into SyncPod() soon. Then, the worker creates another container.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.

k8s-ci-robot · 2024-04-12T18:20:08Z

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 12 13:49:11 UTC 2024.

k8s-ci-robot · 2024-04-12T18:20:12Z

Welcome @hshiina!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2024-04-12T18:20:13Z

Hi @hshiina. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bart0sh · 2024-04-13T15:03:38Z

/triage accepted
/priority important-longterm
/ok-to-test
/cc

bart0sh · 2024-04-13T15:50:40Z

pkg/kubelet/pleg/evented.go

@@ -264,7 +264,7 @@ func (e *EventedPLEG) processCRIEvents(containerEventsResponseCh chan *runtimeap
 			}
 			shouldSendPLEGEvent = true
 		} else {
-			if e.cache.Set(podID, status, err, time.Unix(event.GetCreatedAt(), 0)) {
+			if e.cache.Set(podID, status, err, time.Unix(0, event.GetCreatedAt())) {


@hshiina great catch! Thank you!
/lgtm

hshiina · 2024-08-26T09:53:20Z

/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

hshiina · 2024-08-26T11:22:17Z

/retest

HirazawaUi · 2024-09-08T13:55:02Z

pkg/kubelet/pleg/evented.go

@@ -125,7 +125,7 @@ func (e *EventedPLEG) Start() {
 	e.stopCh = make(chan struct{})
 	e.stopCacheUpdateCh = make(chan struct{})
 	go wait.Until(e.watchEventsChannel, 0, e.stopCh)
-	go wait.Until(e.updateGlobalCache, globalCacheUpdatePeriod, e.stopCacheUpdateCh)
+	go wait.Until(e.updateGlobalCache, e.globalCacheUpdatePeriod, e.stopCacheUpdateCh)


After we modify the timestamp from seconds to nanoseconds, do we really need to execute updateGlobalCache() at a faster interval?

updateGlobalCache() serves as a last resort for recovery from failures. We shouldn’t perform this action for any pod that can retrieve podStatus normally (because it is not the actual timestamp returned by the runtime),

The 5-second interval comes from the timeout set in NewRemoteRuntimeService. Unless we modify the timeout of NewRemoteRuntimeService to 1 second and assume that if the runtime fails to respond within 1 second, it should be considered in a failure state.

Then, we can use updateGlobalCache() for failure recovery.

Pod workers are expected to be unblocked from GetNewerThan() in one second at pod termination:

kubernetes/pkg/kubelet/pod_workers.go

Lines 1247 to 1256 in f5c5384

default:

// wait until we see the next refresh from the PLEG via the cache (max 2s)

// TODO: this adds ~1s of latency on all transitions from sync to terminating

// to terminated, and on all termination retries (including evictions). We should

// improve latency by making the pleg continuous and by allowing pod status

// changes to be refreshed when key events happen (killPod, sync->terminating).

// Improving this latency also reduces the possibility that a terminated

// container's status is garbage collected before we have a chance to update the

// API server (thus losing the exit code).

status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

Generic PLEG also updates the global cache every second:

kubernetes/pkg/kubelet/pleg/generic.go

Line 356 in f5c5384

g.cache.UpdateTime(timestamp)

However, it might better to remove GetNewerThan() from podWorkerLoop() as mentioned at #124704 (comment). I wonder if I would revert the latter commit in order to fix the apparent bug at first.

Generic PLEG also updates the global cache every second:

This might be because, at this point, we have ensured that all pods in the cache have been correctly updated.

@yujuhong, is my understanding correct? If not, please correct me.

However, for the current updateGlobalCache(), we don’t have such a guarantee, so the same strategy shouldn’t be applied.

However, it might better to remove GetNewerThan() from podWorkerLoop() as mentioned at #124704 (comment). I wonder if I would revert the latter commit in order to fix the apparent bug at first.

You might only fix the timestamp. A more consistent approach would be to also fix:

kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go

Lines 1583 to 1589 in f5c5384

// Get the statuses of all containers visible to the pod and

// timestamp from sandboxStatus.

timestamp = time.Unix(resp.Timestamp, 0)

for _, cs := range resp.ContainersStatuses {

cStatus := m.convertToKubeContainerStatus(cs)

containerStatuses = append(containerStatuses, cStatus)

}

You might want to only fix the timestamp. A more consistent approach would be to also fix:

kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go

Lines 1583 to 1589 in f5c5384

// Get the statuses of all containers visible to the pod and

// timestamp from sandboxStatus.

timestamp = time.Unix(resp.Timestamp, 0)

for _, cs := range resp.ContainersStatuses {

cStatus := m.convertToKubeContainerStatus(cs)

containerStatuses = append(containerStatuses, cStatus)

}

This looks like a different issue (#126414) to me because the runtime (CRI-O) sets this timestamp in seconds. This should be fixed in both kubernetes and CRI-O synchronously.

Making this change does not affect our existing behavior. Referring to the following code, given a unix timestamp in seconds, whether it is kept in seconds or converted to nanoseconds, the behavior will remain consistent.

func main() { if !compare1() && !compare2() { fmt.Printf("Their behavior is consistent") } } func compare2() bool { eventedTime := time.Now().UnixNano() time.Sleep(100 * time.Millisecond) // The initial timestamp is seconds genericTime := time.Now().Unix() cachedEventedTime := time.Unix(0, eventedTime) // Convert to nanoseconds cachedGenericTime := time.Unix(0, genericTime) if cachedEventedTime.Before(cachedGenericTime) { fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime) return true } else { fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime) return false } } func compare1() bool { eventedTime := time.Now().UnixNano() time.Sleep(100 * time.Millisecond) // The initial timestamp is seconds genericTime := time.Now().Unix() cachedEventedTime := time.Unix(0, eventedTime) // Keep in seconds cachedGenericTime := time.Unix(genericTime, 0) if cachedEventedTime.Before(cachedGenericTime) { fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime) return true } else { fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime) return false } }

Therefore, we can first fix kubelet and then fix CRI-O to ensure that this functionality works properly in the end.

it looks to me that CRIO returns nano: https://github.com/cri-o/cri-o/blob/36c2750916eb9e75f01a627463c2cfad207f192d/server/server.go#L934 ? Am I missing something?

The PodSandboxStatus interface in CRI-O returns the timestamp in seconds, and kubelet uses this timestamp in genericPLEG.

crio:
https://github.com/cri-o/cri-o/blob/11934304d11256189df8e316e195af6650a472aa/server/sandbox_status.go#L43-L48

kubelet:

kubernetes/pkg/kubelet/pleg/generic.go

Lines 445 to 482 in 342da50

status, err := g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)

if err != nil {

// nolint:logcheck // Not using the result of klog.V inside the

// if branch is okay, we just use it to determine whether the

// additional "podStatus" key and its value should be added.

if klog.V(6).Enabled() {

klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)

} else {

klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))

}

} else {

if klogV := klog.V(6); klogV.Enabled() {

klogV.InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)

} else {

klog.V(4).InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))

}

// Preserve the pod IP across cache updates if the new IP is empty.

// When a pod is torn down, kubelet may race with PLEG and retrieve

// a pod status after network teardown, but the kubernetes API expects

// the completed pod's IP to be available after the pod is dead.

status.IPs = g.getPodIPs(pid, status)

}

// When we use Generic PLEG only, the PodStatus is saved in the cache without

// any validation of the existing status against the current timestamp.

// This works well when there is only Generic PLEG setting the PodStatus in the cache however,

// if we have multiple entities, such as Evented PLEG, while trying to set the PodStatus in the

// cache we may run into the racy timestamps given each of them were to calculate the timestamps

// in their respective execution flow. While Generic PLEG calculates this timestamp and gets

// the PodStatus, we can only calculate the corresponding timestamp in

// Evented PLEG after the event has been received by the Kubelet.

// For more details refer to:

// https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg#timestamp-of-the-pod-status

if utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG) && isEventedPLEGInUse() && status != nil {

timestamp = status.TimeStamp

}

return err, g.cache.Set(pod.ID, status, err, timestamp)

I have already filed an issue in the CRI-O project, and it should be fixed soon, ref: cri-o/cri-o#8580

This reverts commit 7aa7062.

This reverts commit dbc4734.

hshiina · 2024-09-10T08:18:51Z

Please let me start over.

/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

k8s-ci-robot · 2024-09-10T09:03:54Z

@hshiina: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e	`0289216`	link	false	`/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hshiina · 2024-09-10T10:28:03Z

Now, only Evented PLEG caching is being fixed.

The jobs with EventedPLEG enabled were almost passed because this fix mitigates #123087. (Conversely, this would make it difficult to reproduce #123087.)

In addition, another issue appeared with by fix. There was a failure in the job:

E2eNode Suite: [It] [sig-node] [NodeFeature:SidecarContainers] Containers Lifecycle [It] should not hang in termination if terminated during initialization [sig-node, NodeFeature:SidecarContainers]
{ failed [FAILED] should delete in < 5 seconds, took 10.071795
Expected
    <float64>: 10.071794756
to be <
    <int64>: 5
In [It] at: k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:3470 @ 09/10/24 08:54:31.819
}

With this fix, a pod worker is blocked at GetNewerThan():

kubernetes/pkg/kubelet/pod_workers.go

Line 1256 in d913914

status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

We can see a pod worker blocked at pod termination in kubelet.log. This caused the failure:

I0910 08:54:21.758046	4615 kubelet.go:2423] "SyncLoop DELETE" source="api" pods=["containers-lifecycle-test-6189/dont-hang-if-terminated-in-init"]
I0910 08:54:21.758096	4615 pod_workers.go:854] "Pod is marked for graceful deletion, begin teardown" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="update"
I0910 08:54:21.758123	4615 pod_workers.go:963] "Notifying pod of pending update" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" workType="terminating"
I0910 08:54:21.758145	4615 pod_workers.go:970] "Cancelling current pod sync" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" workType="terminating"
I0910 08:54:21.758180	4615 pod_workers.go:1233] "Processing pod event" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminating"

<The pod worker was being blocked at `GetNewerThan()`>

I0910 08:54:25.234136    4615 kubelet.go:2006] "SyncTerminatingPod enter" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"

<snip: The pod worker ran `SyncTermninatingPod()`>

I0910 08:54:25.736947	4615 kubelet.go:2093] "SyncTerminatingPod exit" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.736961	4615 pod_workers.go:1402] "Pod terminated all containers successfully" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.736962	4615 status_manager.go:227] "Syncing updated statuses"
I0910 08:54:25.737009	4615 pod_workers.go:1513] "Pending update already queued" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.737031	4615 pod_workers.go:1338] "Processing pod event done" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminating"
I0910 08:54:25.737049	4615 pod_workers.go:1233] "Processing pod event" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminated"

<The pod worker was being blocked at `GetNewerThan()`>

I0910 08:54:30.235579	4615 kubelet.go:2151] "SyncTerminatedPod enter" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"

This issue is addressed in #124704. I shouldn't have mixed it in this PR. It would be better to fix it separately.

HirazawaUi · 2024-09-10T16:08:00Z

Due to the rapid action of the CRI-O team, this bug has already been fixed and backported to older versions, which means you can apply the fix.

kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go

Lines 1583 to 1589 in f5c5384

    
           // Get the statuses of all containers visible to the pod and 
        
           // timestamp from sandboxStatus. 
        
           timestamp = time.Unix(resp.Timestamp, 0) 
        
           for _, cs := range resp.ContainersStatuses { 
        
           	cStatus := m.convertToKubeContainerStatus(cs) 
        
           	containerStatuses = append(containerStatuses, cStatus) 
        
           }

ref: cri-o/cri-o#8582

SergeyKanzhelev

/lgtm
/approve

thank you!

k8s-ci-robot · 2024-09-10T16:34:06Z

LGTM label has been added.

Git tree hash: 52a94895c6f7cb0c2e354b53094fa69dab14a195

k8s-ci-robot · 2024-09-10T16:34:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hshiina, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [SergeyKanzhelev]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SergeyKanzhelev · 2024-09-10T17:31:01Z

/retest

Pass event created timestamp correctly to cache

7612858

`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2024

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 12, 2024

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2024

k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 12, 2024

k8s-ci-robot requested review from dchen1107 and krmayankk April 12, 2024 18:20

k8s-ci-robot requested a review from bart0sh April 13, 2024 15:03

bart0sh reviewed Apr 13, 2024

View reviewed changes

k8s-ci-robot assigned bart0sh Apr 13, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2024

Address feedback

7aa7062

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 26, 2024

hshiina mentioned this pull request Sep 5, 2024

fix evented pleg global pod cache update issue #127086

Closed

HirazawaUi mentioned this pull request Sep 6, 2024

EventedPLEG: Set Timestamp in PodStatus for Generic PLEG more accurate #127195

Merged

hshiina changed the title ~~Pass event created timestamp correctly to cache~~ EventedPLEG: Pass event created timestamp correctly to cache Sep 8, 2024

HirazawaUi reviewed Sep 8, 2024

View reviewed changes

hshiina added 2 commits September 10, 2024 10:15

Revert "Address feedback"

05605d7

This reverts commit 7aa7062.

Revert "Update global cache timestamp more frequently"

0289216

This reverts commit dbc4734.

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 10, 2024

SergeyKanzhelev approved these changes Sep 10, 2024

View reviewed changes

k8s-ci-robot assigned SergeyKanzhelev Sep 10, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2024

k8s-ci-robot merged commit 386a6af into kubernetes:master Sep 10, 2024
15 of 16 checks passed

k8s-ci-robot added this to the v1.32 milestone Sep 10, 2024

pacoxu mentioned this pull request Sep 10, 2024

Kubelet Evented PLEG for Better Performance kubernetes/enhancements#3386

Open

24 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EventedPLEG: Pass event created timestamp correctly to cache #124297

EventedPLEG: Pass event created timestamp correctly to cache #124297

hshiina commented Apr 12, 2024 •

edited

Loading

k8s-ci-robot commented Apr 12, 2024

k8s-ci-robot commented Apr 12, 2024

k8s-ci-robot commented Apr 12, 2024

bart0sh commented Apr 13, 2024

bart0sh Apr 13, 2024

hshiina commented Aug 26, 2024

hshiina commented Aug 26, 2024

HirazawaUi Sep 8, 2024

hshiina Sep 8, 2024

HirazawaUi Sep 8, 2024 •

edited

Loading

hshiina Sep 8, 2024

HirazawaUi Sep 9, 2024

SergeyKanzhelev Sep 9, 2024

HirazawaUi Sep 10, 2024

hshiina commented Sep 10, 2024

k8s-ci-robot commented Sep 10, 2024

hshiina commented Sep 10, 2024

HirazawaUi commented Sep 10, 2024

SergeyKanzhelev left a comment

k8s-ci-robot commented Sep 10, 2024

k8s-ci-robot commented Sep 10, 2024

SergeyKanzhelev commented Sep 10, 2024

	default:
	// wait until we see the next refresh from the PLEG via the cache (max 2s)
	// TODO: this adds ~1s of latency on all transitions from sync to terminating
	// to terminated, and on all termination retries (including evictions). We should
	// improve latency by making the pleg continuous and by allowing pod status
	// changes to be refreshed when key events happen (killPod, sync->terminating).
	// Improving this latency also reduces the possibility that a terminated
	// container's status is garbage collected before we have a chance to update the
	// API server (thus losing the exit code).
	status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

	// Get the statuses of all containers visible to the pod and
	// timestamp from sandboxStatus.
	timestamp = time.Unix(resp.Timestamp, 0)
	for _, cs := range resp.ContainersStatuses {
	cStatus := m.convertToKubeContainerStatus(cs)
	containerStatuses = append(containerStatuses, cStatus)
	}

	status, err := g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)
	if err != nil {
	// nolint:logcheck // Not using the result of klog.V inside the
	// if branch is okay, we just use it to determine whether the
	// additional "podStatus" key and its value should be added.
	if klog.V(6).Enabled() {
	klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
	} else {
	klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
	}
	} else {
	if klogV := klog.V(6); klogV.Enabled() {
	klogV.InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
	} else {
	klog.V(4).InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
	}
	// Preserve the pod IP across cache updates if the new IP is empty.
	// When a pod is torn down, kubelet may race with PLEG and retrieve
	// a pod status after network teardown, but the kubernetes API expects
	// the completed pod's IP to be available after the pod is dead.
	status.IPs = g.getPodIPs(pid, status)
	}

	// When we use Generic PLEG only, the PodStatus is saved in the cache without
	// any validation of the existing status against the current timestamp.
	// This works well when there is only Generic PLEG setting the PodStatus in the cache however,
	// if we have multiple entities, such as Evented PLEG, while trying to set the PodStatus in the
	// cache we may run into the racy timestamps given each of them were to calculate the timestamps
	// in their respective execution flow. While Generic PLEG calculates this timestamp and gets
	// the PodStatus, we can only calculate the corresponding timestamp in
	// Evented PLEG after the event has been received by the Kubelet.
	// For more details refer to:
	// https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg#timestamp-of-the-pod-status
	if utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG) && isEventedPLEGInUse() && status != nil {
	timestamp = status.TimeStamp
	}

	return err, g.cache.Set(pod.ID, status, err, timestamp)

EventedPLEG: Pass event created timestamp correctly to cache #124297

EventedPLEG: Pass event created timestamp correctly to cache #124297

Conversation

hshiina commented Apr 12, 2024 • edited Loading

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot commented Apr 12, 2024

k8s-ci-robot commented Apr 12, 2024

k8s-ci-robot commented Apr 12, 2024

bart0sh commented Apr 13, 2024

bart0sh Apr 13, 2024

Choose a reason for hiding this comment

hshiina commented Aug 26, 2024

hshiina commented Aug 26, 2024

HirazawaUi Sep 8, 2024

Choose a reason for hiding this comment

hshiina Sep 8, 2024

Choose a reason for hiding this comment

HirazawaUi Sep 8, 2024 • edited Loading

Choose a reason for hiding this comment

hshiina Sep 8, 2024

Choose a reason for hiding this comment

HirazawaUi Sep 9, 2024

Choose a reason for hiding this comment

SergeyKanzhelev Sep 9, 2024

Choose a reason for hiding this comment

HirazawaUi Sep 10, 2024

Choose a reason for hiding this comment

hshiina commented Sep 10, 2024

k8s-ci-robot commented Sep 10, 2024

hshiina commented Sep 10, 2024

HirazawaUi commented Sep 10, 2024

SergeyKanzhelev left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Sep 10, 2024

k8s-ci-robot commented Sep 10, 2024

SergeyKanzhelev commented Sep 10, 2024

hshiina commented Apr 12, 2024 •

edited

Loading

HirazawaUi Sep 8, 2024 •

edited

Loading