
EventedPLEG: Pass event created timestamp correctly to cache #124297

Merged: 5 commits, Sep 10, 2024

Conversation

@hshiina (Contributor) commented Apr 12, 2024:

What type of PR is this?

/kind bug

What this PR does / why we need it:

CreatedAt timestamp of ContainerEventResponse should be passed as nanoseconds to time.Unix().

This fix may mitigate #123087. I suspect that issue happens as follows:
Usually, a container-created event is delivered while a pod worker is creating a pod in SyncPod(), and the created status is cached by the PLEG. After SyncPod(), the pod worker waits at GetNewerThan() until the next event (container started) arrives, because the cached container status is older than lastSyncTime. However, due to the bug fixed in this PR, the modified time of the cached status ends up far in the future because a nanosecond value is interpreted as seconds. The pod worker therefore immediately gets the created-container status from the cache, re-enters SyncPod(), and creates another container.
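For illustration, here is a small, self-contained snippet (mine, not part of this PR) showing what happens when a nanosecond timestamp is fed to the seconds argument of time.Unix() instead of the nanoseconds argument:

package main

import (
	"fmt"
	"time"
)

func main() {
	// CRI container events carry CreatedAt in nanoseconds.
	createdAt := time.Now().UnixNano()

	wrong := time.Unix(createdAt, 0) // nanoseconds misread as seconds: billions of years in the future
	right := time.Unix(0, createdAt) // nanoseconds passed correctly: roughly "now"

	fmt.Println("wrong:", wrong.UTC())
	fmt.Println("right:", right.UTC())
}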

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


`CreatedAt` timestamp of `ContainerEventResponse` should be passed as
nanoseconds to `time.Unix()`.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2024
@k8s-ci-robot (Contributor):

Please note that we're already in Test Freeze for the release-1.30 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.30.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 12 13:49:11 UTC 2024.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 12, 2024
@k8s-ci-robot (Contributor):

Welcome @hshiina!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2024
@k8s-ci-robot (Contributor):

Hi @hshiina. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 12, 2024
@bart0sh (Contributor) commented Apr 13, 2024:

/triage accepted
/priority important-longterm
/ok-to-test
/cc

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 13, 2024
@@ -264,7 +264,7 @@ func (e *EventedPLEG) processCRIEvents(containerEventsResponseCh chan *runtimeap
 			}
 			shouldSendPLEGEvent = true
 		} else {
-			if e.cache.Set(podID, status, err, time.Unix(event.GetCreatedAt(), 0)) {
+			if e.cache.Set(podID, status, err, time.Unix(0, event.GetCreatedAt())) {
Contributor:

@hshiina great catch! Thank you!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2024
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 26, 2024
@hshiina (Contributor, Author) commented Aug 26, 2024:

/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

@hshiina (Contributor, Author) commented Aug 26, 2024:

/retest

@hshiina hshiina changed the title Pass event created timestamp correctly to cache EventedPLEG: Pass event created timestamp correctly to cache Sep 8, 2024
@@ -125,7 +125,7 @@ func (e *EventedPLEG) Start() {
 	e.stopCh = make(chan struct{})
 	e.stopCacheUpdateCh = make(chan struct{})
 	go wait.Until(e.watchEventsChannel, 0, e.stopCh)
-	go wait.Until(e.updateGlobalCache, globalCacheUpdatePeriod, e.stopCacheUpdateCh)
+	go wait.Until(e.updateGlobalCache, e.globalCacheUpdatePeriod, e.stopCacheUpdateCh)
Contributor:

After we change the timestamp from seconds to nanoseconds, do we really need to run updateGlobalCache() at a faster interval?

updateGlobalCache() serves as a last resort for recovery from failures. We shouldn't apply it to pods that can retrieve their podStatus normally, because the timestamp it sets is not the actual timestamp returned by the runtime.

The 5-second interval comes from the timeout set in NewRemoteRuntimeService. A 1-second interval would only make sense if we also reduced the NewRemoteRuntimeService timeout to 1 second and treated a runtime that fails to respond within that second as being in a failure state.

Then, we could use updateGlobalCache() purely for failure recovery.
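(For context, a rough sketch of the pattern being debated; this is my own simplification, not the kubelet code. fakeCache is a made-up stand-in for the pod status cache, and the example assumes k8s.io/apimachinery is available for wait.Until.)

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// fakeCache stands in for the kubelet pod status cache; UpdateTime releases
// any waiter whose lastSyncTime is older than t.
type fakeCache struct{}

func (fakeCache) UpdateTime(t time.Time) { fmt.Println("global cache bumped to", t) }

func main() {
	stopCh := make(chan struct{})

	// The last-resort path: bump the global cache timestamp periodically so
	// that blocked pod workers are eventually released even if no CRI event
	// arrives for their pod. The period is the interval under discussion
	// (currently 5 seconds).
	go wait.Until(func() {
		fakeCache{}.UpdateTime(time.Now())
	}, 5*time.Second, stopCh)

	time.Sleep(12 * time.Second)
	close(stopCh)
}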

Contributor Author:

Pod workers are expected to be unblocked from GetNewerThan() within one second during pod termination:

default:
	// wait until we see the next refresh from the PLEG via the cache (max 2s)
	// TODO: this adds ~1s of latency on all transitions from sync to terminating
	// to terminated, and on all termination retries (including evictions). We should
	// improve latency by making the pleg continuous and by allowing pod status
	// changes to be refreshed when key events happen (killPod, sync->terminating).
	// Improving this latency also reduces the possibility that a terminated
	// container's status is garbage collected before we have a chance to update the
	// API server (thus losing the exit code).
	status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

Generic PLEG also updates the global cache every second:

g.cache.UpdateTime(timestamp)

However, it might be better to remove GetNewerThan() from podWorkerLoop(), as mentioned at #124704 (comment). I wonder whether I should revert the latter commit and fix only the apparent bug first.
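(To make the blocking behavior concrete, here is a minimal sketch of GetNewerThan()-style semantics; it is my own simplification, not the kubelet cache implementation. An entry stamped with a bogus far-future time is returned immediately, while a correctly stamped entry makes the worker wait for the next event.)

package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a stand-in for a cached pod status plus the time it was stored.
type entry struct {
	status   string
	modified time.Time
}

// cache mimics the GetNewerThan() contract: readers block until the entry
// for a pod is newer than their lastSyncTime.
type cache struct {
	mu   sync.Mutex
	cond *sync.Cond
	data map[string]entry
}

func newCache() *cache {
	c := &cache{data: map[string]entry{}}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Set stores a status stamped with the timestamp of the event that produced it.
func (c *cache) Set(uid, status string, modified time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[uid] = entry{status, modified}
	c.cond.Broadcast()
}

// GetNewerThan blocks until the cached entry for uid is newer than minTime.
func (c *cache) GetNewerThan(uid string, minTime time.Time) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	for {
		if e, ok := c.data[uid]; ok && e.modified.After(minTime) {
			return e.status
		}
		c.cond.Wait()
	}
}

func main() {
	c := newCache()

	createdAt := time.Now().UnixNano() // container-created event, in nanoseconds
	lastSyncTime := time.Now()         // the worker's last sync happened after that event

	// Buggy stamp: nanoseconds misread as seconds puts the entry far in the
	// future, so GetNewerThan returns immediately even though nothing new happened.
	c.Set("pod-1", "created", time.Unix(createdAt, 0))
	fmt.Println("buggy stamp:", c.GetNewerThan("pod-1", lastSyncTime))

	// Correct stamp: the entry is older than lastSyncTime, so the worker
	// blocks until the next event (container started) arrives.
	c.Set("pod-1", "created", time.Unix(0, createdAt))
	go func() {
		time.Sleep(200 * time.Millisecond)
		c.Set("pod-1", "started", time.Now())
	}()
	fmt.Println("correct stamp:", c.GetNewerThan("pod-1", lastSyncTime))
}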

@HirazawaUi (Contributor) commented Sep 8, 2024:

Generic PLEG also updates the global cache every second:

This might be because, at this point, we have ensured that all pods in the cache have been correctly updated.

@yujuhong, is my understanding correct? If not, please correct me.

However, for the current updateGlobalCache(), we don’t have such a guarantee, so the same strategy shouldn’t be applied.

However, it might better to remove GetNewerThan() from podWorkerLoop() as mentioned at #124704 (comment). I wonder if I would revert the latter commit in order to fix the apparent bug at first.

You might fix only the timestamp here, but a more consistent approach would be to also fix:

// Get the statuses of all containers visible to the pod and
// timestamp from sandboxStatus.
timestamp = time.Unix(resp.Timestamp, 0)
for _, cs := range resp.ContainersStatuses {
	cStatus := m.convertToKubeContainerStatus(cs)
	containerStatuses = append(containerStatuses, cStatus)
}

Contributor Author:

You might want to only fix the timestamp. A more consistent approach would be to also fix:

// Get the statuses of all containers visible to the pod and
// timestamp from sandboxStatus.
timestamp = time.Unix(resp.Timestamp, 0)
for _, cs := range resp.ContainersStatuses {
	cStatus := m.convertToKubeContainerStatus(cs)
	containerStatuses = append(containerStatuses, cStatus)
}

This looks like a different issue (#126414) to me because the runtime (CRI-O) sets this timestamp in seconds. This should be fixed in both kubernetes and CRI-O synchronously.
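(Purely as an illustration of why the two fixes need to land together; this snippet is mine, not from the PR. If the kubelet read resp.Timestamp as nanoseconds while the runtime still reported seconds, the resulting time would collapse to shortly after the Unix epoch.)

package main

import (
	"fmt"
	"time"
)

func main() {
	// CRI-O (before its fix) reports the sandbox status timestamp in seconds.
	secondsFromRuntime := time.Now().Unix()

	asSeconds := time.Unix(secondsFromRuntime, 0)     // current kubelet reading: correct
	asNanoseconds := time.Unix(0, secondsFromRuntime) // nanosecond reading of a seconds value: ~1970

	fmt.Println("read as seconds:    ", asSeconds.UTC())
	fmt.Println("read as nanoseconds:", asNanoseconds.UTC())
}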

Contributor:

Making this change does not affect our existing behavior. As the following code shows, given a Unix timestamp in seconds, the comparison behaves consistently whether the value is kept in seconds or passed through the nanosecond argument.

package main

import (
	"fmt"
	"time"
)

func main() {
	if !compare1() && !compare2() {
		fmt.Printf("Their behavior is consistent")
	}
}

func compare2() bool {
	eventedTime := time.Now().UnixNano()
	time.Sleep(100 * time.Millisecond)
	// The initial timestamp is seconds
	genericTime := time.Now().Unix()

	cachedEventedTime := time.Unix(0, eventedTime)
	// Pass the seconds value through the nanosecond argument
	cachedGenericTime := time.Unix(0, genericTime)

	if cachedEventedTime.Before(cachedGenericTime) {
		fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return true
	} else {
		fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return false
	}
}

func compare1() bool {
	eventedTime := time.Now().UnixNano()
	time.Sleep(100 * time.Millisecond)
	// The initial timestamp is seconds
	genericTime := time.Now().Unix()

	cachedEventedTime := time.Unix(0, eventedTime)
	// Keep the seconds value in the seconds argument
	cachedGenericTime := time.Unix(genericTime, 0)

	if cachedEventedTime.Before(cachedGenericTime) {
		fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return true
	} else {
		fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return false
	}
}

Therefore, we can first fix kubelet and then fix CRI-O to ensure that this functionality works properly in the end.

Member:

It looks to me like CRI-O returns nanoseconds: https://github.com/cri-o/cri-o/blob/36c2750916eb9e75f01a627463c2cfad207f192d/server/server.go#L934 ? Am I missing something?

Contributor:

The PodSandboxStatus interface in CRI-O returns the timestamp in seconds, and kubelet uses this timestamp in genericPLEG.

crio:
https://github.com/cri-o/cri-o/blob/11934304d11256189df8e316e195af6650a472aa/server/sandbox_status.go#L43-L48

kubelet:

status, err := g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)
if err != nil {
	// nolint:logcheck // Not using the result of klog.V inside the
	// if branch is okay, we just use it to determine whether the
	// additional "podStatus" key and its value should be added.
	if klog.V(6).Enabled() {
		klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
	} else {
		klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
	}
} else {
	if klogV := klog.V(6); klogV.Enabled() {
		klogV.InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
	} else {
		klog.V(4).InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
	}
	// Preserve the pod IP across cache updates if the new IP is empty.
	// When a pod is torn down, kubelet may race with PLEG and retrieve
	// a pod status after network teardown, but the kubernetes API expects
	// the completed pod's IP to be available after the pod is dead.
	status.IPs = g.getPodIPs(pid, status)
}
// When we use Generic PLEG only, the PodStatus is saved in the cache without
// any validation of the existing status against the current timestamp.
// This works well when there is only Generic PLEG setting the PodStatus in the cache however,
// if we have multiple entities, such as Evented PLEG, while trying to set the PodStatus in the
// cache we may run into the racy timestamps given each of them were to calculate the timestamps
// in their respective execution flow. While Generic PLEG calculates this timestamp and gets
// the PodStatus, we can only calculate the corresponding timestamp in
// Evented PLEG after the event has been received by the Kubelet.
// For more details refer to:
// https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg#timestamp-of-the-pod-status
if utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG) && isEventedPLEGInUse() && status != nil {
	timestamp = status.TimeStamp
}
return err, g.cache.Set(pod.ID, status, err, timestamp)

I have already filed an issue in the CRI-O project, and it should be fixed soon, ref: cri-o/cri-o#8580

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 10, 2024
@hshiina (Contributor, Author) commented Sep 10, 2024:

Please let me start over.

/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

@k8s-ci-robot (Contributor):

@hshiina: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e | 0289216 | link | false | /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hshiina (Contributor, Author) commented Sep 10, 2024:

Now, only the Evented PLEG caching fix remains in this PR.

The jobs with EventedPLEG enabled mostly passed because this fix mitigates #123087. (Conversely, this will make it harder to reproduce #123087.)

In addition, another issue appeared with my fix. There was a failure in the job:

E2eNode Suite: [It] [sig-node] [NodeFeature:SidecarContainers] Containers Lifecycle [It] should not hang in termination if terminated during initialization [sig-node, NodeFeature:SidecarContainers]
{ failed [FAILED] should delete in < 5 seconds, took 10.071795
Expected
    <float64>: 10.071794756
to be <
    <int64>: 5
In [It] at: k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:3470 @ 09/10/24 08:54:31.819
}

With this fix, a pod worker is blocked at GetNewerThan():

status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)

In kubelet.log we can see the pod worker blocked during pod termination, which caused the failure:

I0910 08:54:21.758046	4615 kubelet.go:2423] "SyncLoop DELETE" source="api" pods=["containers-lifecycle-test-6189/dont-hang-if-terminated-in-init"]
I0910 08:54:21.758096	4615 pod_workers.go:854] "Pod is marked for graceful deletion, begin teardown" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="update"
I0910 08:54:21.758123	4615 pod_workers.go:963] "Notifying pod of pending update" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" workType="terminating"
I0910 08:54:21.758145	4615 pod_workers.go:970] "Cancelling current pod sync" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" workType="terminating"
I0910 08:54:21.758180	4615 pod_workers.go:1233] "Processing pod event" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminating"

<The pod worker was being blocked at `GetNewerThan()`>

I0910 08:54:25.234136    4615 kubelet.go:2006] "SyncTerminatingPod enter" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"

<snip: The pod worker ran `SyncTerminatingPod()`>

I0910 08:54:25.736947	4615 kubelet.go:2093] "SyncTerminatingPod exit" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.736961	4615 pod_workers.go:1402] "Pod terminated all containers successfully" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.736962	4615 status_manager.go:227] "Syncing updated statuses"
I0910 08:54:25.737009	4615 pod_workers.go:1513] "Pending update already queued" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"
I0910 08:54:25.737031	4615 pod_workers.go:1338] "Processing pod event done" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminating"
I0910 08:54:25.737049	4615 pod_workers.go:1233] "Processing pod event" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7" updateType="terminated"

<The pod worker was being blocked at `GetNewerThan()`>

I0910 08:54:30.235579	4615 kubelet.go:2151] "SyncTerminatedPod enter" pod="containers-lifecycle-test-6189/dont-hang-if-terminated-in-init" podUID="41a59818-90ea-4b92-8b33-92b9afda42f7"

This issue is addressed in #124704. I shouldn't have mixed it into this PR; it would be better to fix it separately.

@HirazawaUi (Contributor):

Thanks to the rapid action of the CRI-O team, this bug has already been fixed and backported to older releases, which means you can now apply the corresponding fix here as well:

// Get the statuses of all containers visible to the pod and
// timestamp from sandboxStatus.
timestamp = time.Unix(resp.Timestamp, 0)
for _, cs := range resp.ContainersStatuses {
	cStatus := m.convertToKubeContainerStatus(cs)
	containerStatuses = append(containerStatuses, cStatus)
}

ref: cri-o/cri-o#8582

@SergeyKanzhelev (Member) left a comment:

/lgtm
/approve

thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2024
@k8s-ci-robot (Contributor):

LGTM label has been added.

Git tree hash: 52a94895c6f7cb0c2e354b53094fa69dab14a195

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hshiina, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2024
@SergeyKanzhelev (Member):

/retest

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubelet
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note-none Denotes a PR that doesn't merit a release note.
sig/node Categorizes an issue or PR as relevant to SIG Node.
size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
triage/accepted Indicates an issue or PR is ready to be actively worked on.