EventedPLEG: Pass event created timestamp correctly to cache #124297
Conversation
`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`.
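For reference, `time.Unix(sec, nsec)` takes seconds as its first argument and nanoseconds as its second, so a nanosecond value such as the CRI event timestamp must go in the second position. A minimal sketch (not code from the PR):

package main

import (
	"fmt"
	"time"
)

func main() {
	createdAt := time.Now().UnixNano() // CRI event timestamps are Unix nanoseconds

	// time.Unix(sec, nsec): seconds first, nanoseconds second.
	fmt.Println(time.Unix(0, createdAt)) // correct: the actual event time
	fmt.Println(time.Unix(createdAt, 0)) // incorrect: nanoseconds treated as seconds, far in the future
}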
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 12 13:49:11 UTC 2024.
Welcome @hshiina!
Hi @hshiina. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/triage accepted
@@ -264,7 +264,7 @@ func (e *EventedPLEG) processCRIEvents(containerEventsResponseCh chan *runtimeap
 			}
 			shouldSendPLEGEvent = true
 		} else {
-			if e.cache.Set(podID, status, err, time.Unix(event.GetCreatedAt(), 0)) {
+			if e.cache.Set(podID, status, err, time.Unix(0, event.GetCreatedAt())) {
@hshiina great catch! Thank you!
/lgtm
/test pull-kubernetes-e2e-kind-evented-pleg
/retest
pkg/kubelet/pleg/evented.go
Outdated
@@ -125,7 +125,7 @@ func (e *EventedPLEG) Start() {
 	e.stopCh = make(chan struct{})
 	e.stopCacheUpdateCh = make(chan struct{})
 	go wait.Until(e.watchEventsChannel, 0, e.stopCh)
-	go wait.Until(e.updateGlobalCache, globalCacheUpdatePeriod, e.stopCacheUpdateCh)
+	go wait.Until(e.updateGlobalCache, e.globalCacheUpdatePeriod, e.stopCacheUpdateCh)
After we modify the timestamp from seconds to nanoseconds, do we really need to execute `updateGlobalCache()` at a faster interval?
`updateGlobalCache()` serves as a last resort for recovery from failures. We shouldn't perform this action for any pod that can retrieve podStatus normally (because it is not the actual timestamp returned by the runtime).
The 5-second interval comes from the timeout set in `NewRemoteRuntimeService`. Only if we modify the timeout of `NewRemoteRuntimeService` to 1 second, and assume that a runtime that fails to respond within 1 second is in a failure state, can we use `updateGlobalCache()` for failure recovery at that faster interval.
Pod workers are expected to be unblocked from `GetNewerThan()` in one second at pod termination:
kubernetes/pkg/kubelet/pod_workers.go
Lines 1247 to 1256 in f5c5384
		default:
			// wait until we see the next refresh from the PLEG via the cache (max 2s)
			// TODO: this adds ~1s of latency on all transitions from sync to terminating
			// to terminated, and on all termination retries (including evictions). We should
			// improve latency by making the pleg continuous and by allowing pod status
			// changes to be refreshed when key events happen (killPod, sync->terminating).
			// Improving this latency also reduces the possibility that a terminated
			// container's status is garbage collected before we have a chance to update the
			// API server (thus losing the exit code).
			status, err = p.podCache.GetNewerThan(update.Options.Pod.UID, lastSyncTime)
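For context, `GetNewerThan()` blocks the pod worker until the cache holds a status with a timestamp newer than `lastSyncTime`. A hypothetical, simplified sketch of that blocking behavior (the real cache in kubelet, pkg/kubelet/container/cache.go, is more involved and also wakes waiters via a global cache timestamp):

package main

import (
	"fmt"
	"sync"
	"time"
)

// miniCache is a hypothetical stand-in for the kubelet pod status cache.
type miniCache struct {
	mu       sync.Mutex
	cond     *sync.Cond
	modified time.Time // when the cached status was last updated
	status   string
}

func newMiniCache() *miniCache {
	c := &miniCache{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Set records a status with the given timestamp and wakes any waiters.
func (c *miniCache) Set(status string, ts time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.status, c.modified = status, ts
	c.cond.Broadcast()
}

// GetNewerThan blocks until the cached status is newer than minTime.
func (c *miniCache) GetNewerThan(minTime time.Time) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	for !c.modified.After(minTime) {
		c.cond.Wait()
	}
	return c.status
}

func main() {
	c := newMiniCache()
	lastSyncTime := time.Now()

	go func() {
		time.Sleep(100 * time.Millisecond)
		// A correct nanosecond timestamp unblocks the waiter only once
		// the cached status really is newer than lastSyncTime.
		c.Set("running", time.Unix(0, time.Now().UnixNano()))
	}()

	fmt.Println("worker got:", c.GetNewerThan(lastSyncTime)) // blocks ~100ms, then prints "running"
}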
Generic PLEG also updates the global cache every second:
kubernetes/pkg/kubelet/pleg/generic.go
Line 356 in f5c5384
g.cache.UpdateTime(timestamp)
However, it might be better to remove `GetNewerThan()` from `podWorkerLoop()`, as mentioned at #124704 (comment). I wonder if I should revert the latter commit in order to fix the apparent bug first.
Generic PLEG also updates the global cache every second:
This might be because, at this point, we have ensured that all pods in the cache have been correctly updated.
@yujuhong, is my understanding correct? If not, please correct me.
However, for the current updateGlobalCache(), we don’t have such a guarantee, so the same strategy shouldn’t be applied.
However, it might be better to remove `GetNewerThan()` from `podWorkerLoop()`, as mentioned at #124704 (comment). I wonder if I should revert the latter commit in order to fix the apparent bug first.
You might want to only fix the timestamp. A more consistent approach would be to also fix:
kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go
Lines 1583 to 1589 in f5c5384
	// Get the statuses of all containers visible to the pod and
	// timestamp from sandboxStatus.
	timestamp = time.Unix(resp.Timestamp, 0)
	for _, cs := range resp.ContainersStatuses {
		cStatus := m.convertToKubeContainerStatus(cs)
		containerStatuses = append(containerStatuses, cStatus)
	}
You might want to only fix the timestamp. A more consistent approach would be to also fix:
kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go
Lines 1583 to 1589 in f5c5384
	// Get the statuses of all containers visible to the pod and
	// timestamp from sandboxStatus.
	timestamp = time.Unix(resp.Timestamp, 0)
	for _, cs := range resp.ContainersStatuses {
		cStatus := m.convertToKubeContainerStatus(cs)
		containerStatuses = append(containerStatuses, cStatus)
	}
This looks like a different issue (#126414) to me because the runtime (CRI-O) sets this timestamp in seconds. This should be fixed in both kubernetes and CRI-O synchronously.
Making this change does not affect our existing behavior. Referring to the following code, given a unix timestamp in seconds, whether it is kept in seconds or converted to nanoseconds, the behavior will remain consistent.
package main

import (
	"fmt"
	"time"
)

func main() {
	// The two strategies are consistent if they order the timestamps the same way.
	if compare1() == compare2() {
		fmt.Printf("Their behavior is consistent")
	}
}

func compare2() bool {
	eventedTime := time.Now().UnixNano()
	time.Sleep(100 * time.Millisecond)
	// The initial timestamp is seconds
	genericTime := time.Now().Unix()
	cachedEventedTime := time.Unix(0, eventedTime)
	// Convert the seconds value to nanoseconds
	cachedGenericTime := time.Unix(0, genericTime*int64(time.Second))
	if cachedEventedTime.Before(cachedGenericTime) {
		fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return true
	} else {
		fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return false
	}
}

func compare1() bool {
	eventedTime := time.Now().UnixNano()
	time.Sleep(100 * time.Millisecond)
	// The initial timestamp is seconds
	genericTime := time.Now().Unix()
	cachedEventedTime := time.Unix(0, eventedTime)
	// Keep in seconds
	cachedGenericTime := time.Unix(genericTime, 0)
	if cachedEventedTime.Before(cachedGenericTime) {
		fmt.Printf("Expected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return true
	} else {
		fmt.Printf("Unexpected: evented:%v, generic:%v\n", cachedEventedTime, cachedGenericTime)
		return false
	}
}
Therefore, we can first fix kubelet and then fix CRI-O to ensure that this functionality works properly in the end.
It looks to me that CRI-O returns nanoseconds: https://github.com/cri-o/cri-o/blob/36c2750916eb9e75f01a627463c2cfad207f192d/server/server.go#L934. Am I missing something?
The `PodSandboxStatus` interface in CRI-O returns the timestamp in seconds, and kubelet uses this timestamp in `genericPLEG`.
kubelet:
kubernetes/pkg/kubelet/pleg/generic.go
Lines 445 to 482 in 342da50
		status, err := g.runtime.GetPodStatus(ctx, pod.ID, pod.Name, pod.Namespace)
		if err != nil {
			// nolint:logcheck // Not using the result of klog.V inside the
			// if branch is okay, we just use it to determine whether the
			// additional "podStatus" key and its value should be added.
			if klog.V(6).Enabled() {
				klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
			} else {
				klog.ErrorS(err, "PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
			}
		} else {
			if klogV := klog.V(6); klogV.Enabled() {
				klogV.InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name), "podStatus", status)
			} else {
				klog.V(4).InfoS("PLEG: Write status", "pod", klog.KRef(pod.Namespace, pod.Name))
			}
			// Preserve the pod IP across cache updates if the new IP is empty.
			// When a pod is torn down, kubelet may race with PLEG and retrieve
			// a pod status after network teardown, but the kubernetes API expects
			// the completed pod's IP to be available after the pod is dead.
			status.IPs = g.getPodIPs(pid, status)
		}

		// When we use Generic PLEG only, the PodStatus is saved in the cache without
		// any validation of the existing status against the current timestamp.
		// This works well when there is only Generic PLEG setting the PodStatus in the cache however,
		// if we have multiple entities, such as Evented PLEG, while trying to set the PodStatus in the
		// cache we may run into the racy timestamps given each of them were to calculate the timestamps
		// in their respective execution flow. While Generic PLEG calculates this timestamp and gets
		// the PodStatus, we can only calculate the corresponding timestamp in
		// Evented PLEG after the event has been received by the Kubelet.
		// For more details refer to:
		// https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3386-kubelet-evented-pleg#timestamp-of-the-pod-status
		if utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG) && isEventedPLEGInUse() && status != nil {
			timestamp = status.TimeStamp
		}

		return err, g.cache.Set(pod.ID, status, err, timestamp)
I have already filed an issue in the CRI-O project, and it should be fixed soon, ref: cri-o/cri-o#8580
Please let me start over.
/test pull-kubernetes-e2e-kind-evented-pleg
@hshiina: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Now, only Evented PLEG caching is being fixed. The jobs with EventedPLEG enabled mostly passed because this fix mitigates #123087. (Conversely, this makes it difficult to reproduce #123087.) In addition, another issue appeared with this fix. There was a failure in the job:
With this fix, a pod worker is blocked at:
kubernetes/pkg/kubelet/pod_workers.go
Line 1256 in d913914
We can see a pod worker blocked at pod termination in kubelet.log. This caused the failure:
This issue is addressed in #124704. I shouldn't have mixed it in this PR. It would be better to fix it separately. |
Due to the rapid action of the CRI-O team, this bug has already been fixed and backported to older versions, which means you can also apply the fix to:
kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go
Lines 1583 to 1589 in f5c5384
ref: cri-o/cri-o#8582
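For reference, once CRI-O reports the `PodSandboxStatus` timestamp in nanoseconds, the corresponding kubelet-side change would presumably mirror the one in this PR. A sketch only, not a committed change:

-	timestamp = time.Unix(resp.Timestamp, 0)
+	// Assuming resp.Timestamp is now reported in nanoseconds by the runtime:
+	timestamp = time.Unix(0, resp.Timestamp)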
/lgtm
/approve
thank you!
LGTM label has been added. Git tree hash: 52a94895c6f7cb0c2e354b53094fa69dab14a195
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hshiina, SergeyKanzhelev. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
What type of PR is this?
/kind bug

What this PR does / why we need it:
`CreatedAt` timestamp of `ContainerEventResponse` should be passed as nanoseconds to `time.Unix()`. This fix may mitigate #123087. I guess this issue happens as follows:
Usually, a container created event is delivered while a pod worker is creating a pod at SyncPod(), and the `created` status is cached by the PLEG. After `SyncPod()`, the pod worker waits at GetNewerThan() until the next event (container started) happens, because the cached container status is older than `lastSyncTime`. However, due to the bug fixed in this PR, the `modified` time of the cached status is much newer, because a nanoseconds value is set as seconds. Then, the pod worker gets a `created` container status from the cache, goes into `SyncPod()` again soon, and creates another container.
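To illustrate why the stale `created` status looks new enough, here is a small sketch (not code from the PR) showing that a Unix-nanosecond value interpreted as seconds by `time.Unix()` yields a timestamp absurdly far in the future, so it always compares as newer than `lastSyncTime`:

package main

import (
	"fmt"
	"time"
)

func main() {
	createdAt := time.Now().UnixNano() // nanoseconds, on the order of 1e18

	wrong := time.Unix(createdAt, 0) // bug: nanoseconds interpreted as seconds
	right := time.Unix(0, createdAt) // fix: nanoseconds interpreted as nanoseconds

	lastSyncTime := time.Now()
	// The buggy timestamp is billions of years in the future, so any cached
	// status immediately looks "newer" than lastSyncTime and GetNewerThan()
	// returns right away instead of waiting for the next event.
	fmt.Println("wrong newer than lastSyncTime:", wrong.After(lastSyncTime)) // true
	fmt.Println("right newer than lastSyncTime:", right.After(lastSyncTime)) // false (slightly older)
}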
Which issue(s) this PR fixes:
Fixes
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: