[Flaking Test] [EventedPLEG] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop #127312

pacoxu · 2024-09-12T07:36:40Z

Which jobs are flaking?

ci-crio-cgroupv1-evented-pleg

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-crio-cgroupv1-evented-pleg/1834117609461649408

Which tests are flaking?

E2eNode Suite.[It] [sig-node] [NodeConformance] Containers Lifecycle when a pod is terminating because its liveness probe fails should continue running liveness probes for restartable init containers and restart them while in preStop [NodeConformance]

Since when has it been flaking?

8/24

https://storage.googleapis.com/k8s-triage/index.html?date=2024-09-12&job=ci-crio-cgroupv1-evented-pleg&test=%20Containers%20Lifecycle%20when%20a%20pod%20is%20terminating%20because%20its%20liveness%20probe%20fails%20should%20continue%20running%20liveness%20probes%20for%20restartable%20init%20containers%20and%20restart%20them%20while%20in%20preStop%20

Testgrid link

https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-evented-pleg

Reason for failure (if possible)

{ failed [FAILED] Expected an error to have occurred.  Got:
    <nil>: nil
In [It] at: k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:903 @ 08/23/24 18:16:25.131
}

Anything else we need to know?

No response

Relevant SIG(s)

/sig node


pacoxu · 2024-09-12T07:37:37Z

/cc @hshiina @SergeyKanzhelev

pacoxu · 2024-09-12T08:09:07Z

@hshiina this seems to be the same problem as #123087 (static pod when EventedPLEG is enabled).

Fix evented pleg mirror pod & use IsEventedPLEGInUse instead of FG status check #122778 (comment) has a way to reproduce it.
Restart the init container to not be stuck in created state #126543 fixed a Generic PLEG regression.

/cc @gjkim42 @liggitt

hshiina · 2024-09-12T12:54:12Z

As far as I can tell from the log, the containers do not appear to have been recreated.

If I understand correctly, this test works like:

1. The liveness probe for the regular container fails.
2. kubelet starts to stop the regular container. Then, the prestop hook is triggered.

If the liveness probe for the sidecar container runs and fails before the prestop hook starts, this assertion passes. If the probe runs while the prestop hook is running, this assertion fails:

kubernetes/test/e2e_node/container_lifecycle_test.go

Lines 902 to 903 in 7ad1eaa

    
    err = results.RunTogetherLhsFirst(prefixedName(PreStopPrefix, regular1), prefixedName(LivenessPrefix, restartableInit1))
    gomega.Expect(err).To(gomega.HaveOccurred())

I'm afraid I'm not sure what is supposed to guarantee that the liveness probe for the sidecar container (restartable-init-1) either runs or stops before the prestop hook starts.

hshiina · 2024-09-12T17:49:20Z

After #124297 was recently merged, another issue (#124704) appeared. Pod workers in kubelet sometimes get blocked for a few seconds, as in #124297 (comment). This may cause something like a race condition to surface.

SergeyKanzhelev · 2024-09-12T23:12:28Z

/retitle [Flaking Test] [EventedPLEG] Containers Lifecycle should continue running liveness probes for restartable init containers and restart them while in preStop

SergeyKanzhelev · 2024-09-12T23:13:07Z

Marking with evented PLEG.

Is the issue also happening outside the evented PLEG?

hshiina · 2024-09-13T09:21:23Z

I don't think this happens outside the evented PLEG.
Usually, the init container goes into CrashLoopBackOff before the liveness probe for the regular container, whose InitialDelaySeconds is 10, starts. So, the liveness probe for the init container does not run while the prestop hook is running.

If the pod worker is slowed down because it is blocked by #124704, the init container may not go into CrashLoopBackOff.

SergeyKanzhelev · 2024-09-18T17:33:33Z

/assign @hshiina
since the PR has been opened.

This is for an alpha feature and is NOT release blocking.

/priority backlog
/triage accepted

pacoxu · 2024-10-12T00:54:05Z

It failed in pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2 as well.

https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

pacoxu added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 12, 2024

k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 12, 2024

github-project-automation bot added this to SIG Node CI/Test Board Sep 12, 2024

github-project-automation bot moved this to Triage in SIG Node CI/Test Board Sep 12, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 12, 2024

hshiina mentioned this issue Sep 12, 2024

Migrate PLEG to contextual logging #126843

Merged

hshiina linked a pull request Sep 13, 2024 that will close this issue

EventedPLEG: Stop waiting for cache update when it is not expected #124953

Open

pacoxu mentioned this issue Sep 18, 2024

pleg: enable tests on file changes kubernetes/test-infra#33463

Open

k8s-ci-robot assigned hshiina Sep 18, 2024

k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 18, 2024

SergeyKanzhelev moved this from Triage to Issues - In progress in SIG Node CI/Test Board Sep 18, 2024

This was referenced Oct 4, 2024

e2e node: Test probes during pod termination #127863

Open

EventedPLEG: Update global cache timestamp more frequently #127954

Open

pacoxu mentioned this issue Oct 11, 2024

Failing SIG-Node presubmit jobs #127831

Open

8 tasks
