
Extending AnomalousStatus to also kill sh steps #405

Merged
jglick merged 4 commits into jenkinsci:master from the AnomalousStatus branch on Nov 7, 2024

Conversation

jglick (Member) commented Nov 5, 2024

In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS Replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata, which prevents node block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing program.dat, sometimes a corrupted log file, sometimes a missing FlowNode, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow (program.dat is so opaque), the node block seems to have received an Outcome.abnormal with the expected FlowInterruptedException from ExecutorStepDynamicContext.resume; there are also some suppressed exceptions, and AnomalousStatus just adds to that list without causing the build to proceed. Calling CpsStepContext.scheduleNextRun() from the script console does not help either. However, in most of these cases it does seem to work to abort the sh step running inside: somehow that “wakes up” the program, which then fails the node block in the expected way, letting the retry step kick in and ultimately letting the build run to completion.
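That manual workaround can be scripted. A rough sketch of aborting the running sh steps, in Java form (in the Groovy script console a closure would replace the lambda; the interruption cause shown here is only illustrative, not necessarily the one used in practice):

    // Illustrative sketch: abort every running sh (DurableTaskStep) step so that
    // the enclosing node block fails and a surrounding retry can resume the build.
    import hudson.model.Result;
    import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
    import org.jenkinsci.plugins.workflow.steps.StepExecution;
    import org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep;

    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        // Fail the step's context; the node block then completes abnormally as expected.
        exec.getContext().onFailure(new FlowInterruptedException(Result.ABORTED));
        return null;
    });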

jglick (Member, Author) commented Nov 6, 2024

Ineffective.

jglick closed this Nov 6, 2024
jglick deleted the AnomalousStatus branch November 6, 2024 17:51
jglick restored the AnomalousStatus branch November 6, 2024 18:46
jglick reopened this Nov 6, 2024
jglick requested a review from dwnusbaum November 6, 2024 23:51
jglick marked this pull request as ready for review November 6, 2024 23:51
jglick requested a review from a team as a code owner November 6, 2024 23:51
// Also abort any shell steps running on the same node(s):
if (!affectedNodes.isEmpty()) {
    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        if (affectedNodes.contains(exec.node)) {
dwnusbaum (Member) commented:
Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check exec.state.cookie or something more precise than just the node name instead?

jglick (Member, Author) replied:

In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional on one executor and node block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent node blocks in the same build on the same agent.)
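For reference, a sketch of how the hunk under discussion might complete the abort (illustrative only; the exact interruption cause and any logging in the merged change may differ):

    // Abort DurableTaskStep executions whose agent is one of the nodes flagged by
    // AnomalousStatus, so the node block fails and any enclosing retry can resume the build.
    if (!affectedNodes.isEmpty()) {
        StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
            if (affectedNodes.contains(exec.node)) {
                // Fail the sh step's context; the node block then fails in the expected way.
                exec.getContext().onFailure(new FlowInterruptedException(Result.ABORTED));
            }
            return null;
        });
    }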

jglick merged commit 6a3e903 into jenkinsci:master Nov 7, 2024 — 17 checks passed
jglick deleted the AnomalousStatus branch November 7, 2024 16:16