
Extending AnomalousStatus to also kill sh steps #405

Merged
jglick merged 4 commits into jenkinsci:master from the AnomalousStatus branch on Nov 7, 2024

Conversation

jglick (Member) commented Nov 5, 2024

In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS Replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata, which prevents node block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing program.dat, sometimes a corrupted log file, sometimes a missing FlowNode, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow (program.dat is so opaque), the node block seems to have received an Outcome.abnormal with the expected FlowInterruptedException from ExecutorStepDynamicContext.resume; there are also some suppressed exceptions, and AnomalousStatus just adds to that list without causing the build to proceed. Calling CpsStepContext.scheduleNextRun() from the script console does not help either. However, in most of these cases it does seem to work to abort the sh step running inside: somehow that “wakes up” the program, which then fails the node block in the expected way, letting the retry step kick in and ultimately letting the build run to completion.
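That manual workaround can be scripted. A rough sketch of aborting the running sh steps, in Java form (in the Groovy script console a closure would replace the lambda; the interruption cause shown here is only illustrative, not necessarily the one used in practice):

    // Illustrative sketch: abort every running sh (DurableTaskStep) step so that
    // the enclosing node block fails and a surrounding retry can resume the build.
    import hudson.model.Result;
    import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
    import org.jenkinsci.plugins.workflow.steps.StepExecution;
    import org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep;

    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        // Fail the step's context; the node block then completes abnormally as expected.
        exec.getContext().onFailure(new FlowInterruptedException(Result.ABORTED));
        return null;
    });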

jglick (Member, Author) commented Nov 6, 2024

Ineffective.

jglick closed this Nov 6, 2024
jglick deleted the AnomalousStatus branch November 6, 2024 17:51
jglick restored the AnomalousStatus branch November 6, 2024 18:46
jglick reopened this Nov 6, 2024
jglick requested a review from dwnusbaum November 6, 2024 23:51
jglick marked this pull request as ready for review November 6, 2024 23:51
jglick requested a review from a team as a code owner November 6, 2024 23:51
// Also abort any shell steps running on the same node(s):
if (!affectedNodes.isEmpty()) {
    StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
        if (affectedNodes.contains(exec.node)) {
dwnusbaum (Member) commented:
Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check exec.state.cookie or something more precise than just the node name instead?

jglick (Member, Author) replied:

In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional on one executor and node block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent node blocks in the same build on the same agent.)
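For reference, a sketch of how the hunk under discussion might complete the abort (illustrative only; the exact interruption cause and any logging in the merged change may differ):

    // Abort DurableTaskStep executions whose agent is one of the nodes flagged by
    // AnomalousStatus, so the node block fails and any enclosing retry can resume the build.
    if (!affectedNodes.isEmpty()) {
        StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
            if (affectedNodes.contains(exec.node)) {
                // Fail the sh step's context; the node block then fails in the expected way.
                exec.getContext().onFailure(new FlowInterruptedException(Result.ABORTED));
            }
            return null;
        });
    }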

jglick merged commit 6a3e903 into jenkinsci:master Nov 7, 2024 — 17 checks passed
jglick deleted the AnomalousStatus branch November 7, 2024 16:16