Extending `AnomalousStatus` to also kill `sh` steps #405

Conversation
Review comment on src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java:

Ineffective.
    // Also abort any shell steps running on the same node(s):
    if (!affectedNodes.isEmpty()) {
        StepExecution.applyAll(DurableTaskStep.Execution.class, exec -> {
            if (affectedNodes.contains(exec.node)) {
Are there cases (e.g. nodes with multiple executors?) where this could abort steps which are running fine if another step on the same node is having unusual problems? If so, could we check `exec.state.cookie` or something more precise than just the node name instead?
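The concern can be illustrated with a small self-contained sketch (not Jenkins code; `Exec` here is a hypothetical stand-in for `DurableTaskStep.Execution`, keeping only a node name and a per-`node`-block cookie): matching on node name alone hits every step on a multi-executor agent, while matching on the cookie hits only the affected block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DurableTaskStep.Execution, reduced to the two
// fields relevant to the matching question.
class Exec {
    final String node;   // agent the step runs on
    final String cookie; // identifies the enclosing `node` block
    boolean aborted;
    Exec(String node, String cookie) { this.node = node; this.cookie = cookie; }
}

public class MatchDemo {
    // Abort every execution on the named node (the behavior in the diff above).
    static List<Exec> abortByNode(List<Exec> execs, String node) {
        List<Exec> hit = new ArrayList<>();
        for (Exec e : execs) {
            if (e.node.equals(node)) { e.aborted = true; hit.add(e); }
        }
        return hit;
    }

    // More precise variant: abort only executions belonging to the anomalous
    // `node` block, identified by its cookie.
    static List<Exec> abortByCookie(List<Exec> execs, String cookie) {
        List<Exec> hit = new ArrayList<>();
        for (Exec e : execs) {
            if (e.cookie.equals(cookie)) { e.aborted = true; hit.add(e); }
        }
        return hit;
    }

    public static void main(String[] args) {
        // Two concurrent node blocks of one build on a two-executor agent:
        List<Exec> execs = new ArrayList<>();
        execs.add(new Exec("agent-1", "cookie-A")); // the broken block
        execs.add(new Exec("agent-1", "cookie-B")); // running fine
        System.out.println(abortByNode(execs, "agent-1").size());    // → 2
        System.out.println(abortByCookie(execs, "cookie-A").size()); // → 1
    }
}
```

The node-name filter collaterally aborts the healthy step on the shared agent; the cookie filter does not.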
In theory perhaps, but this monitor is normally used for cloud nodes with one executor, and it seems unlikely that the agent could be connected and functional on one executor and `node` block while broken in another one of the same build. (For that matter, it would rarely make any sense to run two concurrent `node` blocks in the same build on the same agent.)
In many cases when restoring a build into another K8s cluster using a very lossy backup of the filesystem, via EFS Replication (which does not guarantee snapshot semantics), there is some sort of problem with metadata, which prevents `node` block retry from recovering automatically. (Prior to jenkinsci/kubernetes-plugin#1617 it did not work even if metadata was perfect.) Sometimes there is a missing `program.dat`, sometimes a corrupted log file, sometimes a missing `FlowNode`, etc.

But in many of these cases (CloudBees-internal reference), the log seems fine and the flow nodes seem fine, yet for reasons I cannot easily follow because `program.dat` is so opaque, the `node` block seems to have received an `Outcome.abnormal` with the expected `FlowInterruptedException` from `ExecutorStepDynamicContext.resume`; there are also some suppressed exceptions, and `AnomalousStatus` just adds to this list, without causing the build to proceed. Calling `CpsStepContext.scheduleNextRun()` from the script console does not help either. However, in most of these cases it does seem to work to abort the `sh` step running inside: somehow that “wakes up” the program, which then fails the `node` block in the expected way, letting the `retry` step kick in and ultimately letting the build run to completion.
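The “wake up” effect described above can be modeled with a toy sketch using plain `CompletableFuture` (no Jenkins APIs; the futures named after the `sh` step and the `node` block are only stand-ins): the outer block can only finish once the inner step completes, so aborting the stuck inner step is what lets the outer block fail in the normal, retryable way.

```java
import java.util.concurrent.CompletableFuture;

public class WakeUpDemo {
    // Toy model: returns the outcome of the outer `node` block, which depends
    // on how the inner `sh` step completes.
    static String outerOutcome(boolean abortInnerStep) throws Exception {
        CompletableFuture<Void> sh = new CompletableFuture<>();   // inner sh step
        CompletableFuture<String> node = sh.handle((ok, err) ->   // outer node block
                err == null ? "node succeeded"
                            : "node failed: " + err.getMessage());
        if (abortInnerStep) {
            // The script-console workaround: abort the stuck inner step.
            sh.completeExceptionally(new InterruptedException("sh step aborted"));
        } else {
            sh.complete(null);
        }
        // Until the inner future completes one way or the other, this would
        // block forever, mirroring the stuck build.
        return node.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(outerOutcome(true)); // → node failed: sh step aborted
    }
}
```

In the real system the failure propagated to the `node` block is a `FlowInterruptedException`, which a surrounding `retry` step then handles.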