Handling github job cancellation signals

Demos for github job cancellation signal handling to show exactly how it behaves. (Spoiler: badly).

Why is this needed? Because the github actions docs for this appear to be nonexistent at the time of writing, and the answers on the relevant github "community" forum post are nowhere near clear or detailed enough. (It's always possible I'm just bad at searching, but you'd think the post above would link to the docs if that were the case).

I was having issues with Terraform runs leaving lock files lying around on S3 state storage buckets when github jobs were cancelled. This shouldn't be happening, because terraform will try to clean up and remove its lock file when gracefully killed. Given the lack of usable github docs, I had to write a demo to find out what exactly happens when a github actions job is cancelled.

TL;DR: unless special care is taken, child processes don't get any signal delivered to them before they're hard-destroyed when a github actions job is cancelled. As a workaround, if: always() steps can be used for the necessary cleanup, since they still run on cancel. Or for simple cases you can exec your process, so it becomes the top-level process for a step and does receive signals on cancel.

The demo code

A set of github actions workflows demonstrates how github job cancellation signal handling works. To try them you must run them via github actions, then cancel them using the github actions UI.

  • cancel-test-exec-child-ignore-sigquit.yaml: exec's a script that ignores SIGINT, SIGQUIT and SIGTERM. It shows that github delivers a SIGINT, waits 7.5s, delivers a SIGTERM, waits 2.5s, then presumably sends a SIGKILL. It then runs any if: always() steps after destroying all processes running in the killed step.

  • cancel-test-exec-child.yaml: exec's the same script, but only ignores SIGINT, so the child process will terminate on the subsequent SIGTERM. This is more realistic, and the subsequent test cases do the same thing. (A sketch of a script along these lines appears after this list.)

    Since it checks the process tree in the if: always() cleanup step, this test also shows that github destroys all processes under the step recursively before it begins any cleanup steps. It must be keeping track of all processes.

  • cancel-test-shell-plain.yaml: Represents the "normal" case of a github actions step using a bash shell that runs a child process as a blocking command within the shell. You will see that the child process (the same script as the above demo) does not receive any SIGINT or SIGTERM. The bash leader process does, but you can't observe that, because bash defers trap execution until the foreground child exits when it's blocked waiting for one, and the whole lot gets SIGKILL'd before the child exits and control returns to bash.

    This means that the workload running in the inner script got no chance to clean up its work.

  • cancel-test-shell-sigfwd.yaml: Demonstrates that it is possible to use a top-level shell with job control enabled as a process-group leader that forwards signals to its child processes. It's ugly though. Because of deferred traps, every subcommand that needs a chance to handle signals must be run as a background job with & then waited for, and there's plenty of fiddling about to make it work.

    See comments in signal_forwarding_wrapper.sh for details.

  • cancel-test-long-running-if-always.yaml: Explores what happens when an if: always() step takes too long or refuses to exit.

    It seems like github will let the cleanup run for about 4 minutes then kill it, initially with a SIGINT.

    Repeated cancels sent during the if: always() run appear to have no effect.

    Interestingly, the job won't retain logs if the cleanup step doesn't exit within the overall job timeout; you can only see the logs if you were streaming them during the run.

  • test-daemonize.yaml and test-daemonize-cancel.yaml: Show that by detaching from the current session, a process can remain running after the end of the containing step, or after the step is cancelled.

    So it's not possible to rely on Github to terminate everything running under a step; there could still be workloads running when your if: always() steps run after a cancellation. You may need to record their pids and kill them yourself before doing any cleanup (see the second sketch after this list).

    I presume they're destroyed when the containing job is cleaned up - I didn't bother having the task report telemetry to some public endpoint so it could show how long it stays running for.
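A minimal sketch of the kind of signal-handling script the exec demos run (not the exact script in this repo): it logs and ignores SIGINT, and cleans up and exits on SIGTERM.

    #!/bin/bash
    # Log and otherwise ignore SIGINT; clean up and exit on SIGTERM.
    trap 'echo "got SIGINT, ignoring"' INT
    trap 'echo "got SIGTERM, cleaning up"; exit 0' TERM

    echo "started as pid $$"
    # Sleep in short bursts so pending traps get a chance to run promptly;
    # bash defers traps until the current foreground command finishes.
    while true; do
        sleep 1
    done

When exec'd from a run: step, a script like this replaces the step's bash process, so it is the pid that github signals on cancel.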
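And a sketch of the pid-recording idea for detached processes; daemon.pid is a hypothetical pidfile written by whatever launched the detached workload:

    # In the if: always() cleanup step: kill any recorded stray process
    # before doing the rest of the cleanup.
    if [ -f daemon.pid ]; then
        kill "$(cat daemon.pid)" 2>/dev/null || true
    fi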

Why child-process tasks don't get a chance to clean up on job cancel

Consider a simple job step like:

   - name: whatever
     shell: bash
     run: |
       my-long-running-task

You might expect that if the actions run is cancelled, my-long-running-task would get some kind of signal to give it a chance to clean up before the whole actions runner is destroyed. As if you'd pressed Ctrl-C in the shell, then waited a bit, then pressed Ctrl-\ (break).

In reality, it exits (presumably on SIGKILL) without any chance to clean up.

On cancel, github actions delivers a SIGINT only to the top-level process for the current step of each active job. Then 7.5s later it delivers a SIGTERM, again to the top-level process only. 2.5s later it sends a SIGKILL (presumably to everything in the process tree).

You'd think that's fine. But signals don't propagate down process trees, so child processes running under the top-level step process won't see a signal unless the top-level process explicitly forwards it.

A typical Github actions job will have a run step with shell: bash that invokes some task as a child process of the step's top-level shell. If you cancel such a job, github actions will signal the top pid (the shell) with SIGINT.

Bash will behave as documented:

When Bash is running without job control enabled and receives SIGINT while waiting for a foreground command, it waits until that foreground command terminates and then decides what to do about the SIGINT [...]

The child process never gets that SIGINT, so it never exits, and bash never gets to do anything with the signal. And you can't use a trap on SIGINT to forward signals to the child process(es) either, because:

If Bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes.

This issue isn't specific to bash; it's just a useful demo because bash is the widely used default shell for github actions.
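The usual way around the deferred-trap problem is the approach the signal forwarding demo takes: run the workload as a background job, forward signals to it from traps, and wait in a loop, since wait (unlike a foreground command) returns as soon as a trapped signal arrives. A rough sketch of that idea, reusing the hypothetical my-long-running-task placeholder (not the exact wrapper script in this repo):

    #!/bin/bash
    # Enable job control so the background child doesn't have SIGINT ignored.
    set -m

    # Run the workload in the background so this shell is free to run traps.
    my-long-running-task &
    child=$!

    # Forward SIGINT/SIGTERM to the child when this shell receives them.
    trap 'kill -INT  "$child" 2>/dev/null' INT
    trap 'kill -TERM "$child" 2>/dev/null' TERM

    # wait returns early (status > 128) when a trapped signal arrives, so
    # keep waiting until the child has really exited.
    status=0
    while true; do
        wait "$child"
        status=$?
        kill -0 "$child" 2>/dev/null || break
    done
    exit "$status"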

What github actions should be doing

Instead of signalling only the top-level step process, github actions should at least offer the option of signalling all processes running under the current step.

A simple albeit imperfect way to do this is to spawn the bash shell as a process group leader by setting the -m flag. Github actions would then signal the whole process group for each active step when a job cancel request is sent by sending a kill signal to the negated pid of the leader shell. This works well for simple cases, but process groups aren't nestable, so if some workload under the step creates its own process group it'll be orphaned and won't receive the process-group signal unless its parent notices and propagates it explicitly.
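For illustration only, signalling a whole process group is just a matter of sending to the negated pid of the group leader; $leader_pid here is a hypothetical stand-in for the pid of a step shell started as a process-group leader, and the delays mirror the timings observed in the demos:

    kill -s INT  -- "-$leader_pid"   # polite interrupt to the whole group
    sleep 7.5
    kill -s TERM -- "-$leader_pid"   # firmer request to the whole group
    sleep 2.5
    kill -s KILL -- "-$leader_pid"   # no more chances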

It'd be better still if it used linux sessions, pid namespaces, or other suitable constructs that can properly contain all processes in the step and signal them all at once.

While bash's usual awfulness contributes to making this hard to get right, GitHub's lack of docs, or of any means to configure the job signal handling to deliver to a process group, certainly makes it much worse.

Workarounds

Until/unless github improves their actions runners with configurable cancel timeouts and the option to signal the whole process tree, there are a few possible workarounds:

  • exec any long-running tasks that might need a chance to respond to SIGINT or SIGTERM with cleanup steps before termination, so they become the top-level process in a step (see the first sketch after this list). Split a series of such tasks into separate github actions steps instead of a single step with multiple sub-commands called by a shell.

    This'll work ok if your job can respond to SIGINT and clean up within 7.5s, or to SIGTERM and clean up within 2.5s. But if it's relying on network resources that's going to be iffy - and it probably is, since the runner itself is usually stateless.

  • Have your command or the controlling shell write a state file to the runner working directory and remove it on clean exit. Then use an if: always() step to check for the state file and perform the cleanup actions if it's found (see the second sketch after this list).

    There is going to be a race between the task completing its work and deleting the state file, so you'll need to ensure that the cleanup action is idempotent. In other words, it must be safe to run the cleanup steps twice.

    For example, if you're unlocking something you should write a unique lock-instance-id to your state file. Then when you unlock it, you can ensure the unlock request gets silently ignored if the lock is currently locked with a different lock-instance-id. Handily, this is how Terraform force-unlock works.

    It's unclear what the rules for if: always() actions are during github actions cancellation. From a quick experiment it looks like you've got about 4 minutes to complete cleanup, and if: always() steps ignore cancel requests.

  • Hope that steps inside Docker containers work better?

    I haven't tested this yet but maybe github has saner cancel behaviour when a job runs in a user-defined container?
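A minimal sketch of the exec workaround, reusing the hypothetical my-long-running-task placeholder from earlier; because exec replaces the step's bash process, the task becomes the pid that github signals on cancel:

    - name: Run the task as the step's top-level process
      shell: bash
      run: |
        # exec replaces this bash process with the task, so the task
        # receives the SIGINT/SIGTERM that github sends on cancel.
        exec my-long-running-task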
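And a sketch of the state-file approach. The file name, my-long-running-task and my-cleanup-command are placeholders; the cleanup must be idempotent, because the task may finish its work just before the state file is removed:

    - name: Do the work, recording that work is in progress
      shell: bash
      run: |
        touch work-in-progress
        my-long-running-task
        rm -f work-in-progress

    - name: Clean up if the work step didn't finish cleanly
      if: always()
      shell: bash
      run: |
        if [ -e work-in-progress ]; then
          # Idempotent cleanup: safe to run even if the task actually
          # completed between doing its work and deleting the state file.
          my-cleanup-command
          rm -f work-in-progress
        fi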

To investigate further

  • How github behaves with jobs using containers
  • Why github actions, which is a job control system, doesn't document fundamental and basic properties of job cancellation behaviour.

See also