Build cancellation leaves orphaned QEMU processes: no timeout on state polling loop #6786

## Problem

When a `podman build --platform <foreign-arch>` is cancelled (Ctrl+C / SIGINT), QEMU user-mode emulation processes (`qemu-*-static`) spawned during `RUN` steps are left running as orphans. They get reparented to PID 1 and persist indefinitely.

The cause is buildah's signal handling in `run_common.go`, which has no timeout on its container state polling loop and no cgroup-level kill fallback.

## Reproduction

Automated repro with GitHub Actions: **https://github.com/kaovilai/qemu-build-hang-repro**

The [`cleanup-gap.yml`](https://github.com/kaovilai/qemu-build-hang-repro/actions/workflows/cleanup-gap.yml) workflow demonstrates the issue:

1. Starts `podman build --platform linux/arm64` with a Containerfile that spawns long-running processes under QEMU emulation
2. Sends SIGINT → SIGTERM → SIGKILL to the `podman build` process
3. Checks for surviving processes afterward

**Result**: 5 orphaned `qemu-aarch64-static` processes survive. The `/bin/sh` parent is reparented to init (`PPid=1`), and its four `sleep 300` children keep running underneath it.

Example output from CI:
```
PID 2670: /usr/bin/qemu-aarch64-static /bin/sh -c ...   PPid: 1  wchan: sigsuspend
PID 2683: /usr/bin/qemu-aarch64-static /bin/sleep 300   PPid: 2670  wchan: hrtimer_nanosleep
PID 2685: /usr/bin/qemu-aarch64-static /bin/sleep 300   PPid: 2670  wchan: hrtimer_nanosleep
PID 2687: /usr/bin/qemu-aarch64-static /bin/sleep 300   PPid: 2670  wchan: hrtimer_nanosleep
PID 2689: /usr/bin/qemu-aarch64-static /bin/sleep 300   PPid: 2670  wchan: hrtimer_nanosleep
```

## Root Cause

### Signal handler sends SIGKILL but never times out (`run_common.go:656-705`)

When SIGINT/SIGTERM is received, the signal handler sends SIGKILL to the container via the OCI runtime, then polls container state every 100ms **with no deadline**:

```go
// line 659-664: send SIGKILL on any signal
go func() {
    for range interrupted {
        if err := kill("SIGKILL").Run(); err != nil {
            logrus.Errorf("%v sending SIGKILL", err)
        }
    }
}()

// line 665-705: poll state forever
for {
    select {
    case <-time.After(100 * time.Millisecond):
        stat := exec.Command(runtime, append(options.Args, "state", containerName)...)
        // checks if StateStopped — but never times out
    }
}
```

If the container processes don't respond to SIGKILL (e.g., QEMU in uninterruptible sleep / D state from [QEMU #2738](https://gitlab.com/qemu-project/qemu/-/work_items/2738)), this loop runs forever.

### Parent process has no timeout either (`run_common.go:1236-1244`)

The parent process forwards signals to the child subprocess but blocks indefinitely on `cmd.Wait()` (~line 1297):

```go
go func() {
    for receivedSignal := range interrupted {
        if err := cmd.Process.Signal(receivedSignal); err != nil {
            logrus.Infof("%v while attempting to forward %v to child process", err, receivedSignal)
        }
    }
}()
```

### No cgroup-level cleanup

Even when individual processes don't respond to SIGKILL, the container's cgroup could be used to force-kill all processes. This fallback doesn't exist.

## Impact

- **On Linux**: orphaned QEMU processes consume resources until manually killed
- **On macOS (podman machine)**: orphans persist inside the VM with no host-side visibility — user must `podman machine ssh` to discover and kill them
- **In CI/CD**: orphans can accumulate across builds, consuming runner resources

## Suggested Fix

Add a timeout to the state polling loop with cgroup kill fallback:

```go
deadline := time.After(30 * time.Second)
for {
    select {
    case <-deadline:
        logrus.Warnf("container %s did not stop after SIGKILL, force-killing cgroup", containerName)
        // Force-kill via cgroup as last resort
        cgroupKill(containerName)
        return
    case <-time.After(100 * time.Millisecond):
        // existing state check...
    }
}
```

## Workaround

Manual cleanup function that SSHes into podman machine and kills stuck QEMU processes:
https://github.com/kaovilai/dotfiles/blob/main/zsh/functions/podman-utils.zsh#L63

## Related

- [QEMU #2738 — golang 1.23 build hangs under qemu-user](https://gitlab.com/qemu-project/qemu/-/work_items/2738) (the QEMU bug that makes this worse)
- Repro repo: https://github.com/kaovilai/qemu-build-hang-repro

> [!Note]
> Responses generated with Claude
