fix(cleanup): kill Windows step process tree on cancel to avoid hang (#1011)

## Problem

Cancelling a job on a Windows host runner can leave the spawned process
tree running and hang the runner. When a step launches a shell that
starts a child which in turn spawns further GUI/background processes,
cancelling the job kills only the direct child (the default
`exec.CommandContext` behaviour). The surviving descendants inherited
the step's stdout/stderr pipe, so the read end never hit EOF and
`cmd.Wait()` blocked forever.

Because the step executor never returned:
- the orphaned processes kept running (the cancelled work was not
  actually stopped), and
- end-of-job cleanup (`Remove` → `terminateRunningProcesses`) was never
  reached, so the runner appeared to go offline / stop picking up jobs.

`CREATE_NEW_PROCESS_GROUP` does not help here — it affects Ctrl-C signal
delivery, not handle inheritance or tree termination.

## Fix

- Assign each Windows step process to a **Job Object** immediately after
  `cmd.Start()`. Descendants created afterwards are automatically part
  of the job.
- Override `cmd.Cancel` to `TerminateJobObject`, so cancellation kills
  the **entire descendant tree** atomically. This also closes the
  inherited pipe handles, so `cmd.Wait()` can return.
- Set `cmd.WaitDelay` (10s) as a safety net: if the I/O pipes are still
  open 10s after the process exits or the context is cancelled, `Wait`
  force-closes them and returns instead of blocking forever. This covers
  the case where the job-object setup fails (e.g. nested-job
  restrictions) and we fall back to the previous single-process kill.
- The Job Object is created **without** `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`,
  so closing the handle on normal completion does not kill legitimate
  background processes; the tree is only torn down on explicit cancel.

Implemented behind `runtime.GOOS == "windows"` with a Windows-only
`processKiller` (Job Object) and no-op stubs elsewhere, so non-Windows
behaviour (default cancellation + `Setpgid`) is unchanged.

## Changes

- `act/container/process_windows.go` — Job Object `processKiller`
  (create / assign / terminate).
- `act/container/process_other.go` — no-op stubs (`//go:build !windows`).
- `act/container/host_environment.go` — wire `cmd.Cancel` (tree kill)
  and `cmd.WaitDelay` into `exec()`.
- `go.mod` / `go.sum` — promote `golang.org/x/sys` to a direct
  dependency.

## Testing

I have fully tested this change.

## Notes

Follow-up to the Windows leftover-process reaping in #996: that sweep
now actually runs on cancellation because the step no longer hangs
before reaching it.

Reviewed-on: https://gitea.com/gitea/runner/pulls/1011
Reviewed-by: techknowlogick <9+techknowlogick@noreply.gitea.com>
Nicolas
2026-06-02 16:53:27 +00:00
parent f17b6b9fc3
commit c749e52bb7
6 changed files with 205 additions and 5 deletions


@@ -322,6 +322,30 @@ func (e *HostEnvironment) exec(ctx context.Context, command []string, cmdline st
	cmd.Stderr = e.StdOut
	cmd.Dir = wd
	cmd.SysProcAttr = getSysProcAttr(cmdline, false)
	// On Windows a step often launches a process tree (a shell that starts a
	// child which spawns further GUI or background processes). The default
	// context cancellation only kills the direct child, leaving the rest of the
	// tree running; and because the orphans inherit cmd's stdout/stderr pipe,
	// cmd.Wait() would block forever, hanging the runner. Kill the whole tree
	// via a Job Object on cancellation, and bound the wait so a leftover pipe
	// writer can never hang Wait indefinitely.
	var killer atomic.Pointer[processKiller]
	if runtime.GOOS == "windows" {
		cmd.Cancel = func() error {
			if k := killer.Load(); k != nil {
				return k.Kill()
			}
			if cmd.Process != nil {
				return cmd.Process.Kill()
			}
			return nil
		}
		// Once the step process has exited, give its I/O pipes at most this long
		// to drain before Wait force-closes them and returns (Go's WaitDelay).
		cmd.WaitDelay = 10 * time.Second
	}
	var ppty *os.File
	var tty *os.File
	defer func() {
@@ -351,6 +375,18 @@ func (e *HostEnvironment) exec(ctx context.Context, command []string, cmdline st
	if err := cmd.Start(); err != nil {
		return err
	}
	if runtime.GOOS == "windows" {
		// Assign the started process to a Job Object so cmd.Cancel can kill the
		// whole descendant tree. Children spawned afterwards are auto-included.
		// On failure (e.g. nested-job restrictions) we fall back to the default
		// single-process kill; WaitDelay + end-of-job cleanup still apply.
		if k, kerr := newProcessKiller(cmd.Process); kerr != nil {
			common.Logger(ctx).Warnf("process tree kill setup failed, falling back to single-process kill: %v", kerr)
		} else {
			killer.Store(k)
			defer k.Close()
		}
	}
	err = cmd.Wait()
	if err != nil {
		var exitErr *exec.ExitError