Files
act_runner/act/container/process_windows_test.go
Nicolas c749e52bb7 fix(cleanup): kill Windows step process tree on cancel to avoid hang (#1011)
## Problem

Cancelling a job on a Windows host runner can leave the spawned process
tree running and hang the runner. When a step launches a shell that
starts a child which in turn spawns further GUI/background processes,
cancelling the job kills only the direct child (the default
`exec.CommandContext` behaviour). The surviving descendants inherited
the step's stdout/stderr pipe, so the read end never hit EOF and
`cmd.Wait()` blocked forever.

Because the step executor never returned:
- the orphaned processes kept running (the cancelled work was not
  actually stopped), and
- end-of-job cleanup (`Remove` → `terminateRunningProcesses`) was never
  reached, so the runner appeared to go offline / stop picking up jobs.

`CREATE_NEW_PROCESS_GROUP` does not help here — it affects Ctrl-C signal
delivery, not handle inheritance or tree termination.

## Fix

- Assign each Windows step process to a **Job Object** immediately after
  `cmd.Start()`. Descendants created afterwards are automatically part
  of the job.
- Override `cmd.Cancel` to `TerminateJobObject`, so cancellation kills
  the **entire descendant tree** atomically. This also closes the
  inherited pipe handles, so `cmd.Wait()` can return.
- Set `cmd.WaitDelay` (10s) as a safety net: once the process has
  exited, Wait force-closes the pipes and returns rather than blocking
  forever — covering the case where the job-object setup fails (e.g.
  nested-job restrictions), in which we fall back to the previous
  single-process kill.
- The Job Object is created **without** `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`,
  so closing the handle on normal completion does not kill legitimate
  background processes; the tree is only torn down on explicit cancel.

Implemented behind `runtime.GOOS == "windows"` with a Windows-only
`processKiller` (Job Object) and no-op stubs elsewhere, so non-Windows
behaviour (default cancellation + `Setpgid`) is unchanged.

## Changes

- `act/container/process_windows.go` — Job Object `processKiller`
  (create / assign / terminate).
- `act/container/process_other.go` — no-op stubs (`//go:build !windows`).
- `act/container/host_environment.go` — wire `cmd.Cancel` (tree kill)
  and `cmd.WaitDelay` into `exec()`.
- `go.mod` / `go.sum` — promote `golang.org/x/sys` to a direct
  dependency.

## Testing

I fully tested it already

## Notes

Follow-up to the Windows leftover-process reaping in #996: that sweep
now actually runs on cancellation because the step no longer hangs
before reaching it.

Reviewed-on: https://gitea.com/gitea/runner/pulls/1011
Reviewed-by: techknowlogick <9+techknowlogick@noreply.gitea.com>
2026-06-02 16:53:27 +00:00

79 lines
2.5 KiB
Go

// Copyright 2026 The Gitea Authors. All rights reserved.
// SPDX-License-Identifier: MIT
package container
import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
"testing"
"time"
"github.com/stretchr/testify/require"
"golang.org/x/sys/windows"
)
// processAlive reports whether pid refers to a still-running process.
func processAlive(pid int) bool {
h, err := windows.OpenProcess(windows.PROCESS_QUERY_LIMITED_INFORMATION, false, uint32(pid))
if err != nil {
return false
}
defer windows.CloseHandle(h)
var code uint32
if err := windows.GetExitCodeProcess(h, &code); err != nil {
return false
}
const stillActive = 259 // STILL_ACTIVE
return code == stillActive
}
// TestProcessKillerKillsTree verifies that a process assigned to the Job Object
// is terminated together with a child it spawns afterwards. This mirrors a step
// that launches a child which spawns further processes, where cancelling the
// job must take down the whole tree, not just the direct child.
func TestProcessKillerKillsTree(t *testing.T) {
dir := t.TempDir()
pidFile := filepath.Join(dir, "child.pid")
// Parent powershell spawns a detached, long-lived child powershell (writing
// its PID to a file) and then sleeps. The child is launched AFTER the parent
// has been assigned to the job, so it must be captured by the job too.
script := fmt.Sprintf(
`$c = Start-Process powershell -PassThru -ArgumentList '-NoProfile','-Command','Start-Sleep -Seconds 600'; `+
`Set-Content -LiteralPath %q -Value $c.Id; Start-Sleep -Seconds 600`, pidFile)
cmd := exec.Command("powershell.exe", "-NoProfile", "-Command", script)
require.NoError(t, cmd.Start())
t.Cleanup(func() { _ = cmd.Process.Kill() })
killer, err := newProcessKiller(cmd.Process)
require.NoError(t, err)
defer killer.Close()
// Wait for the child PID to be reported.
var childPID int
require.Eventually(t, func() bool {
b, e := os.ReadFile(pidFile)
if e != nil {
return false
}
s := strings.TrimSpace(string(b))
if s == "" {
return false
}
childPID, _ = strconv.Atoi(s)
return childPID > 0 && processAlive(childPID)
}, 20*time.Second, 200*time.Millisecond, "child process should start")
// Killing the job must terminate both the parent and the detached child.
require.NoError(t, killer.Kill())
require.Eventually(t, func() bool {
return !processAlive(cmd.Process.Pid) && !processAlive(childPID)
}, 20*time.Second, 200*time.Millisecond, "parent and child should both be terminated")
}