fix(cleanup): kill Unix step process group on cancel to avoid hang (#1025)

Cancelling a job on a Linux/macOS host runner can leave the spawned process
tree running and hang the runner — the same failure mode fixed for Windows in
#1011, just on the other platforms.

Steps are launched as process-group leaders (`Setpgid`, or `Setsid` for the PTY
path), but the default `exec.CommandContext` cancellation only kills the
**direct child**. When a step launches a shell that starts a child which in turn
spawns further background processes, cancelling the job leaves the descendants
running. Because those orphans inherited the step's stdout/stderr pipe, the read
end never hits EOF and `cmd.Wait()` blocks forever.

Because the step executor never returns:
- the orphaned processes keep running (the cancelled work is not actually
  stopped), and
- end-of-job cleanup is never reached, so the runner appears to go offline / stop
  picking up jobs.

## Fix

Apply the same tree-kill approach as Windows, using the Unix counterpart of a Job
Object: the **process group**.

- Add a Unix `processKiller` (`process_unix.go`) that captures the step's PGID
  (== PID, since the step is launched as a group leader) and sends `SIGKILL` to
  the whole group on cancellation. This also closes the inherited pipe handles so
  `cmd.Wait()` can return. `ESRCH` (group already gone) is not treated as an error.
- Restrict the previous no-op stub (`process_other.go`) to `plan9` and have it
  fall back to a single-process kill, preserving plan9's prior behaviour.
- Wire `cmd.Cancel` (tree kill) and `cmd.WaitDelay` (10s) **unconditionally** in
  `exec()` instead of Windows-only. `WaitDelay` also covers a step that
  backgrounds a process holding the pipe open after the main process exits.

Reviewed-on: https://gitea.com/gitea/runner/pulls/1025
Reviewed-by: Zettat123 <39446+zettat123@noreply.gitea.com>
This commit is contained in:
Nicolas
2026-06-14 20:52:42 +00:00
parent 205af7cd01
commit 3996d6d032
4 changed files with 202 additions and 37 deletions

View File

@@ -0,0 +1,100 @@
// Copyright 2026 The Gitea Authors. All rights reserved.
// SPDX-License-Identifier: MIT
//go:build !windows && !plan9
package container
import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
"syscall"
"testing"
"time"
"github.com/stretchr/testify/require"
)
// processAlive reports whether pid refers to a still-running process. Signal 0
// performs error checking without delivering a signal: a nil error (or EPERM)
// means the process exists, ESRCH means it is gone.
//
// On Linux, zombie processes (state Z in /proc/<pid>/stat) appear alive to
// kill(0) but have already terminated — their corpse lingers until the parent
// calls wait(). In a Docker container the child may be reparented to a PID 1
// that does not reap promptly, so we treat zombies as not alive.
func processAlive(pid int) bool {
err := syscall.Kill(pid, 0)
if err != nil {
return false
}
// On Linux /proc is available; check whether the process is a zombie.
if b, readErr := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid)); readErr == nil {
// Format: "pid (comm) state ..." — state follows the closing ')' of the
// command name (which may itself contain spaces and parens).
rest := string(b)
if idx := strings.LastIndex(rest, ") "); idx >= 0 {
fields := strings.Fields(rest[idx+2:])
if len(fields) > 0 && fields[0] == "Z" {
return false // zombie: terminated but not yet reaped
}
}
}
return true
}
// TestProcessKillerKillsTree verifies that a process group captured by the
// killer is terminated together with a child the step spawns afterwards. This
// mirrors a step that launches a child which spawns further processes, where
// cancelling the job must take down the whole tree, not just the direct child.
func TestProcessKillerKillsTree(t *testing.T) {
dir := t.TempDir()
pidFile := filepath.Join(dir, "child.pid")
// Parent shell backgrounds a long-lived child (writing its PID to a file)
// and then sleeps. With job control off (non-interactive sh) the backgrounded
// child stays in the parent's process group, so the group kill must reach it.
script := fmt.Sprintf(`sleep 600 & echo $! > %q; sleep 600`, pidFile)
cmd := exec.Command("/bin/sh", "-c", script)
// Launch as its own process-group leader, exactly like a real step does (see
// getSysProcAttr), so the killer's PGID == the process PID.
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
require.NoError(t, cmd.Start())
t.Cleanup(func() {
_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
_ = cmd.Wait()
})
killer, err := newProcessKiller(cmd.Process)
require.NoError(t, err)
defer killer.Close()
// Wait for the backgrounded child PID to be reported.
var childPID int
require.Eventually(t, func() bool {
b, e := os.ReadFile(pidFile)
if e != nil {
return false
}
s := strings.TrimSpace(string(b))
if s == "" {
return false
}
childPID, _ = strconv.Atoi(s)
return childPID > 0 && processAlive(childPID)
}, 20*time.Second, 100*time.Millisecond, "child process should start")
// Killing the group must terminate both the parent and the backgrounded child.
require.NoError(t, killer.Kill())
// Reap the parent so it does not linger as a zombie (which would still report
// as alive); SIGKILL makes Wait return promptly.
_ = cmd.Wait()
require.Eventually(t, func() bool {
return !processAlive(childPID)
}, 20*time.Second, 100*time.Millisecond, "backgrounded child should be terminated")
}