mirror of
https://gitea.com/gitea/act_runner.git
synced 2026-06-15 14:24:22 +02:00
fix(cleanup): kill Unix step process group on cancel to avoid hang (#1025)
Cancelling a job on a Linux/macOS host runner can leave the spawned process tree running and hang the runner. This is the same failure mode fixed for Windows in #1011, just on the other platforms.

Steps are launched as process-group leaders (`Setpgid`, or `Setsid` for the PTY path), but the default `exec.CommandContext` cancellation only kills the **direct child**. When a step launches a shell that starts a child which in turn spawns further background processes, cancelling the job leaves the descendants running. Because those orphans inherited the step's stdout/stderr pipe, the read end never hits EOF and `cmd.Wait()` blocks forever.

Because the step executor never returns:

- the orphaned processes keep running (the cancelled work is not actually stopped), and
- end-of-job cleanup is never reached, so the runner appears to go offline / stop picking up jobs.

## Fix

Apply the same tree-kill approach as Windows, using the Unix counterpart of a Job Object: the **process group**.

- Add a Unix `processKiller` (`process_unix.go`) that captures the step's PGID (== PID, since the step is launched as a group leader) and sends `SIGKILL` to the whole group on cancellation. This also closes the inherited pipe handles so `cmd.Wait()` can return. `ESRCH` (group already gone) is not treated as an error.
- Restrict the previous no-op stub (`process_other.go`) to `plan9` and have it fall back to a single-process kill, preserving plan9's prior behaviour.
- Wire `cmd.Cancel` (tree kill) and `cmd.WaitDelay` (10s) **unconditionally** in `exec()` instead of Windows-only. `WaitDelay` also covers a step that backgrounds a process holding the pipe open after the main process exits.

Reviewed-on: https://gitea.com/gitea/runner/pulls/1025
Reviewed-by: Zettat123 <39446+zettat123@noreply.gitea.com>
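
The wiring described in the fix can be sketched outside the runner as follows. This is a minimal illustration, not the code from this commit: `killGroup` and `runCancellable` are hypothetical names, the `sh -c` script stands in for a step, and the sketch assumes Go 1.20 or later (for `cmd.Cancel` and `cmd.WaitDelay`) on a Unix host.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"os/exec"
	"syscall"
	"time"
)

// killGroup (hypothetical helper) sends SIGKILL to every process in the
// group pgid. ESRCH means the group is already gone and is ignored.
func killGroup(pgid int) error {
	if pgid <= 0 {
		return nil
	}
	err := syscall.Kill(-pgid, syscall.SIGKILL)
	if err != nil && !errors.Is(err, syscall.ESRCH) {
		return err
	}
	return nil
}

// runCancellable (hypothetical) runs script under sh as a process-group
// leader, cancels it after cancelAfterMs milliseconds, and reports how many
// milliseconds cmd.Wait took to return, together with its error.
func runCancellable(script string, cancelAfterMs int) (int64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// A non-file writer makes os/exec create a pipe, like the step's
	// stdout/stderr pipe that orphans would otherwise keep open.
	var out bytes.Buffer

	cmd := exec.CommandContext(ctx, "sh", "-c", script)
	// New process group whose ID equals the child's PID.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	cmd.Stdout = &out
	// Replace the default single-process kill with a group kill.
	// Cancel only runs after Start succeeded, so cmd.Process is set.
	cmd.Cancel = func() error { return killGroup(cmd.Process.Pid) }
	// Upper bound on waiting for the pipe to close after cancellation.
	cmd.WaitDelay = 10 * time.Second

	start := time.Now()
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	timer := time.AfterFunc(time.Duration(cancelAfterMs)*time.Millisecond, cancel)
	defer timer.Stop()

	err := cmd.Wait()
	return time.Since(start).Milliseconds(), err
}
```

Calling `runCancellable("sleep 30 & sleep 30", 200)` from a `main` function should return a non-nil error long before either `sleep` would have finished, because the backgrounded `sleep` stays in the shell's process group and is killed with it. With the default cancellation and no `WaitDelay`, only the shell would be killed and the backgrounded `sleep` would keep the stdout pipe open until it exited.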
56
act/container/process_unix.go
Normal file
@@ -0,0 +1,56 @@
// Copyright 2026 The Gitea Authors. All rights reserved.
// SPDX-License-Identifier: MIT

//go:build !windows && !plan9

package container

import (
	"errors"
	"os"
	"syscall"
)

// processKiller terminates a step process together with its whole process
// group, which is the Unix counterpart of the Windows Job Object tree-kill.
//
// Background: a step often launches a process tree (a shell that starts a child
// which in turn spawns further background processes). The default
// exec.CommandContext cancellation only kills the direct child, so cancelling a
// job left the rest of the tree running. Because those orphans inherited the
// step's stdout/stderr pipe, cmd.Wait() also blocked forever and the runner
// hung.
//
// Steps are started with Setpgid (or Setsid for the PTY path, see
// getSysProcAttr), which makes the step process the leader of a new process
// group whose ID equals its PID. Signalling the negative PID delivers to every
// process still in that group, so we can tear down the whole tree atomically on
// cancellation, which also closes the inherited pipe handles so cmd.Wait() can
// return.
type processKiller struct {
	pgid int
}

// newProcessKiller captures the process group of p (an already-started
// process). Because the step is launched with Setpgid/Setsid, p is a group
// leader and its PGID equals its PID; children spawned afterwards stay in the
// same group unless they explicitly create their own.
func newProcessKiller(p *os.Process) (*processKiller, error) {
	return &processKiller{pgid: p.Pid}, nil
}

// Kill sends SIGKILL to the entire process group (the step process and every
// descendant that stayed in the group). A missing group (ESRCH) means the
// processes already exited and is not treated as an error.
func (k *processKiller) Kill() error {
	if k == nil || k.pgid <= 0 {
		return nil
	}
	if err := syscall.Kill(-k.pgid, syscall.SIGKILL); err != nil && !errors.Is(err, syscall.ESRCH) {
		return err
	}
	return nil
}

// Close is a no-op on Unix; there is no job handle to release.
func (k *processKiller) Close() error { return nil }
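
The statement in the comments that a process started with `Setpgid` leads a group whose ID equals its PID can be checked with a small standalone sketch. `leaderPGID` is a hypothetical helper that is not part of this commit, and `sleep` merely stands in for a step process; the sketch assumes a Unix host where `syscall.Getpgid` is available.

```go
package main

import (
	"os/exec"
	"syscall"
)

// leaderPGID (hypothetical) starts a short-lived child with Setpgid and
// returns the child's PID together with the process group ID the kernel
// reports for it. The child is killed and reaped before returning.
func leaderPGID() (pid, pgid int, err error) {
	cmd := exec.Command("sleep", "5")
	// Ask for a new process group led by the child.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err = cmd.Start(); err != nil {
		return 0, 0, err
	}
	defer func() {
		// Clean up the child; errors here are irrelevant to the check.
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
	}()

	pid = cmd.Process.Pid
	pgid, err = syscall.Getpgid(pid)
	return pid, pgid, err
}
```

If the two returned values match, storing `p.Pid` as the group ID and signalling its negative is enough to reach the whole group, without a separate `Getpgid` lookup.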