fix(host): bound host-environment cleanup and reclaim leaked scratch dirs (#1024)

Fixes #1023.

## Problem
In Windows host mode, a single stalled delete syscall (AV/EDR filter driver, unresponsive mount, dying disk) wedged the job forever at `Cleaning up container`. `HostEnvironment.Remove()` bounds every teardown phase (`terminateRunningProcesses`, both `removePathWithRetry` calls) except the `CleanUp` callback, which is an unbounded `os.RemoveAll(miscpath)` assigned in `startHostEnvironment`. The runner then held its capacity slot indefinitely, the task was reaped as a zombie, and there were no diagnostics.

## Fix
- **Bound the cleanup (availability):** `Remove()` now runs `CleanUp` under `hostCleanupTimeout` (30s) via `runWithTimeout`; on timeout it logs a warning and continues job completion. The stuck goroutine is left to finish (a delete syscall can't be interrupted). Added debug logs around the phase.
- **Reclaim the leak (disk hygiene):** a timed-out cleanup can leave a scratch dir behind, so the existing idle stale-dir sweep is extended to also remove orphaned host-mode scratch dirs (16-hex names) under `Host.WorkdirParent`, leaving the shared `tool_cache` and operator data untouched. The `bind_workdir` gate is dropped from `shouldRunIdleCleanup` so host-mode runners run the sweep.

Reviewed-on: https://gitea.com/gitea/runner/pulls/1024
Reviewed-by: Lunny Xiao <xiaolunwen@gmail.com>
Author: Nicolas
Date: 2026-06-14 14:14:43 +00:00
Parent: 56979e6ab8
Commit: 33e6d1d8ff
6 changed files with 302 additions and 31 deletions

@@ -40,11 +40,12 @@ runner:
   # The runner uses exponential backoff when idle, increasing the interval up to this maximum.
   # Set to 0 or same as fetch_interval to disable backoff.
   fetch_interval_max: 5s
-  # While idle, remove stale bind-workdir task directories older than this duration.
-  # Setting either workdir_cleanup_age or idle_cleanup_interval to 0 (or any
-  # non-positive value) disables workdir cleanup entirely.
+  # While idle, remove stale bind-workdir task directories and orphaned host-mode
+  # scratch directories (left behind when a host cleanup delete stalls) older than
+  # this duration. Setting either workdir_cleanup_age or idle_cleanup_interval to 0
+  # (or any non-positive value) disables stale-directory cleanup entirely.
   workdir_cleanup_age: 24h
-  # Cadence for the idle stale bind-workdir cleanup pass.
+  # Cadence for the idle stale-directory cleanup pass.
   idle_cleanup_interval: 10m
   # The base interval for periodic log flush to the Gitea instance.
   # Logs may be sent earlier if the buffer reaches log_report_batch_size