feat: add Prometheus metrics endpoint for runner observability (#820)

## What

Add an optional Prometheus `/metrics` HTTP endpoint to `act_runner` so operators can observe runner health, polling behavior, job outcomes, and RPC latency without scraping logs.

New surface:

- `internal/pkg/metrics/metrics.go` — metric definitions, custom `Registry`, static Go/process collectors, label constants, `ResultToStatusLabel` helper.
- `internal/pkg/metrics/server.go` — hardened `http.Server` serving `/metrics` and `/healthz` with Slowloris-safe timeouts (`ReadHeaderTimeout` 5s, `ReadTimeout`/`WriteTimeout` 10s, `IdleTimeout` 60s) and a 5s graceful shutdown.
- `daemon.go` wires it up behind `cfg.Metrics.Enabled` (disabled by default).
- `poller.go` / `reporter.go` / `runner.go` instrument their existing hot paths with counters/histograms/gauges — no behavior change.
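
For reference, `internal/pkg/metrics/server.go` amounts to a `promhttp` handler on a timeout-hardened `http.Server`. A minimal sketch of the approach (the `registry` variable and log wording are illustrative; `StartServer(ctx, addr)` matches how `daemon.go` calls it):

```go
package metrics

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	log "github.com/sirupsen/logrus"
)

// registry is the runner's private registry (no global side effects).
var registry = prometheus.NewRegistry()

// StartServer serves /metrics and /healthz on addr until ctx is cancelled.
func StartServer(ctx context.Context, addr string) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		_, _ = w.Write([]byte("ok"))
	})

	srv := &http.Server{
		Addr:              addr,
		Handler:           mux,
		ReadHeaderTimeout: 5 * time.Second, // Slowloris protection
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      10 * time.Second,
		IdleTimeout:       60 * time.Second,
	}

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.WithError(err).Error("metrics server failed")
		}
	}()

	go func() {
		<-ctx.Done() // daemon is shutting down
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx) // drain in-flight scrapes, then close
	}()
}
```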

Metrics exported (namespace `act_runner_`):

| Subsystem | Metric | Type | Labels |
|---|---|---|---|
| — | `info` | Gauge | `version`, `name` |
| — | `capacity` | Gauge | — |
| — | `uptime_seconds` | GaugeFunc | — |
| — | `client_errors_total` | Counter | `method` |
| `poll` | `fetch_total` | Counter | `result` |
| `poll` | `fetch_duration_seconds` | Histogram | — |
| `poll` | `backoff_seconds` | Gauge | — |
| `job` | `total` | Counter | `status` |
| `job` | `duration_seconds` | Histogram | — |
| `job` | `running` | GaugeFunc | — |
| `job` | `capacity_utilization_ratio` | GaugeFunc | — |
| `report` | `log_total` | Counter | `result` |
| `report` | `state_total` | Counter | `result` |
| `report` | `log_duration_seconds` | Histogram | — |
| `report` | `state_duration_seconds` | Histogram | — |
| `report` | `log_buffer_rows` | Gauge | — |
| — | `go_*`, `process_*` | standard collectors | — |

All label values are predefined constants — **no high-cardinality labels** (no task IDs, repo URLs, branches, tokens, or secrets) so scraping is safe and bounded.
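
Definition-wise, `metrics.go` is plain `client_golang`. A representative sketch of how a couple of these metrics and the label constants fit together (help strings and the exact constant values are illustrative, the variable names follow the instrumentation sites):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

const namespace = "act_runner"

// Label values are fixed constants, so series cardinality stays bounded.
const (
	LabelResultTask  = "task"
	LabelResultEmpty = "empty"
	LabelResultError = "error"

	LabelMethodFetchTask = "FetchTask"
)

var (
	// act_runner_poll_fetch_total{result="task|empty|error"}
	PollFetchTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace,
		Subsystem: "poll",
		Name:      "fetch_total",
		Help:      "FetchTask long-poll attempts by outcome.",
	}, []string{"result"})

	// act_runner_poll_fetch_duration_seconds (idle long-poll timeouts excluded)
	PollFetchDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Namespace: namespace,
		Subsystem: "poll",
		Name:      "fetch_duration_seconds",
		Help:      "Latency of FetchTask RPCs that returned.",
	})

	// act_runner_client_errors_total{method="FetchTask",...}
	ClientErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace,
		Name:      "client_errors_total",
		Help:      "Runner-to-Gitea RPC errors by method.",
	}, []string{"method"})
)
```

Everything here is registered once in `Init()` against the private registry (see the Safety section below), so there is no global registration at import time.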

## Why

Teams self-hosting Gitea + `act_runner` at scale need to answer basic SRE questions that the runner currently gives them no visibility into:

- How often are RPCs failing? Which RPC? (`act_runner_client_errors_total`)
- Are runners saturated? (`act_runner_job_capacity_utilization_ratio`, `act_runner_job_running`)
- How long do jobs take? (`act_runner_job_duration_seconds`)
- Is polling backing off? (`act_runner_poll_backoff_seconds`, `act_runner_poll_fetch_total{result="error"}`)
- Are log/state reports slow? (`act_runner_report_{log,state}_duration_seconds`)
- Is the log buffer draining? (`act_runner_report_log_buffer_rows`)

Today operators have to grep logs. This PR makes all of the above first-class metrics so they can feed dashboards and alerts (`rate(act_runner_client_errors_total[5m]) > 0.1`, capacity saturation alerts, etc.).

The endpoint is **disabled by default** and binds to `127.0.0.1:9101` when enabled, so it's opt-in and safe for existing deployments.

## How

### Config

```yaml
metrics:
  enabled: false           # opt-in
  addr: 127.0.0.1:9101     # change to 0.0.0.0:9101 only behind a reverse proxy
```

`config.example.yaml` documents both fields plus a security note about binding externally without auth.
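
On the Go side this is just a new config section. A sketch of the shape implied by `cfg.Metrics.Enabled` and `cfg.Metrics.Addr` (the YAML tags and default handling shown here are assumptions):

```go
package config

// Metrics configures the optional Prometheus endpoint.
// The zero value keeps it disabled, so existing config files work unchanged.
type Metrics struct {
	Enabled bool   `yaml:"enabled"` // opt-in; defaults to false
	Addr    string `yaml:"addr"`    // defaults to "127.0.0.1:9101" when empty
}
```

The top-level `Config` struct gains a matching `Metrics` field, which is what `daemon.go` checks before starting the server.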

### Wiring

1. `daemon.go` calls `metrics.Init()` (guarded by `sync.Once`), sets `act_runner_info`, `act_runner_capacity`, registers uptime + running-jobs GaugeFuncs, then starts the server goroutine with the daemon context — it shuts down cleanly on `ctx.Done()`.
2. `poller.fetchTask` observes RPC latency / result / error counters. `DeadlineExceeded` (long-poll idle) is treated as an empty result and **not** observed into the histogram so the 5s timeout doesn't swamp the buckets.
3. `poller.pollOnce` reports `poll_backoff_seconds` using the pre-jitter base interval (the true backoff level), and only when it changes — prevents noisy no-op gauge updates at the `FetchIntervalMax` plateau.
4. `reporter.ReportLog` / `ReportState` record duration histograms and success/error counters; `log_buffer_rows` is updated only when the value changes, guarded by the already-held `clientM`.
5. `runner.Run` observes `job_duration_seconds` and increments `job_total` by outcome via `metrics.ResultToStatusLabel`.
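
For item 5, `ResultToStatusLabel` is just a bounded mapping from the protobuf result enum to fixed `status` label values, roughly as below (assuming `reporter.Result()` yields a `runnerv1.Result`; only the `success` string is confirmed by the smoke test, the other label values are illustrative):

```go
package metrics

import runnerv1 "code.gitea.io/actions-proto-go/runner/v1"

// ResultToStatusLabel maps a task result to one of a fixed set of status
// label values, keeping act_runner_job_total's cardinality bounded.
func ResultToStatusLabel(r runnerv1.Result) string {
	switch r {
	case runnerv1.Result_RESULT_SUCCESS:
		return "success"
	case runnerv1.Result_RESULT_FAILURE:
		return "failure"
	case runnerv1.Result_RESULT_CANCELLED:
		return "cancelled"
	case runnerv1.Result_RESULT_SKIPPED:
		return "skipped"
	default:
		return "unknown"
	}
}
```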

### Safety / security review

- All timeouts set; Slowloris-safe.
- Custom `prometheus.NewRegistry()` — no global registration side-effects.
- No sensitive data in labels (reviewed every instrumentation site).
- Single new dependency: `github.com/prometheus/client_golang v1.23.2`.
- Endpoint is unauthenticated by design and documented as such; default localhost bind mitigates exposure. Operators exposing externally should front it with a reverse proxy.
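
On the registry point, the `go_*` / `process_*` series come from the standard collectors registered onto the private registry, roughly as follows (a sketch; the variable names and the exact contents of `Init()` beyond what the diff shows are assumptions):

```go
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

var (
	initOnce sync.Once
	registry = prometheus.NewRegistry() // private; no DefaultRegisterer side effects
)

// Init registers all collectors exactly once, even if called again.
func Init() {
	initOnce.Do(func() {
		registry.MustRegister(
			collectors.NewGoCollector(),                                       // go_* runtime metrics
			collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}), // process_* metrics
			// ...plus the act_runner_* counters, gauges, and histograms.
		)
	})
}
```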

## Verification

### Unit tests

```bash
go build ./...
go vet ./...
go test ./...
```

### Manual smoke test

1. Enable metrics in `config.yaml`:
   ```yaml
   metrics:
     enabled: true
     addr: 127.0.0.1:9101
   ```
2. Start the runner against a Gitea instance: `./act_runner daemon`.
3. Scrape the endpoint:
   ```bash
   curl -s http://127.0.0.1:9101/metrics | grep '^act_runner_'
   curl -s http://127.0.0.1:9101/healthz   # → ok
   ```
4. Confirm the static series appear immediately: `act_runner_info`, `act_runner_capacity`, `act_runner_uptime_seconds`, `act_runner_job_running`, `act_runner_job_capacity_utilization_ratio`.
5. Trigger a workflow and confirm counters increment: `act_runner_poll_fetch_total{result="task"}`, `act_runner_job_total{status="success"}`, `act_runner_report_log_total{result="success"}`.
6. Leave the runner idle and confirm `act_runner_poll_backoff_seconds` settles (and does **not** churn on every poll).
7. Ctrl-C and confirm a clean "metrics server shutdown" log line (no port-in-use error on restart within 5s).

### Prometheus integration

Add to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: act_runner
    static_configs:
      - targets: ['127.0.0.1:9101']
```

Sample alert to try:

```
sum(rate(act_runner_client_errors_total[5m])) by (method) > 0.1
```
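
For something closer to production, the same expression drops straight into a rule file. An illustrative example (alert names, thresholds, and `for` durations are starting points, not recommendations):

```yaml
groups:
  - name: act_runner
    rules:
      - alert: ActRunnerRPCErrors
        expr: sum by (method) (rate(act_runner_client_errors_total[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "act_runner RPC {{ $labels.method }} is failing"
      - alert: ActRunnerSaturated
        expr: act_runner_job_capacity_utilization_ratio >= 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Runner has been at full job capacity for 15 minutes"
```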

## Out of scope (follow-ups)

- TLS and auth on the metrics endpoint (mitigated today by localhost default; add when operators need external scraping).
- Per-task labels (intentionally avoided for cardinality safety).

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://gitea.com/gitea/act_runner/pulls/820
Reviewed-by: Lunny Xiao <xiaolunwen@gmail.com>
Co-authored-by: Bo-Yi Wu <appleboy.tw@gmail.com>
Co-committed-by: Bo-Yi Wu <appleboy.tw@gmail.com>

## Diff excerpts

### `daemon.go`

```diff
@@ -27,6 +27,7 @@ import (
     "gitea.com/gitea/act_runner/internal/pkg/config"
     "gitea.com/gitea/act_runner/internal/pkg/envcheck"
     "gitea.com/gitea/act_runner/internal/pkg/labels"
+    "gitea.com/gitea/act_runner/internal/pkg/metrics"
     "gitea.com/gitea/act_runner/internal/pkg/ver"
 )
@@ -149,6 +150,15 @@ func runDaemon(ctx context.Context, daemArgs *daemonArgs, configFile *string) fu
             resp.Msg.Runner.Name, resp.Msg.Runner.Version, resp.Msg.Runner.Labels)
     }
+    if cfg.Metrics.Enabled {
+        metrics.Init()
+        metrics.RunnerInfo.WithLabelValues(ver.Version(), resp.Msg.Runner.Name).Set(1)
+        metrics.RunnerCapacity.Set(float64(cfg.Runner.Capacity))
+        metrics.RegisterUptimeFunc(time.Now())
+        metrics.RegisterRunningJobsFunc(runner.RunningCount, cfg.Runner.Capacity)
+        metrics.StartServer(ctx, cfg.Metrics.Addr)
+    }
     poller := poll.New(cfg, cli, runner)
     if daemArgs.Once || reg.Ephemeral {
```

### `poller.go`

```diff
@@ -19,6 +19,7 @@ import (
     "gitea.com/gitea/act_runner/internal/app/run"
     "gitea.com/gitea/act_runner/internal/pkg/client"
     "gitea.com/gitea/act_runner/internal/pkg/config"
+    "gitea.com/gitea/act_runner/internal/pkg/metrics"
 )

 type Poller struct {
@@ -43,6 +44,10 @@ type Poller struct {
 type workerState struct {
     consecutiveEmpty  int64
     consecutiveErrors int64
+    // lastBackoff is the last interval reported to the PollBackoffSeconds gauge
+    // from this worker; used to suppress redundant no-op Set calls when the
+    // backoff plateaus (e.g. at FetchIntervalMax).
+    lastBackoff time.Duration
 }

 func New(cfg *config.Config, client client.Client, runner *run.Runner) *Poller {
@@ -166,8 +171,12 @@ func (p *Poller) pollOnce(s *workerState) {
     for {
         task, ok := p.fetchTask(p.pollingCtx, s)
         if !ok {
-            interval := addJitter(p.calculateInterval(s))
-            timer := time.NewTimer(interval)
+            base := p.calculateInterval(s)
+            if base != s.lastBackoff {
+                metrics.PollBackoffSeconds.Set(base.Seconds())
+                s.lastBackoff = base
+            }
+            timer := time.NewTimer(addJitter(base))
             select {
             case <-timer.C:
             case <-p.pollingCtx.Done():
@@ -205,15 +214,27 @@ func (p *Poller) fetchTask(ctx context.Context, s *workerState) (*runnerv1.Task,
     // Load the version value that was in the cache when the request was sent.
     v := p.tasksVersion.Load()
+    start := time.Now()
     resp, err := p.client.FetchTask(reqCtx, connect.NewRequest(&runnerv1.FetchTaskRequest{
         TasksVersion: v,
     }))
+    // DeadlineExceeded is the designed idle path for a long-poll: the server
+    // found no work within FetchTimeout. Treat it as an empty response and do
+    // not record the duration — the timeout value would swamp the histogram.
     if errors.Is(err, context.DeadlineExceeded) {
         err = nil
         s.consecutiveEmpty++
         s.consecutiveErrors = 0 // timeout is a healthy idle response
+        metrics.PollFetchTotal.WithLabelValues(metrics.LabelResultEmpty).Inc()
         return nil, false
     }
+    metrics.PollFetchDuration.Observe(time.Since(start).Seconds())
     if err != nil {
         log.WithError(err).Error("failed to fetch task")
         s.consecutiveErrors++
+        metrics.PollFetchTotal.WithLabelValues(metrics.LabelResultError).Inc()
+        metrics.ClientErrors.WithLabelValues(metrics.LabelMethodFetchTask).Inc()
         return nil, false
     }
@@ -222,6 +243,7 @@ func (p *Poller) fetchTask(ctx context.Context, s *workerState) (*runnerv1.Task,
     if resp == nil || resp.Msg == nil {
         s.consecutiveEmpty++
+        metrics.PollFetchTotal.WithLabelValues(metrics.LabelResultEmpty).Inc()
         return nil, false
     }
@@ -231,11 +253,13 @@ func (p *Poller) fetchTask(ctx context.Context, s *workerState) (*runnerv1.Task,
     if resp.Msg.Task == nil {
         s.consecutiveEmpty++
+        metrics.PollFetchTotal.WithLabelValues(metrics.LabelResultEmpty).Inc()
         return nil, false
     }

-    // got a task, set `tasksVersion` to zero to focre query db in next request.
+    // got a task, set `tasksVersion` to zero to force query db in next request.
     p.tasksVersion.CompareAndSwap(resp.Msg.TasksVersion, 0)
+    metrics.PollFetchTotal.WithLabelValues(metrics.LabelResultTask).Inc()
     return resp.Msg.Task, true
 }
```

### `runner.go`

```diff
@@ -12,6 +12,7 @@ import (
     "path/filepath"
     "strings"
     "sync"
+    "sync/atomic"
     "time"

     runnerv1 "code.gitea.io/actions-proto-go/runner/v1"
@@ -26,6 +27,7 @@ import (
     "gitea.com/gitea/act_runner/internal/pkg/client"
     "gitea.com/gitea/act_runner/internal/pkg/config"
     "gitea.com/gitea/act_runner/internal/pkg/labels"
+    "gitea.com/gitea/act_runner/internal/pkg/metrics"
     "gitea.com/gitea/act_runner/internal/pkg/report"
     "gitea.com/gitea/act_runner/internal/pkg/ver"
 )
@@ -41,6 +43,7 @@ type Runner struct {
     envs         map[string]string
     runningTasks sync.Map
+    runningCount atomic.Int64
 }

 func NewRunner(cfg *config.Config, reg *config.Registration, cli client.Client) *Runner {
@@ -96,16 +99,25 @@ func (r *Runner) Run(ctx context.Context, task *runnerv1.Task) error {
     r.runningTasks.Store(task.Id, struct{}{})
     defer r.runningTasks.Delete(task.Id)
+    r.runningCount.Add(1)
+    start := time.Now()

     ctx, cancel := context.WithTimeout(ctx, r.cfg.Runner.Timeout)
     defer cancel()
     reporter := report.NewReporter(ctx, cancel, r.client, task, r.cfg)
     var runErr error
     defer func() {
+        r.runningCount.Add(-1)
         lastWords := ""
         if runErr != nil {
             lastWords = runErr.Error()
         }
         _ = reporter.Close(lastWords)
+        metrics.JobDuration.Observe(time.Since(start).Seconds())
+        metrics.JobsTotal.WithLabelValues(metrics.ResultToStatusLabel(reporter.Result())).Inc()
     }()
     reporter.RunDaemon()
     runErr = r.run(ctx, task, reporter)
@@ -266,6 +278,10 @@ func (r *Runner) run(ctx context.Context, task *runnerv1.Task, reporter *report.
     return execErr
 }

+func (r *Runner) RunningCount() int64 {
+    return r.runningCount.Load()
+}
+
 func (r *Runner) Declare(ctx context.Context, labels []string) (*connect.Response[runnerv1.DeclareResponse], error) {
     return r.client.Declare(ctx, connect.NewRequest(&runnerv1.DeclareRequest{
         Version: ver.Version(),
```