feat(metrics): expand with additional metrics#449

Open
jalet wants to merge 3 commits into mogenius:main from jalet:main

jalet commented Jun 27, 2026

Adds ~23 Prometheus/OTel metrics for SRE observability (job lifecycle, saturation, discovery, scheduler, PR outcomes) plus webhook-integrity and secret-resolution metrics.

Adds a scheduledAt field to the RenovateJob CRD status to measure queue wait. go build/vet/test pass; docs in docs/metrics.md.

jalet marked this pull request as ready for review July 3, 2026 09:07
Copilot AI review requested due to automatic review settings July 3, 2026 09:07

Copilot AI left a comment

Pull request overview

This PR expands the operator’s observability surface by adding a broad set of Prometheus metrics (with OpenTelemetry emission for counters/histograms) spanning job lifecycle/latency, saturation, discovery, scheduler activity, PR outcomes, webhook integrity, and secret resolution. It also adds a scheduledAt timestamp to RenovateJob project status to measure queue wait time.

Changes:

  • Add ~23 new metrics in metricStore, with dual Prometheus + OTel emission for counters/histograms and Prometheus-only gauges.
  • Extend scheduler and webhook paths to emit new scheduler/webhook/security metrics.
  • Add scheduledAt to the RenovateJob CRD status and populate it when projects enter the Scheduled state, enabling queue-wait measurement.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/webhook/server.go | Adds webhook request/result metrics and auth-failure attribution for the schedule endpoint. |
| src/webhook/resolver.go | Introduces signatureWasUsed helper to attribute auth failures to signature vs token precedence. |
| src/webhook/gitlab.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for GitLab handler. |
| src/webhook/github.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for GitHub handler. |
| src/webhook/forgejo.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for Forgejo handler. |
| src/ui/renovateController_test.go | Updates scheduler mock to match the new namespace/job-based scheduler API. |
| src/scheduler/scheduler.go | Changes scheduler API to namespace/job identifiers and emits scheduler run/next-run metrics. |
| src/scheduler/scheduler_test.go | Updates scheduler tests to match the new scheduler API. |
| src/metricStore/metrics.go | Adds new metric definitions, OTel mirrors, and helper functions for emitting metrics. |
| src/metricStore/metrics_test.go | Adds unit tests validating the new metric helpers and cleanup behavior. |
| src/internal/utils/projectStatus.go | Sets ScheduledAt when transitioning a project into Scheduled state. |
| src/internal/renovate/jobHelper.go | Extends job status helper to return numeric duration and adds failure-reason derivation. |
| src/internal/renovate/jobHelper_test.go | Updates tests for the extended getJobStatus return signature. |
| src/internal/renovate/executor.go | Emits saturation/queue/PR/log/job-lifecycle metrics and queue-wait measurement on dispatch. |
| src/internal/renovate/discoveryAgent.go | Emits discovery job completion/failure and discovered repo count metrics. |
| src/internal/crdManager/renovateJobManager.go | Records repository-filtering stats and secret-resolution error metrics. |
| src/gitProviderClients/projectFilter.go | Returns filter stats alongside filtered projects for metric emission. |
| src/gitProviderClients/projectFilter_test.go | Updates tests for new FilterProjects return signature. |
| src/controllers/renovatejob_controller.go | Updates scheduler calls to the new namespace/job-based API. |
| src/controllers/renovatejob_controller_test.go | Updates fake scheduler to match the new scheduler API. |
| src/api/v1alpha1/renovatejob_types.go | Adds scheduledAt to ProjectStatus and deep-copy logic. |
| docs/metrics.md | Documents the expanded metric set and provides example alert rules. |
| charts/renovate-operator/crd/renovate-operator.mogenius.com_renovatejobs.yaml | Updates CRD schema to include scheduledAt in status. |
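The projectFilter.go change, returning filter stats alongside the filtered projects, might look roughly like the sketch below. The `filterStats` fields, the `filterProjects` signature, and the `keep` predicate are assumptions for illustration; the PR's real `FilterProjects` and `FilterStats` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// filterStats is a hypothetical shape for the stats returned next to the
// filtered list; the field names are assumptions.
type filterStats struct {
	Total    int // projects seen before filtering
	Kept     int // projects that passed the filter
	Filtered int // projects dropped by the filter
}

// filterProjects applies keep to every project and reports how many were
// kept and dropped, so the caller (which knows the namespace and job) can
// label and emit the metric itself.
func filterProjects(projects []string, keep func(string) bool) ([]string, filterStats) {
	kept := make([]string, 0, len(projects))
	for _, p := range projects {
		if keep(p) {
			kept = append(kept, p)
		}
	}
	stats := filterStats{Total: len(projects), Kept: len(kept)}
	stats.Filtered = stats.Total - stats.Kept
	return kept, stats
}

func main() {
	projects := []string{"team-a/api", "team-a/web", "team-b/infra"}
	kept, stats := filterProjects(projects, func(p string) bool {
		return strings.HasPrefix(p, "team-a/")
	})
	fmt.Println(kept, stats.Filtered)
}
```

Keeping metric emission out of the filter function leaves it free of label knowledge, which fits the stated reason for the signature change: the caller labels by namespace/job.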

jalet marked this pull request as draft July 3, 2026 09:14
jalet added 3 commits July 3, 2026 11:15
Add ~33 operator-domain metrics on top of the existing five, grouped for
two audiences: SRE (reliability, saturation, latency, outcomes) and
SecOps/CISO (authn/authz, webhook integrity, secrets, transport posture).

Foundation (metricStore):
- Declare, register, and add typed helpers for all new metrics.
- Dual-emit counters and histograms to Prometheus and OpenTelemetry; gauges
  stay Prometheus-only, matching the existing gauge convention.
- Extend DeleteProjectMetrics to tear down all new per-project series.

Wiring:
- SRE: jobs dispatched, job duration, queue wait, failures by reason,
  saturation gauges + global parallelism limit, discovery counts, repositories
  filtered, schedule runs / next-run, log-issue counts.
- Outcomes: open PRs, PRs created/merged/updated, awaiting approval,
  repositories by status, last execution duration.
- SecOps: UI auth attempts/exchange/verification/state/session/loop/
  unauthenticated, authz decisions and group filtering, webhook
  request/signature/auth/decode, secret resolution errors, OIDC TLS-posture
  gauge, Git-provider request/latency/rate-limit/TLS and filter fail-open.
  All security labels are bounded enums (no PII).

Schema/signature changes to unblock three metrics:
- Add ProjectStatus.ScheduledAt (CRD regenerated) to measure queue wait.
- FilterProjects returns FilterStats so the caller can label repositories
  filtered by namespace/job.
- Scheduler methods take namespace/job for correctly labeled schedule metrics;
  the internal key is unchanged.

Document the full catalog and example alerting rules in docs/metrics.md.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
…ility metrics

Remove the SecOps groups that aren't needed, keeping the SRE set plus webhook
integrity and secret-resolution metrics:

- Removed (F) UI authentication, (G) authorization, (J) OIDC/Git-provider TLS
  posture, and (K) Git-provider API reliability metrics: declarations, OTel
  instruments, registrations, and helpers.
- Reverted the ui package and Git-provider client instrumentation; deleted
  requestMetrics.go. projectFilter.go keeps the FilterStats return used by the
  repositories_filtered metric but drops the fail-open counter.
- Kept: original five, plus job lifecycle (A), saturation (B), discovery (C),
  scheduler (D), log quality (E), outcomes (L), webhook integrity (H), and
  secret resolution (I).
- Updated docs/metrics.md and unit tests to match.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
Address Copilot review feedback on the repositories_by_status gauge:

- Add NormalizeRepositoryStatus to map raw Renovate result strings to a bounded
  enum (disabled/no_config/onboarding/onboarding_closed/unknown/other), so an
  arbitrary finished.Result can no longer grow the metric's cardinality.
- Reset the gauge each executor tick before repopulating, so a status whose count
  falls to zero (or a deleted job) no longer leaves a stale value.
- Fix the misleading SetRepositoriesByStatus doc comment and the docs row to list
  the actual bounded status set.
- Add tests for the normalizer and reset behavior.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
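The two fixes in the last commit, bounding the status label and resetting the gauge before repopulating, can be sketched as follows. Only the enum values come from the commit message; the pass-through mapping, the `statusGauge` type (a plain map standing in for a Prometheus gauge vector), and `repopulate` are assumptions, and the real `NormalizeRepositoryStatus` may map raw Renovate results differently.

```go
package main

import "fmt"

// normalizeRepositoryStatus maps a raw result string onto the bounded set
// named in the commit message. Assumption: known values pass through, an
// empty result becomes "unknown", and anything else collapses to "other".
func normalizeRepositoryStatus(raw string) string {
	switch raw {
	case "disabled", "no_config", "onboarding", "onboarding_closed", "unknown":
		return raw
	case "":
		return "unknown"
	default:
		return "other"
	}
}

// statusGauge is a hypothetical stand-in for a labeled gauge.
type statusGauge struct {
	values map[string]float64
}

// Reset drops every series so stale statuses do not linger.
func (g *statusGauge) Reset() { g.values = map[string]float64{} }

// Set records the current count for one status label.
func (g *statusGauge) Set(status string, v float64) { g.values[status] = v }

// repopulate rebuilds the gauge from scratch on each tick: reset first,
// then set one series per normalized status that is currently present.
func repopulate(g *statusGauge, rawResults []string) {
	g.Reset()
	counts := map[string]float64{}
	for _, r := range rawResults {
		counts[normalizeRepositoryStatus(r)]++
	}
	for status, n := range counts {
		g.Set(status, n)
	}
}

func main() {
	g := &statusGauge{}
	repopulate(g, []string{"disabled", "some-new-result", "disabled"})
	fmt.Println(g.values)
	repopulate(g, []string{"onboarding"})
	fmt.Println(g.values)
}
```

After the second tick the gauge holds only the `onboarding` series; without the reset, the `disabled` and `other` series from the first tick would keep reporting their old counts.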
jalet marked this pull request as ready for review July 3, 2026 09:19

2 participants