feat(metrics): expand with additional metrics #449
Open
jalet wants to merge 3 commits into
Pull request overview
This PR expands the operator’s observability surface by adding a broad set of Prometheus metrics (with OpenTelemetry emission for counters/histograms) spanning job lifecycle/latency, saturation, discovery, scheduler activity, PR outcomes, webhook integrity, and secret resolution. It also adds a scheduledAt timestamp to RenovateJob project status to measure queue wait time.
Changes:
- Add ~23 new metrics in `metricStore`, with dual Prometheus + OTel emission for counters/histograms and Prometheus-only gauges.
- Extend scheduler and webhook paths to emit new scheduler/webhook/security metrics.
- Add `scheduledAt` to the RenovateJob CRD status and populate it when projects enter the Scheduled state, enabling queue-wait measurement.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/webhook/server.go | Adds webhook request/result metrics and auth-failure attribution for the schedule endpoint. |
| src/webhook/resolver.go | Introduces signatureWasUsed helper to attribute auth failures to signature vs token precedence. |
| src/webhook/gitlab.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for GitLab handler. |
| src/webhook/github.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for GitHub handler. |
| src/webhook/forgejo.go | Adds webhook metrics (accepted/rejected/ignored/decode/auth) for Forgejo handler. |
| src/ui/renovateController_test.go | Updates scheduler mock to match the new namespace/job-based scheduler API. |
| src/scheduler/scheduler.go | Changes scheduler API to namespace/job identifiers and emits scheduler run/next-run metrics. |
| src/scheduler/scheduler_test.go | Updates scheduler tests to match the new scheduler API. |
| src/metricStore/metrics.go | Adds new metric definitions, OTel mirrors, and helper functions for emitting metrics. |
| src/metricStore/metrics_test.go | Adds unit tests validating the new metric helpers and cleanup behavior. |
| src/internal/utils/projectStatus.go | Sets ScheduledAt when transitioning a project into Scheduled state. |
| src/internal/renovate/jobHelper.go | Extends job status helper to return numeric duration and adds failure-reason derivation. |
| src/internal/renovate/jobHelper_test.go | Updates tests for the extended getJobStatus return signature. |
| src/internal/renovate/executor.go | Emits saturation/queue/PR/log/job-lifecycle metrics and queue-wait measurement on dispatch. |
| src/internal/renovate/discoveryAgent.go | Emits discovery job completion/failure and discovered repo count metrics. |
| src/internal/crdManager/renovateJobManager.go | Records repository-filtering stats and secret-resolution error metrics. |
| src/gitProviderClients/projectFilter.go | Returns filter stats alongside filtered projects for metric emission. |
| src/gitProviderClients/projectFilter_test.go | Updates tests for new FilterProjects return signature. |
| src/controllers/renovatejob_controller.go | Updates scheduler calls to the new namespace/job-based API. |
| src/controllers/renovatejob_controller_test.go | Updates fake scheduler to match the new scheduler API. |
| src/api/v1alpha1/renovatejob_types.go | Adds scheduledAt to ProjectStatus and deep-copy logic. |
| docs/metrics.md | Documents the expanded metric set and provides example alert rules. |
| charts/renovate-operator/crd/renovate-operator.mogenius.com_renovatejobs.yaml | Updates CRD schema to include scheduledAt in status. |
Add ~33 operator-domain metrics on top of the existing five, grouped for two audiences: SRE (reliability, saturation, latency, outcomes) and SecOps/CISO (authn/authz, webhook integrity, secrets, transport posture).

Foundation (metricStore):
- Declare, register, and add typed helpers for all new metrics.
- Dual-emit counters and histograms to Prometheus and OpenTelemetry; gauges stay Prometheus-only, matching the existing gauge convention.
- Extend DeleteProjectMetrics to tear down all new per-project series.

Wiring:
- SRE: jobs dispatched, job duration, queue wait, failures by reason, saturation gauges + global parallelism limit, discovery counts, repositories filtered, schedule runs / next-run, log-issue counts.
- Outcomes: open PRs, PRs created/merged/updated, awaiting approval, repositories by status, last execution duration.
- SecOps: UI auth attempts/exchange/verification/state/session/loop/unauthenticated, authz decisions and group filtering, webhook request/signature/auth/decode, secret resolution errors, OIDC TLS-posture gauge, Git-provider request/latency/rate-limit/TLS and filter fail-open.

All security labels are bounded enums (no PII).

Schema/signature changes to unblock three metrics:
- Add ProjectStatus.ScheduledAt (CRD regenerated) to measure queue wait.
- FilterProjects returns FilterStats so the caller can label repositories filtered by namespace/job.
- Scheduler methods take namespace/job for correctly labeled schedule metrics; the internal key is unchanged.

Document the full catalog and example alerting rules in docs/metrics.md.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
…ility metrics

Remove the SecOps groups that aren't needed, keeping the SRE set plus webhook integrity and secret-resolution metrics:
- Removed (F) UI authentication, (G) authorization, (J) OIDC/Git-provider TLS posture, and (K) Git-provider API reliability metrics: declarations, OTel instruments, registrations, and helpers.
- Reverted the ui package and Git-provider client instrumentation; deleted requestMetrics.go. projectFilter.go keeps the FilterStats return used by the repositories_filtered metric but drops the fail-open counter.
- Kept: original five, plus job lifecycle (A), saturation (B), discovery (C), scheduler (D), log quality (E), outcomes (L), webhook integrity (H), and secret resolution (I).
- Updated docs/metrics.md and unit tests to match.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
Address Copilot review feedback on the repositories_by_status gauge:
- Add NormalizeRepositoryStatus to map raw Renovate result strings to a bounded enum (disabled/no_config/onboarding/onboarding_closed/unknown/other), so an arbitrary finished.Result can no longer grow the metric's cardinality.
- Reset the gauge each executor tick before repopulating, so a status whose count falls to zero (or a deleted job) no longer leaves a stale value.
- Fix the misleading SetRepositoriesByStatus doc comment and the docs row to list the actual bounded status set.
- Add tests for the normalizer and reset behavior.

Signed-off-by: Joakim Jarsäter <joakim@jarsater.com>
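The two fixes in this commit, bounding the label set and rebuilding the gauge on every tick, can be illustrated with a dependency-free sketch. The status names are the ones listed in the commit message. The mapping rules (lowercasing, hyphen-to-underscore, empty input becoming `unknown`) are assumptions, and a plain map stands in for the Prometheus gauge, so this is not the PR's actual NormalizeRepositoryStatus.

```go
package main

import (
	"fmt"
	"strings"
)

// knownStatuses holds the named members of the bounded enum from the commit message.
var knownStatuses = map[string]struct{}{
	"disabled":          {},
	"no_config":         {},
	"onboarding":        {},
	"onboarding_closed": {},
}

// normalizeRepositoryStatus maps an arbitrary raw result string onto the
// bounded set, so unexpected values cannot create new label values.
// The normalization rules below are assumptions for illustration.
func normalizeRepositoryStatus(raw string) string {
	s := strings.ToLower(strings.TrimSpace(raw))
	if s == "" {
		return "unknown"
	}
	s = strings.ReplaceAll(s, "-", "_")
	if _, ok := knownStatuses[s]; ok {
		return s
	}
	return "other"
}

// countByStatus rebuilds the counts from scratch on each call, which mirrors
// resetting the gauge every tick: a status that drops to zero simply
// disappears instead of keeping its previous value.
func countByStatus(results []string) map[string]int {
	counts := make(map[string]int)
	for _, r := range results {
		counts[normalizeRepositoryStatus(r)]++
	}
	return counts
}

func main() {
	// First tick: mixed results, including an empty and an unexpected value.
	fmt.Println(countByStatus([]string{"disabled", "No-Config", "", "some-new-result"}))
	// Second tick: only one repository remains, and no stale entries survive.
	fmt.Println(countByStatus([]string{"onboarding"}))
}
```

With a real Prometheus `GaugeVec`, the analogous step is calling `Reset()` on the vector before setting the fresh counts.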
Adds ~23 Prometheus/OTel metrics for SRE observability (job lifecycle, saturation, discovery, scheduler, PR outcomes) plus webhook-integrity and secret-resolution metrics.
Adds a `scheduledAt` field to the RenovateJob CRD status to measure queue wait.
`go build`/`vet`/`test` pass; docs in `docs/metrics.md`.