@adityamaru adityamaru commented Nov 16, 2025

Summary

Add metrics tracking for BoltDB integrity check failures (failures only) so that database corruption issues can be monitored.

Changes

  • Added reportIntegrityCheckFailure function in reporter.ts that sends failure metrics to the FA agent's internal metrics endpoint
  • Integrated metric reporting into checkBoltDbIntegrity function in main.ts
  • Only actual failures are reported (not successes, timeouts, or OOM cases)
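
To illustrate the "failures only" rule above, here is a minimal sketch of how check outcomes could be filtered before any metric is sent. The `CheckOutcome` and `CheckResult` types and both helper names are hypothetical and are not taken from main.ts; only the rule itself (report failures, skip successes, timeouts, and OOM cases) comes from this PR.

```typescript
// Hypothetical outcome type: the PR states which outcomes are reported,
// not how main.ts represents them internally.
type CheckOutcome = "passed" | "failed" | "timeout" | "oom";

interface CheckResult {
  dbFile: string; // e.g. "history.db" or "cache.db"
  outcome: CheckOutcome;
}

// Only genuine integrity failures produce a metric. Timeouts and OOM kills
// mean the check could not finish, which says nothing about corruption.
function shouldReportFailure(result: CheckResult): boolean {
  return result.outcome === "failed";
}

// Returns the database files that should each emit one failure metric.
function filesToReport(results: CheckResult[]): string[] {
  return results.filter(shouldReportFailure).map((r) => r.dbFile);
}
```

Keeping the decision in one small predicate makes it easy to see that an inconclusive check never inflates the corruption count.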

Metric Details

The metric is sent to the FA agent's /internal endpoint and forwarded to Grafana via OpenTelemetry with a single attribute:

  • database_file: history.db or cache.db

Metric name: boltdb_integrity_check_failure
Value: Always 1 (for each failure)
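
As a rough sketch of the metric described above, the snippet below builds a payload with the stated name, value, and single attribute, then hands it to an injected sender. The payload shape, the `PostFn` type, and the `Sketch` function name are assumptions made for illustration; the agent's actual /internal request format and the real signature in reporter.ts are not shown in this PR.

```typescript
// Hypothetical payload shape. Only the metric name, the value of 1, and the
// database_file attribute come from the PR description.
interface IntegrityFailureMetric {
  name: "boltdb_integrity_check_failure";
  value: 1;
  attributes: { database_file: string };
}

function buildIntegrityFailureMetric(dbFile: string): IntegrityFailureMetric {
  return {
    name: "boltdb_integrity_check_failure",
    value: 1,
    attributes: { database_file: dbFile },
  };
}

// Hypothetical transport: the caller supplies whatever HTTP client it uses.
type PostFn = (path: string, body: string) => Promise<void>;

// Sends one failure metric and resolves to false instead of throwing.
// Treating reporting as best-effort is an assumption of this sketch.
async function reportIntegrityCheckFailureSketch(
  dbFile: string,
  post: PostFn,
): Promise<boolean> {
  try {
    await post("/internal", JSON.stringify(buildIntegrityFailureMetric(dbFile)));
    return true;
  } catch (_err) {
    return false;
  }
}
```

Injecting the sender keeps the sketch independent of any particular HTTP library and lets the payload be inspected without a running agent.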

Related PRs

Related Issue

Relates to BLA-2024: stickydisk,docker: dig into continued sticky disk failure

This provides clean visibility into which specific database files (history.db vs cache.db) are experiencing corruption issues.


Note

Adds BoltDB integrity-check failure reporting to the agent, runs checks before startup and during cleanup, and includes filesystem usage in sticky disk commits.

  • Integrity checks (BoltDB):
    • Scan *.db under /var/lib/buildkit; run bbolt check with memory/time limits; log durations and sizes.
    • Treat timeouts/OOM as non-fail; report actual failures via reportIntegrityCheckFailure (sends boltdb_integrity_check_failure with database_file).
    • Execute checks pre-start and again during cleanup (before unmount); then logDatabaseHashes.
  • Metrics & reporting:
    • New reportIntegrityCheckFailure(dbFile) in reporter.ts posts to agent /internal endpoint.
    • Continue reporting existing setup timings; improved debug/warnings.
  • Cleanup/commit enhancements:
    • Capture filesystem usage via df and pass fsDiskUsageBytes to commitStickyDisk (only when valid).
    • More robust unmount retries and process checks.
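
The "only when valid" handling of filesystem usage in the cleanup bullet above could look roughly like this. The sketch assumes df is invoked as `df -B1 --output=used <path>` (GNU coreutils), which prints a header line followed by the used byte count. The exact df flags used by the PR, both helper names, and the options object shape are assumptions; only the fsDiskUsageBytes field name comes from the summary.

```typescript
// Parses the stdout of `df -B1 --output=used <path>`: a header line followed
// by one line containing the used byte count. Returns undefined if the
// output does not match that shape.
function parseDfUsedBytes(dfOutput: string): number | undefined {
  const lines = dfOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length < 2) return undefined;

  const raw = lines[lines.length - 1] ?? "";
  if (!/^\d+$/.test(raw)) return undefined;

  const bytes = Number(raw);
  return Number.isSafeInteger(bytes) && bytes > 0 ? bytes : undefined;
}

// Builds a hypothetical options object for the commit call, leaving
// fsDiskUsageBytes out entirely when the df output could not be parsed.
function buildCommitOptions(dfOutput: string): { fsDiskUsageBytes?: number } {
  const used = parseDfUsedBytes(dfOutput);
  return used === undefined ? {} : { fsDiskUsageBytes: used };
}
```

Omitting the field rather than sending 0 lets the receiving side tell "unknown usage" apart from "empty disk".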

Written by Cursor Bugbot for commit 09980d2. This will update automatically on new commits.

linear bot commented Nov 16, 2025

@aayushshah15 aayushshah15 left a comment

And then we should turn that debug into a warning.

Add metrics tracking for BoltDB integrity check results to monitor
how many organizations are experiencing database corruption issues.

This metric is sent to the FA agent's internal metrics endpoint and
forwarded to Grafana via OpenTelemetry with the following attributes:
- database_file: history.db or cache.db
- result: passed, failed, timeout, or oom
- repo: repository name
- region: blacksmith region
- installation_id: organization installation ID
- duration_ms: check duration

The metric helps us understand the prevalence and nature of BoltDB
integrity issues across our customer base without relying on logs.

@adityamaru adityamaru merged commit 0ba9400 into main Nov 17, 2025
10 checks passed