Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
256 changes: 256 additions & 0 deletions PROMETHEUS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
# Prometheus Metrics

DBLab Engine exposes Prometheus metrics via the `/metrics` endpoint. These metrics can be used to monitor the health and performance of the DBLab instance.

## Endpoint

```
GET /metrics
```

The endpoint is publicly accessible (no authentication required) and returns metrics in Prometheus text format.

## Available Metrics

### Instance Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_instance_info` | Gauge | `instance_id`, `version`, `edition` | Information about the DBLab instance (always 1) |
| `dblab_instance_uptime_seconds` | Gauge | - | Time in seconds since the DBLab instance started |
| `dblab_instance_status_code` | Gauge | - | Status code of the DBLab instance (0=OK, 1=Warning, 2=Bad) |
| `dblab_retrieval_status` | Gauge | `mode`, `status` | Status of data retrieval (1=active for status) |

### Disk/Pool Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_disk_total_bytes` | Gauge | `pool` | Total disk space in bytes |
| `dblab_disk_free_bytes` | Gauge | `pool` | Free disk space in bytes |
| `dblab_disk_used_bytes` | Gauge | `pool` | Used disk space in bytes |
| `dblab_disk_used_by_snapshots_bytes` | Gauge | `pool` | Disk space used by snapshots in bytes |
| `dblab_disk_used_by_clones_bytes` | Gauge | `pool` | Disk space used by clones in bytes |
| `dblab_disk_data_size_bytes` | Gauge | `pool` | Size of the data directory in bytes |
| `dblab_disk_compress_ratio` | Gauge | `pool` | Compression ratio of the filesystem (ZFS) |
| `dblab_pool_status` | Gauge | `pool`, `mode`, `status` | Status of the pool (1=active for status) |

### Clone Metrics (Aggregate)

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_clones_total` | Gauge | - | Total number of clones |
| `dblab_clones_by_status` | Gauge | `status` | Number of clones by status |
| `dblab_clone_max_age_seconds` | Gauge | - | Maximum age of any clone in seconds |
| `dblab_clone_total_diff_size_bytes` | Gauge | - | Total extra disk space used by all clones (sum of diffs from snapshots) |
| `dblab_clone_total_logical_size_bytes` | Gauge | - | Total logical size of all clone data |
| `dblab_clone_total_cpu_usage_percent` | Gauge | - | Total CPU usage percentage across all clone containers |
| `dblab_clone_avg_cpu_usage_percent` | Gauge | - | Average CPU usage percentage across all clone containers with valid data |
| `dblab_clone_total_memory_usage_bytes` | Gauge | - | Total memory usage in bytes across all clone containers |
| `dblab_clone_total_memory_limit_bytes` | Gauge | - | Total memory limit in bytes across all clone containers |
| `dblab_clone_protected_count` | Gauge | - | Number of protected clones |

### Snapshot Metrics (Aggregate)

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_snapshots_total` | Gauge | - | Total number of snapshots |
| `dblab_snapshots_by_pool` | Gauge | `pool` | Number of snapshots by pool |
| `dblab_snapshot_max_age_seconds` | Gauge | - | Maximum age of any snapshot in seconds |
| `dblab_snapshot_total_physical_size_bytes` | Gauge | - | Total physical disk space used by all snapshots |
| `dblab_snapshot_total_logical_size_bytes` | Gauge | - | Total logical size of all snapshot data |
| `dblab_snapshot_max_data_lag_seconds` | Gauge | - | Maximum data lag of any snapshot in seconds |
| `dblab_snapshot_total_num_clones` | Gauge | - | Total number of clones across all snapshots |

### Branch Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_branches_total` | Gauge | - | Total number of branches |

### Dataset Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_datasets_total` | Gauge | `pool` | Total number of datasets (slots) in the pool |
| `dblab_datasets_available` | Gauge | `pool` | Number of available (non-busy) dataset slots for reuse |

### Observability Metrics

These metrics help monitor the health of the metrics collection system itself.

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_scrape_success_timestamp` | Gauge | - | Unix timestamp of last successful metrics collection |
| `dblab_scrape_duration_seconds` | Gauge | - | Duration of last metrics collection in seconds |
| `dblab_scrape_errors_total` | Counter | - | Total number of errors during metrics collection |

## Prometheus Configuration

Add the following to your `prometheus.yml`:

```yaml
scrape_configs:
- job_name: 'dblab'
static_configs:
- targets: ['<dblab-host>:<dblab-port>']
metrics_path: /metrics
```

## Example Queries

### Free Disk Space Percentage

```promql
100 * dblab_disk_free_bytes / dblab_disk_total_bytes
```

### Number of Active Clones

```promql
dblab_clones_total
```

### Maximum Clone Age in Hours

```promql
dblab_clone_max_age_seconds / 3600
```

### Data Freshness (lag from current time)

```promql
dblab_snapshot_max_data_lag_seconds / 60
```

### Total Memory Usage Across All Clones

```promql
dblab_clone_total_memory_usage_bytes
```

### Average CPU Usage Across All Clones

```promql
dblab_clone_avg_cpu_usage_percent
```

### Clones by Status

```promql
dblab_clones_by_status
```

### Metrics Collection Health

```promql
time() - dblab_scrape_success_timestamp
```

## Alerting Examples

### Low Disk Space Alert

```yaml
- alert: DBLabLowDiskSpace
expr: (dblab_disk_free_bytes / dblab_disk_total_bytes) * 100 < 20
for: 5m
labels:
severity: warning
annotations:
summary: "DBLab low disk space"
description: "DBLab pool {{ $labels.pool }} has less than 20% free disk space"
```

### Stale Snapshot Alert

```yaml
- alert: DBLabStaleSnapshot
expr: dblab_snapshot_max_data_lag_seconds > 86400
for: 10m
labels:
severity: warning
annotations:
summary: "DBLab snapshot data is stale"
description: "DBLab snapshot data is more than 24 hours old"
```

### High Clone Count Alert

```yaml
- alert: DBLabHighCloneCount
expr: dblab_clones_total > 50
for: 5m
labels:
severity: warning
annotations:
summary: "DBLab has many clones"
description: "DBLab has {{ $value }} clones running"
```

### Metrics Collection Stale Alert

```yaml
- alert: DBLabMetricsStale
expr: time() - dblab_scrape_success_timestamp > 300
for: 5m
labels:
severity: warning
annotations:
summary: "DBLab metrics collection is stale"
description: "DBLab metrics have not been updated for more than 5 minutes"
```

## OpenTelemetry Integration

DBLab metrics can be exported to OpenTelemetry-compatible backends using the OpenTelemetry Collector. This allows you to send metrics to Grafana Cloud, Datadog, New Relic, and other observability platforms.

### Quick Start

1. Install the OpenTelemetry Collector:
```bash
# Using Docker
docker pull otel/opentelemetry-collector-contrib:latest
```

2. Copy the example configuration:
```bash
cp engine/configs/otel-collector.example.yml otel-collector.yml
```

3. Edit `otel-collector.yml` to configure your backend:
```yaml
exporters:
otlp:
endpoint: "your-otlp-endpoint:4317"
headers:
Authorization: "Bearer <your-token>"
```

4. Run the collector:
```bash
docker run -v $(pwd)/otel-collector.yml:/etc/otelcol/config.yaml \
-p 4317:4317 -p 8889:8889 \
otel/opentelemetry-collector-contrib:latest
```

### Architecture

```
┌─────────────┐ scrape ┌──────────────────┐ OTLP ┌─────────────┐
│ DBLab │ ──────────────► │ OTel Collector │ ──────────────► │ Backend │
│ /metrics │ :2345 │ │ :4317 │ (Grafana, │
└─────────────┘ └──────────────────┘ │ Datadog) │
└─────────────┘
```

### Supported Backends

The OTel Collector can export to:
- **Grafana Cloud** - Use OTLP exporter with Grafana Cloud endpoint
- **Datadog** - Use the datadog exporter
- **New Relic** - Use OTLP exporter with New Relic endpoint
- **Prometheus Remote Write** - Use prometheusremotewrite exporter
- **AWS CloudWatch** - Use awsemf exporter
- **Any OTLP-compatible backend**

See `engine/configs/otel-collector.example.yml` for a complete configuration example.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,7 @@ Read more:
- Resource quotas: CPU, RAM
- Monitoring & security
- `/healthz` API endpoint (no auth), extended `/status` endpoint ([API docs](https://api.dblab.dev))
- Prometheus metrics endpoint (`/metrics`) for monitoring
- Netdata module for insights

## How to contribute
Expand Down
3 changes: 3 additions & 0 deletions engine/.gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,9 @@ build-binary-client-dev:
build-binary-client-master:
<<: *only_master
stage: build-binary
artifacts:
paths:
- engine/bin
script:
- cd engine
- make build-client
Expand Down
34 changes: 34 additions & 0 deletions engine/api/swagger-spec/dblab_openapi.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1516,6 +1516,16 @@ components:
type: array
items:
$ref: '#/components/schemas/Clone'
protectionLeaseDurationMinutes:
type: integer
format: int64
minimum: 0
description: Default protection lease duration in minutes
protectionMaxDurationMinutes:
type: integer
format: int64
minimum: 0
description: Maximum allowed protection duration in minutes
Retrieving:
type: object
properties:
Expand Down Expand Up @@ -1631,6 +1641,10 @@ components:
protected:
type: boolean
default: false
protectedTill:
type: string
format: date-time
description: Time when the protection lease expires
deleteAt:
type: string
format: date-time
Expand Down Expand Up @@ -1658,6 +1672,16 @@ components:
maxIdleMinutes:
type: integer
format: int64
protectionLeaseDurationMinutes:
type: integer
format: int64
minimum: 0
description: Protection lease duration in minutes
protectionMaxDurationMinutes:
type: integer
format: int64
minimum: 0
description: Maximum allowed protection duration in minutes
CreateClone:
type: object
properties:
Expand All @@ -1673,6 +1697,11 @@ components:
protected:
type: boolean
default:
protectionDurationMinutes:
type: integer
format: int64
minimum: 0
description: Protection duration in minutes. 0 means forever, omit for default duration.
db:
type: object
properties:
Expand Down Expand Up @@ -1703,6 +1732,11 @@ components:
protected:
type: boolean
default: false
protectionDurationMinutes:
type: integer
format: int64
minimum: 0
description: Protection duration in minutes. 0 means forever, omit for default duration.
StartObservationRequest:
type: object
properties:
Expand Down
Loading
Loading