# Metrics Instrumentation

This document describes the metrics instrumentation available in amp for observability and monitoring.

## Overview

Amp provides Prometheus-compatible metrics for monitoring data loading operations. The metrics module offers:

- **Low overhead** instrumentation with optional prometheus_client dependency
- **Graceful degradation** when prometheus_client is not installed
- **Consistent naming** following Prometheus conventions
- **Thread-safe** singleton implementation

## Installation

The metrics module works without prometheus_client (using no-op metrics), but to enable actual metric collection:

```bash
# Install with metrics support
pip install amp[metrics]

# Or install prometheus_client directly
pip install prometheus-client
```

## Quick Start

```python
from amp.metrics import get_metrics, start_metrics_server

# Get the global metrics instance
metrics = get_metrics()

# Start HTTP server on port 8000 for Prometheus scraping
start_metrics_server(port=8000)

# Record metrics in your code
metrics.records_processed.labels(
    loader='postgresql',
    table='users',
    connection='default'
).inc(1000)

metrics.processing_latency.labels(
    loader='postgresql',
    operation='load_batch'
).observe(0.5)
```

## Available Metrics

### Counters

| Metric | Labels | Description |
|--------|--------|-------------|
| `amp_records_processed_total` | loader, table, connection | Total records processed |
| `amp_errors_total` | loader, error_type, table | Total errors by type |
| `amp_bytes_processed_total` | loader, table | Total bytes processed |
| `amp_reorg_events_total` | loader, network, table | Blockchain reorg events |
| `amp_retry_attempts_total` | loader, operation, reason | Retry attempts |

### Histograms

| Metric | Labels | Description |
|--------|--------|-------------|
| `amp_processing_latency_seconds` | loader, operation | Processing time distribution |
| `amp_batch_size_records` | loader, table | Batch size distribution |

### Gauges

| Metric | Labels | Description |
|--------|--------|-------------|
| `amp_active_connections` | loader, target | Current active connections |
| `amp_queue_depth` | queue_name | Current queue depth |

### Info

| Metric | Labels | Description |
|--------|--------|-------------|
| `amp_build_info` | (various) | Build/version information |

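
The Quick Start above already exercises the records counter and the latency histogram; the other metric types are recorded through the same `labels(...)` pattern. The sketch below is illustrative only: the Python attribute names (`bytes_processed`, `batch_size`, `active_connections`, `queue_depth`) are assumed to mirror the metric names in the tables, so check the `amp.metrics` module for the exact attributes.

```python
from amp.metrics import get_metrics

metrics = get_metrics()

# NOTE: attribute names below are assumed from the naming pattern shown in Quick Start.

# Counter: track payload volume alongside record counts
metrics.bytes_processed.labels(loader='postgresql', table='users').inc(4096)

# Histogram: record the size of each batch as it is loaded
metrics.batch_size.labels(loader='postgresql', table='users').observe(1000)

# Gauges: move values up and down, or set them outright
metrics.active_connections.labels(loader='postgresql', target='primary').inc()
metrics.queue_depth.labels(queue_name='ingest').set(42)
metrics.active_connections.labels(loader='postgresql', target='primary').dec()
```
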
## Context Manager for Operations

The `track_operation` context manager simplifies instrumentation:

```python
from amp.metrics import get_metrics

metrics = get_metrics()

with metrics.track_operation('postgresql', 'load_batch', table='users') as ctx:
    # Your loading code here
    rows_loaded = load_data(batch)

    # Set context for automatic metric recording
    ctx['records'] = rows_loaded
    ctx['bytes'] = batch.nbytes

# Metrics are automatically recorded:
# - processing_latency is observed
# - records_processed is incremented
# - bytes_processed is incremented
# - errors are recorded if an exception occurs
```

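
Conceptually, the context manager automates the manual instrumentation you would otherwise write by hand. The sketch below only illustrates the behaviour listed in the comments above, not amp's actual implementation; the error-counter attribute name (`errors`) and the `connection='default'` label are assumptions, and `load_fn`/`batch` are placeholders.

```python
import time

from amp.metrics import get_metrics

metrics = get_metrics()

def load_batch_with_metrics(batch, load_fn):
    """Roughly what track_operation('postgresql', 'load_batch', table='users') automates.

    Illustrative only -- the attribute and label names here are assumptions,
    not amp's internals.
    """
    start = time.perf_counter()
    try:
        rows_loaded = load_fn(batch)
        metrics.records_processed.labels(
            loader='postgresql', table='users', connection='default'
        ).inc(rows_loaded)
        metrics.bytes_processed.labels(loader='postgresql', table='users').inc(batch.nbytes)
        return rows_loaded
    except Exception as exc:
        # Count the failure by exception type
        metrics.errors.labels(
            loader='postgresql', error_type=type(exc).__name__, table='users'
        ).inc()
        raise
    finally:
        # Always record how long the operation took
        metrics.processing_latency.labels(
            loader='postgresql', operation='load_batch'
        ).observe(time.perf_counter() - start)
```
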
## Configuration

Customize metrics collection with `MetricsConfig`:

```python
from amp.metrics import get_metrics, MetricsConfig

config = MetricsConfig(
    enabled=True,                    # Enable/disable all metrics
    namespace='amp',                 # Metric name prefix
    subsystem='loader',              # Optional subsystem name
    default_labels={'env': 'prod'},  # Default labels for all metrics
    histogram_buckets=(              # Custom latency buckets
        0.001, 0.005, 0.01, 0.025, 0.05,
        0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0
    ),
)

metrics = get_metrics(config)
```

## Prometheus Integration

### HTTP Endpoint

Start a metrics server for Prometheus scraping:

```python
from amp.metrics import start_metrics_server

# Start on default port 8000
start_metrics_server()

# Or specify custom port and address
start_metrics_server(port=9090, addr='0.0.0.0')
```

### Generate Metrics Text

Generate metrics in Prometheus text format for custom export:

```python
from flask import Flask, Response

from amp.metrics import generate_metrics_text

app = Flask(__name__)

# Get metrics as bytes in the Prometheus text format
metrics_text = generate_metrics_text()

# Use in your HTTP handler (a Flask app is shown here)
@app.route('/metrics')
def metrics_endpoint():
    return Response(generate_metrics_text(), mimetype='text/plain')
```

### Example Prometheus Config

```yaml
scrape_configs:
  - job_name: 'amp'
    static_configs:
      - targets: ['localhost:8000']
    scrape_interval: 15s
```

## Grafana Dashboard

Example queries for a Grafana dashboard:

```promql
# Records processed rate (per second)
rate(amp_records_processed_total[5m])

# P99 latency
histogram_quantile(0.99, rate(amp_processing_latency_seconds_bucket[5m]))

# Error rate percentage
rate(amp_errors_total[5m]) / rate(amp_records_processed_total[5m]) * 100

# Active connections by loader
amp_active_connections

# Average batch size
rate(amp_batch_size_records_sum[5m]) / rate(amp_batch_size_records_count[5m])
```

## Graceful Degradation

When prometheus_client is not installed, the metrics module uses no-op implementations that silently accept all operations:

```python
from amp.metrics import get_metrics, is_prometheus_available

if is_prometheus_available():
    print("Prometheus metrics enabled")
else:
    print("Metrics disabled - install prometheus-client to enable")

# Code works the same either way
metrics = get_metrics()
metrics.records_processed.labels(loader='test', table='t', connection='c').inc(100)
```

## Testing

For testing, you can reset the metrics singleton:

```python
from amp.metrics import AmpMetrics, get_metrics

def test_my_loader():
    # Reset before test
    AmpMetrics.reset_instance()

    # Run test with fresh metrics
    metrics = get_metrics()
    # ...

    # Clean up after test
    AmpMetrics.reset_instance()
```

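
If you use pytest, the reset can live in a fixture so every test starts from a clean singleton. A minimal sketch, assuming only the `AmpMetrics.reset_instance()` and `get_metrics()` calls shown above:

```python
import pytest

from amp.metrics import AmpMetrics, get_metrics

@pytest.fixture
def fresh_metrics():
    # Reset the singleton before the test, hand out a fresh instance,
    # then reset again afterwards so tests stay independent.
    AmpMetrics.reset_instance()
    yield get_metrics()
    AmpMetrics.reset_instance()

def test_my_loader(fresh_metrics):
    fresh_metrics.records_processed.labels(
        loader='test', table='t', connection='c'
    ).inc(100)
    # ... assertions against your loader go here
```
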
## Best Practices

1. **Use consistent labels** - Keep label values consistent across your codebase
2. **Avoid high cardinality** - Don't use user IDs or request IDs as labels (see the sketch after this list)
3. **Use track_operation** - Prefer the context manager for automatic error handling
4. **Set up alerts** - Configure Prometheus alerts for error rates and latency
5. **Dashboard first** - Design your metrics around what you want to see in dashboards
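
To make points 1 and 2 concrete, the sketch below contrasts a high-cardinality labelling mistake with a bounded alternative. It reuses the counter from the Quick Start; the per-user table name is a hypothetical example, not part of amp.

```python
from amp.metrics import get_metrics

metrics = get_metrics()

# Bad: embedding a per-user value in a label creates a new time series for
# every user, which bloats Prometheus memory and slows queries.
# metrics.records_processed.labels(
#     loader='postgresql', table=f'users_{user_id}', connection='default'
# ).inc(1)

# Good: keep label values from a small, fixed set and put per-record
# detail in logs or traces instead.
metrics.records_processed.labels(
    loader='postgresql', table='users', connection='default'
).inc(1)
```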