Commit 92b78aa
Reorganize layer cache Grafana dashboard for improved usability

Complete overhaul of the layer cache metrics dashboard to improve
readability, add missing operational metrics, and organize panels by
workflow.
Template Variables:
- Add an operation template for filtering persistence latency metrics
- Convert the backend template from hardcoded values to a dynamic query
that discovers all backends (foyer, postgres, s3) automatically
- Expand the service template to include read-only services (edda) using
a union query across multiple metrics
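A rough sketch of how such template variables could be generated, assuming a Prometheus data source (the commit does not name one) and assuming the metrics carry `backend` and `service` labels. The helper names and `example_read_only_metric` are hypothetical; only `layer_cache_request_latency_ms` appears in this commit.

```python
def label_values_variable(name, metric, label):
    """Template variable filled from the label values of a single metric."""
    return {
        "name": name,
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "refresh": 1,  # 1 = re-run the query on dashboard load
        "includeAll": True,
        "multi": True,
    }


def union_variable(name, metrics, label):
    """Template variable covering one label across several metrics.

    PromQL `or` keeps every series from the left side plus any series
    from the right side whose label set is not already present, so
    chaining it yields the union of label values.
    """
    union = " or ".join(f"count by ({label}) ({m})" for m in metrics)
    return {
        "name": name,
        "type": "query",
        "query": f"query_result({union})",
        # query_result returns strings such as {service="edda"} 1 1700000000000;
        # the regex keeps only the label value.
        "regex": f'/{label}="([^"]+)"/',
        "refresh": 1,
        "includeAll": True,
        "multi": True,
    }


backend_var = label_values_variable(
    "backend", "layer_cache_request_latency_ms", "backend"
)
service_var = union_variable(
    "service",
    # second metric name is a placeholder, not taken from the codebase
    ["layer_cache_request_latency_ms", "example_read_only_metric"],
    "service",
)
```

With this approach a new backend or service shows up in the dropdown as soon as it emits any of the listed metrics, with no dashboard edit required.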
Panel Organization:
Restructured 42 panels into 5 logical sections matching operational
workflows:
1. OVERVIEW & HEALTH - System health scorecard with error rates, queue
depths, resolution rates, and write throughput balance
2. READ PATH PERFORMANCE - Request latency, backend latency comparison,
cache miss rates, fallback tracking
3. WRITE PATH - S3 PERSISTENCE - Queue depth, backoff state, write
throughput, error breakdown, dead letter queue
4. WRITE PATH - POSTGRES PERSISTENCE - Persister operations,
persistence latency, retry queue depth and rates
5. EVICTION OPERATIONS - Eviction rates, failures, latency, and
memory-only evictions
Readability Improvements:
- All p50/p95/p99 panels now show exactly 3 data series using panel
repeat and aggregation:
  - Panel 26 (Persistence Latency): Repeats by operation, aggregates
    other dimensions
  - Panel 27 (Backend Read Latency): Aggregates across caches
  - Panel 28 (Request Latency): Repeats by cache
  - Panel 29 (Backend Hit Latency): Repeats by cache, shows all backends
    together, removed miss series
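The repeat-plus-aggregation pattern can be sketched as follows. This is an illustration only: it assumes the latency metrics are Prometheus histograms exposing `_bucket` series and that the repeat dimension is a label named like the template variable (for example `cache`). Neither is confirmed by this commit, and the helper names are hypothetical.

```python
def quantile_targets(metric, filters="", quantiles=(0.50, 0.95, 0.99),
                     window="$__rate_interval"):
    """Build one query per quantile, aggregating away every label except `le`.

    Summing by `le` alone collapses all other dimensions, so each
    quantile produces exactly one series.
    """
    selector = f"{metric}_bucket{{{filters}}}" if filters else f"{metric}_bucket"
    targets = []
    for q in quantiles:
        targets.append({
            "expr": (
                f"histogram_quantile({q}, "
                f"sum by (le) (rate({selector}[{window}])))"
            ),
            "legendFormat": f"p{round(q * 100)}",
        })
    return targets


def repeated_latency_panel(title, metric, repeat_var):
    """Timeseries panel that Grafana repeats once per value of `repeat_var`."""
    return {
        "type": "timeseries",
        "title": f"{title} (${repeat_var})",
        "repeat": repeat_var,
        "repeatDirection": "h",
        # Filter on the repeated variable so each copy shows one value only.
        "targets": quantile_targets(metric, f'{repeat_var}=~"${repeat_var}"'),
    }


panel = repeated_latency_panel(
    "Request Latency", "layer_cache_request_latency_ms", "cache"
)
```

Each repeated copy then carries three series (p50, p95, p99) regardless of how many other label combinations exist.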
New Metrics (8 panels):
- System Health Scorecard: Overall error rate, S3 queue depth, retry
queue depth with color-coded thresholds
- Fallback Tracking: S3→PostgreSQL fallback rate indicating data
availability issues
- S3 Write Error Breakdown: Stacked timeseries by error category
- S3 Dead Letter Queue Depth: Corrupted write tracking
- Write Throughput Balance: Compares S3 vs PostgreSQL write rates for
dual-write monitoring
- Memory-Only Evictions: Tracks Foyer memory evictions without disk
persistence
- Overall Cache Miss Rate: Single aggregated panel showing complete
misses per cache
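For the color-coded scorecard panels, a minimal sketch of a Grafana stat panel with absolute thresholds is shown here. The metric name `example_s3_queue_depth` and the threshold values 100 and 1000 are placeholders chosen for illustration, not the values used in the dashboard.

```python
def stat_panel(title, expr, warn, crit, unit="short"):
    """Stat panel that turns yellow at `warn` and red at `crit`."""
    return {
        "type": "stat",
        "title": title,
        "targets": [{"expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        # Base step has no lower bound (serialized as null).
                        {"color": "green", "value": None},
                        {"color": "yellow", "value": warn},
                        {"color": "red", "value": crit},
                    ],
                },
            }
        },
        "options": {"colorMode": "background"},
    }


def threshold_color(panel, value):
    """Return the color of the highest step whose bound `value` reaches."""
    steps = panel["fieldConfig"]["defaults"]["thresholds"]["steps"]
    color = None
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


queue_depth = stat_panel(
    "S3 Queue Depth", "sum(example_s3_queue_depth)", warn=100, crit=1000
)
```

The `threshold_color` helper is not part of Grafana; it only mirrors how the steps are evaluated so the chosen bounds can be sanity-checked offline.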
Removed Redundant Panels (3):
- Retry Queue Success vs Failure Ratio (data visible in other panels)
- S3 Write Failures by Error Category (replaced by comprehensive
breakdown)
- Persister Retry Operations (superseded by specific retry queue
panels)
Technical Details:
- Overall cache miss uses layer_cache_request_latency_ms metric
(verified in layer_cache.rs:347-351) tracking complete misses across
all layers
- All backends explicitly report both hit and miss metrics (verified in
layer_cache.rs:154-308)
- Template union queries ensure all backends and services appear
dynamically

Parent commit: d525e65
1 file changed (+1016 -952 lines) under dev/config/grafana/provisioning/dashboards