Commit 92b78aa

Reorganize layer cache Grafana dashboard for improved usability

Complete overhaul of the layer cache metrics dashboard to improve readability, add missing operational metrics, and organize panels by workflow.

Template Variables:
- Add operation template for filtering persistence latency metrics
- Convert backend template from hardcoded to dynamic query discovering all backends (foyer, postgres, s3) automatically
- Expand service template to include read-only services (edda) using union query across multiple metrics

Panel Organization:
Restructured 42 panels into 5 logical sections matching operational workflows:
1. OVERVIEW & HEALTH - System health scorecard with error rates, queue depths, resolution rates, and write throughput balance
2. READ PATH PERFORMANCE - Request latency, backend latency comparison, cache miss rates, fallback tracking
3. WRITE PATH - S3 PERSISTENCE - Queue depth, backoff state, write throughput, error breakdown, dead letter queue
4. WRITE PATH - POSTGRES PERSISTENCE - Persister operations, persistence latency, retry queue depth and rates
5. EVICTION OPERATIONS - Eviction rates, failures, latency, and memory-only evictions

Readability Improvements:
- All p50/p95/p99 panels now show exactly 3 data series using panel repeat and aggregation
- Panel 26 (Persistence Latency): Repeats by operation, aggregates other dimensions
- Panel 27 (Backend Read Latency): Aggregates across caches
- Panel 28 (Request Latency): Repeats by cache
- Panel 29 (Backend Hit Latency): Repeats by cache, shows all backends together, removed miss series

New Metrics (8 panels):
- System Health Scorecard: Overall error rate, S3 queue depth, retry queue depth with color-coded thresholds
- Fallback Tracking: S3→PostgreSQL fallback rate indicating data availability issues
- S3 Write Error Breakdown: Stacked timeseries by error category
- S3 Dead Letter Queue Depth: Corrupted write tracking
- Write Throughput Balance: Compares S3 vs PostgreSQL write rates for dual-write monitoring
- Memory-Only Evictions: Tracks Foyer memory evictions without disk persistence
- Overall Cache Miss Rate: Single aggregated panel showing complete misses per cache

Removed Redundant Panels (3):
- Retry Queue Success vs Failure Ratio (data visible in other panels)
- S3 Write Failures by Error Category (replaced by comprehensive breakdown)
- Persister Retry Operations (superseded by specific retry queue panels)

Technical Details:
- Overall cache miss uses layer_cache_request_latency_ms metric (verified in layer_cache.rs:347-351) tracking complete misses across all layers
- All backends explicitly report both hit and miss metrics (verified in layer_cache.rs:154-308)
- Template union queries ensure all backends and services appear dynamically
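
As a rough illustration of the "exactly 3 data series" and "union query" ideas above, the sketch below generates the kind of query strings such panels and template variables might use. It is not taken from the dashboard JSON. It assumes a Prometheus datasource, histogram metrics exposed with a `_bucket` suffix, and label names `cache` and `service`; the second metric name in the usage note is a placeholder, and the helper names `quantile_queries` and `union_label_values` are hypothetical.

```python
# Hypothetical helpers; only layer_cache_request_latency_ms comes from the
# commit message. Label names and the _bucket suffix are assumptions.
QUANTILES = {"p50": 0.50, "p95": 0.95, "p99": 0.99}


def quantile_queries(metric, repeat_label, repeat_var):
    """Build one PromQL expression per percentile.

    Summing by `le` only collapses every other dimension, so each panel
    shows exactly three series. The panel is repeated over `repeat_var`,
    which filters on `repeat_label`.
    """
    queries = {}
    for name, q in QUANTILES.items():
        queries[name] = (
            f'histogram_quantile({q}, '
            f'sum by (le) (rate({metric}_bucket'
            f'{{{repeat_label}="${repeat_var}"}}[$__rate_interval])))'
        )
    return queries


def union_label_values(metrics, label):
    """Build a Grafana template query listing `label` values seen on any
    of the given metrics, via a regex match on the metric name."""
    selector = "|".join(metrics)
    return f'label_values({{__name__=~"{selector}"}}, {label})'
```

Calling `quantile_queries("layer_cache_request_latency_ms", "cache", "cache")` returns three expressions, one per percentile, each filtered to the repeated `$cache` value. Calling `union_label_values` with that metric plus a second, placeholder metric name and the label `service` returns a single template query covering both, which is one way services that emit only some of the metrics could still appear in the dropdown.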

1 parent d525e65 commit 92b78aa

1 file changed: +1016 -952 lines changed

dev/config/grafana/provisioning/dashboards
- layer-cache-metrics.json
