@yairst (Contributor) commented Oct 18, 2025

Overview

This PR implements dynamic burn rate alerting that adapts alert thresholds to actual traffic patterns, preventing false positives during low-traffic periods and false negatives during high-traffic periods.

Motivation

Traditional static burn rate multipliers (14x, 7x, 2x, 1x) don't account for traffic variations, leading to:

  • False positives during low traffic (a handful of errors can push the error rate over the threshold because of small sample sizes)
  • False negatives during high traffic (a large absolute number of errors can stay below the rate threshold and go undetected)

Dynamic burn rates solve this by calculating thresholds that maintain consistent absolute error budget consumption regardless of traffic volume:

dynamic_threshold = (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Key Insight: This formula ensures alerts fire at the same absolute number of errors regardless of traffic. The threshold percentage adapts to traffic: lower during high traffic, higher during low traffic, but always requiring the same absolute error count.
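This insight can be sanity-checked in a few lines of plain Python (an illustrative sketch only; the actual implementation lives in the Go backend):

```python
def dynamic_threshold(n_slo, n_alert, e_budget_percent, slo_target):
    """(N_SLO / N_alert) * E_budget_percent * (1 - SLO_target)"""
    return (n_slo / n_alert) * e_budget_percent * (1 - slo_target)

# The error-rate threshold adapts to traffic over the alert window...
t_high = dynamic_threshold(1_000_000, 10_000, 0.02, 0.99)  # ~2% at high traffic
t_low = dynamic_threshold(1_000_000, 1_000, 0.02, 0.99)    # ~20% at low traffic

# ...but the absolute error count required to fire stays constant:
assert abs(10_000 * t_high - 1_000 * t_low) < 1e-6  # both ~200 errors
```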

This methodology is based on my blog post, "Error Budget Is All You Need - Part 2".

Implementation Summary

Backend Changes

Core Implementation (slo/rules.go):

  • Added buildDynamicAlertExpr() method implementing traffic-aware threshold calculation
  • Enhanced Burnrates() method to route between static and dynamic expressions
  • Integrated dynamic window logic with proper E_budget_percent mapping (1/48, 1/16, 1/14, 1/7)
  • Multi-window consistency: Both short and long windows use N_long for traffic scaling

CRD Changes (kubernetes/api/v1alpha1/servicelevelobjective_types.go):

  • Added BurnRateType field to SLO spec (values: "static", "dynamic")
  • Default: "static" (preserves existing behavior)
  • Backward compatible: Existing SLOs continue working unchanged

Indicator Type Support:

  • Ratio: Uses increase() for traffic calculation
  • Latency: Uses histogram _count metrics with le="" label selector
  • LatencyNative: Uses histogram_count(sum(increase(...))) for native histograms
  • BoolGauge: Uses count_over_time() for boolean gauge observations
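Illustrative PromQL shapes for the traffic (N_alert) term per indicator type — metric names, windows, and selectors here are placeholders, not the exact expressions Pyrra generates:

```promql
# Ratio: total request count over the alert window
sum(increase(http_requests_total[1h]))

# Latency: classic histogram _count; le="" selects only the total-traffic series
sum(increase(http_request_duration_seconds_count{le=""}[1h]))

# LatencyNative: observation count of a native histogram
histogram_count(sum(increase(http_request_duration_seconds[1h])))

# BoolGauge: number of boolean observations in the window
sum(count_over_time(probe_success[1h]))
```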

API Changes

Protobuf (proto/objectives/v1alpha1/objectives.proto):

  • Added burn_rate_type field to Objective message
  • Values: "static" (default), "dynamic"
  • Full end-to-end transmission from CRD → Backend → API → UI

UI Changes

Core Components:

  • List Page (ui/src/List.tsx): Added "Burn Rate" column with sortable badges
  • Detail Page (ui/src/Detail.tsx): Added burn rate type badge with traffic context
  • Alerts Table (ui/src/AlertsTable.tsx): Added "Error Budget Consumption" column
  • Threshold Display (ui/src/components/BurnRateThresholdDisplay.tsx): Real-time dynamic threshold calculation
  • Burn Rate Graph (ui/src/components/BurnrateGraph.tsx): Dynamic threshold visualization

User Experience Enhancements:

  • Visual Indicators: Green "Dynamic" badges vs gray "Static" badges with appropriate icons
  • Enhanced Tooltips: Context-aware explanations showing traffic impact on alert sensitivity
  • Real-Time Calculations: Live threshold values instead of placeholder text
  • Traffic Context: Shows current traffic ratio and above/below average status
  • Error Handling: Graceful degradation for missing metrics with meaningful error messages

Backend Alert Rules:

  • Alert rules optimized to use recording rules for SLO window calculation
  • Reduces Prometheus evaluation load (rules evaluated every 30s)
  • Maintains accuracy while improving performance

Testing Evidence

Mathematical Validation

Core Concept: Dynamic burn rates maintain consistent absolute error budget consumption regardless of traffic volume.

Mathematical Proof:

Alert fires when: error_rate > (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Since error_rate = errors / N_alert, we can substitute:

errors / N_alert > (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Multiply both sides by N_alert:

errors > N_SLO × E_budget_percent × (1 - SLO_target)

Since N_SLO × (1 - SLO_target) = E_budget (absolute error budget for SLO period):

errors > E_budget_percent × E_budget

Result: The N_alert terms cancel out! Alerts fire at the same absolute error count regardless of traffic.

Example Validation:

Given:

  • SLO target: 99% (so 1 - SLO_target = 0.01)
  • E_budget_percent: 0.02 (2% of error budget per alert window)
  • N_SLO (30d): 1,000,000 requests
  • E_budget (absolute): 1,000,000 × 0.01 = 10,000 errors allowed in 30d

High Traffic Scenario:

  • N_alert (1h): 10,000 requests
  • Traffic Ratio: 1,000,000 / 10,000 = 100x
  • Dynamic Threshold: 100 × 0.02 × 0.01 = 0.02 (2%)
  • Absolute errors needed: 10,000 × 0.02 = 200 errors

Low Traffic Scenario:

  • N_alert (1h): 1,000 requests
  • Traffic Ratio: 1,000,000 / 1,000 = 1,000x
  • Dynamic Threshold: 1,000 × 0.02 × 0.01 = 0.2 (20%)
  • Absolute errors needed: 1,000 × 0.2 = 200 errors

Same absolute threshold (200 errors), vastly different error rate thresholds (2% vs 20%)!

Benefits:

  • Prevents false positives: During low traffic, 20 errors out of 1,000 (2%) won't alert because threshold is 20%
  • Maintains sensitivity: During high traffic, 200 errors out of 10,000 (2%) will alert because threshold is 2%
  • Consistent behavior: Always alerts when 200 errors occur (2% of the 10,000 error budget)

Validation Results:

  • ✅ Window scaling correctly adapts to different SLO periods (28d → 30d)
  • ✅ Recording rules use appropriate PromQL functions (rate, increase)
  • ✅ Alert thresholds correctly implement dynamic formula
  • ✅ E_budget_percent thresholds correctly map from static factors (14→1/48, 7→1/16, 2→1/14, 1→1/7)
  • ✅ Multi-window logic uses consistent traffic scaling
  • ✅ All indicator types (ratio, latency, latencyNative, boolGauge) validated
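The factor-to-budget mapping above can be spot-checked directly (a sketch, not project code; the constants match those that appear in the generated rules):

```python
from fractions import Fraction

# Static factor -> E_budget_percent mapping (constants, independent of SLO period)
FACTOR_TO_BUDGET_PERCENT = {
    14: Fraction(1, 48),
    7: Fraction(1, 16),
    2: Fraction(1, 14),
    1: Fraction(1, 7),
}

# The 0.020833 constant seen in generated alert expressions is 1/48 rounded:
assert round(float(FACTOR_TO_BUDGET_PERCENT[14]), 6) == 0.020833
assert round(float(FACTOR_TO_BUDGET_PERCENT[7]) * 100, 2) == 6.25   # 6.25% per window
assert round(float(FACTOR_TO_BUDGET_PERCENT[2]) * 100, 2) == 7.14   # 7.14%
assert round(float(FACTOR_TO_BUDGET_PERCENT[1]) * 100, 2) == 14.29  # 14.29%
```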

UI Regression Testing

Test Environment:

  • 16 SLOs total (4 static, 12 dynamic)
  • Multiple indicator types tested
  • Both working and broken metrics scenarios
  • Minikube cluster with kube-prometheus stack

Regression Testing Results:

  • Zero regressions found - All original Pyrra functionality preserved
  • ✅ Static SLO behavior identical to baseline (except intentional enhancements)
  • ✅ 6 intentional new features successfully integrated
  • ✅ No visual glitches, layout issues, or console errors
  • ✅ Mixed static/dynamic environment stable

Production Build Validation:

  • ✅ All production build tests passed
  • ✅ Critical missing metrics fixes working perfectly (no white page crash)
  • ✅ All indicator types working correctly
  • ✅ Graceful error handling for missing/broken metrics
  • ✅ Performance acceptable (< 3 seconds page load)

Alert Firing Validation

Validated Results:

  • ✅ Synthetic traffic generation working (20 req/sec with configurable error rate)
  • ✅ Alert state transitions detected: inactive → pending → firing
  • ✅ Both static and dynamic alerts fire correctly
  • ✅ Dynamic alerts demonstrate improved sensitivity

Browser Compatibility Testing

Browsers Tested:

  • ✅ Chrome (primary development browser)
  • ✅ Firefox (full compatibility confirmed)
  • ⚠️ Edge (not tested - assumed compatible as Chromium-based)

Graceful Degradation Testing:

  • ✅ Network throttling: Proper loading states and retry logic
  • ✅ API failures: Meaningful error messages
  • ✅ Prometheus unavailability: Graceful fallback displays
  • ✅ Missing metrics: No crashes, appropriate error states

Breaking Changes

None. This feature is completely opt-in and backward compatible:

  • Default behavior: burnRateType: static (existing behavior)
  • Existing SLOs: Continue working unchanged
  • Migration: Add burnRateType: dynamic to enable new behavior
  • No schema changes that break existing deployments

Migration Guide

Enabling Dynamic Burn Rates

For new SLOs, add burnRateType: dynamic to the alerting section:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: my-service-slo
spec:
  target: "99"
  window: 28d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{code=~"5.."}
      total:
        metric: http_requests_total
  alerting:
    name: MyServiceErrorBudgetBurn
    burnRateType: dynamic  # Add this line
    burnrates: true

For existing SLOs, edit the SLO YAML and add burnRateType: dynamic:

kubectl edit slo my-service-slo -n monitoring

Validation

After enabling dynamic burn rates:

  1. Check Prometheus Rules: Verify dynamic expressions generated

    kubectl get prometheusrule -n monitoring
  2. Check UI: Verify green "Dynamic" badge appears on SLO list page

  3. Check Thresholds: Verify calculated threshold values display in alerts table

  4. Monitor Alerts: Observe alert behavior with traffic variations

Rollback

To revert to static burn rates:

alerting:
  burnRateType: static  # Change back to static

Or remove the field entirely (defaults to static).

Examples

Four comprehensive examples are provided in the examples/ directory:

  1. examples/dynamic-burn-rate-ratio.yaml - Ratio indicator (API success rate)
  2. examples/dynamic-burn-rate-latency.yaml - Latency indicator (histogram-based)
  3. examples/dynamic-burn-rate-latency-native.yaml - Native histogram latency
  4. examples/dynamic-burn-rate-bool-gauge.yaml - Boolean gauge (availability)

Each example includes:

  • Clear comments explaining use cases
  • Proper metric selectors
  • Recommended configuration values
  • Traffic-aware alerting benefits

Design Decisions

1. Opt-In Feature (Not Default)

Decision: Dynamic burn rates require explicit burnRateType: dynamic configuration

Rationale:

  • Preserves existing behavior for current users
  • Allows gradual adoption and testing
  • Reduces risk of unexpected alert behavior changes
  • Users can evaluate feature before full deployment

2. Latency Indicator Label Selector

Decision: Always add le="" label selector when querying latency recording rules

Rationale:

  • Latency indicators create TWO recording rules: total (le="") and success (le="0.1")
  • Without le="", sum() aggregation includes BOTH rules (2x traffic)
  • Explicit le="" selector ensures only total traffic is counted
  • Critical for accurate dynamic threshold calculation
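A hypothetical concrete case (the recording-rule name below is illustrative):

```promql
# Suppose a latency SLO produces two recording-rule series:
#   http_request_duration_seconds:increase4w{slo="my-slo", le=""}     <- total requests
#   http_request_duration_seconds:increase4w{slo="my-slo", le="0.1"}  <- requests under the latency target

# Without le="": sums BOTH series, double-counting traffic
sum(http_request_duration_seconds:increase4w{slo="my-slo"})

# With le="": counts only total traffic, as the dynamic threshold requires
sum(http_request_duration_seconds:increase4w{slo="my-slo", le=""})
```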

3. Error Handling Strategy

Decision: Graceful degradation with fallback displays instead of crashes

Rationale:

  • Production environments may have missing or misconfigured metrics
  • Users need visibility into SLO configuration even with data issues
  • Fallback to "Traffic-Aware" or "No data" better than white page crash
  • Console warnings help debugging without breaking UI

Documentation

User-Facing Documentation

Updated Files:

  • README.md - Added dynamic burn rate feature section
  • examples/README.md - Added dynamic SLO examples with explanations
  • examples/*.yaml - Four comprehensive example configurations

Development Documentation (Fork Only)

Comprehensive Internal Documentation (40+ documents in .dev-docs/):

  • Implementation summaries and session notes
  • Testing procedures and validation reports
  • Mathematical correctness validation
  • Performance benchmarks and optimization analysis
  • Browser compatibility matrices
  • Migration guides and troubleshooting
  • Development workflow and standards

Development Tools (Fork Only - cmd/ directory):

  • Query performance validation tools
  • Threshold calculation testing tools
  • Alert rule validation tools
  • Recording rule validation tools
  • Synthetic metric generation for testing
  • Performance monitoring tools

References

Methodology:

  • "Error Budget Is All You Need - Part 2" (author's blog post describing this dynamic burn rate approach; see Overview)

Before/After Examples

Example 1: Static vs Dynamic Threshold Comparison

Scenario: API service with 99% SLO target, 30d window

Static Burn Rate (Factor 14):

Threshold = 14 × (1 - 0.99) = 0.14 (14% error rate)
  • Same threshold regardless of traffic
  • 14% error rate required to trigger alert
  • Does not adapt to traffic patterns

Dynamic Burn Rate (High Traffic):

N_SLO (30d): 1,000,000 requests
N_alert (1h): 10,000 requests
Traffic Ratio: 100x
E_budget_percent: 0.02 (2% of error budget)

Threshold = 100 × 0.02 × 0.01 = 0.02 (2% error rate)
Absolute errors needed = 10,000 × 0.02 = 200 errors
  • Lower threshold percentage (2%)
  • Same absolute errors needed (200 errors)

Dynamic Burn Rate (Low Traffic):

N_SLO (30d): 1,000,000 requests
N_alert (1h): 1,000 requests
Traffic Ratio: 1,000x
E_budget_percent: 0.02 (2% of error budget)

Threshold = 1,000 × 0.02 × 0.01 = 0.2 (20% error rate)
Absolute errors needed = 1,000 × 0.2 = 200 errors
  • Higher threshold percentage (20%)
  • Same absolute errors needed (200 errors)
  • Prevents false positives from small sample sizes

Key Insight: Both scenarios require the same absolute number of errors (200), but the error rate thresholds differ dramatically (2% vs 20%).
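The comparison can be reproduced with a short sketch (illustrative Python, not the project's Go implementation):

```python
def static_threshold(factor, slo_target):
    # Fixed multiplier, independent of traffic
    return factor * (1 - slo_target)

def dynamic_threshold(n_slo, n_alert, e_budget_percent, slo_target):
    # Traffic-aware: (N_SLO / N_alert) * E_budget_percent * (1 - SLO_target)
    return (n_slo / n_alert) * e_budget_percent * (1 - slo_target)

TARGET = 0.99
assert round(static_threshold(14, TARGET), 2) == 0.14  # 14% regardless of traffic

high = dynamic_threshold(1_000_000, 10_000, 0.02, TARGET)  # high-traffic hour
low = dynamic_threshold(1_000_000, 1_000, 0.02, TARGET)    # low-traffic hour
assert round(high, 2) == 0.02 and round(low, 2) == 0.2     # 2% vs 20%
assert round(10_000 * high) == round(1_000 * low) == 200   # same 200 absolute errors
```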

Example 2: UI Display Comparison

Static SLO - List Page:

┌─────────────────────────────────────────────────┐
│ Name: apiserver-requests-static                 │
│ Burn Rate: [Static] 🔒                          │
│ Availability: 99.95%                            │
│ Budget: 95.2%                                   │
└─────────────────────────────────────────────────┘

Dynamic SLO - List Page:

┌─────────────────────────────────────────────────┐
│ Name: apiserver-requests-dynamic                │
│ Burn Rate: [Dynamic] 👁                         │
│ Availability: 99.95%                            │
│ Budget: 95.2%                                   │
└─────────────────────────────────────────────────┘

Static SLO - Alerts Table:

┌──────────┬──────────┬────────────┬────────┬───────────┐
│ Severity │ Exhaust  │ Factor     │ Thresh │ Short     │
├──────────┼──────────┼────────────┼────────┼───────────┤
│ critical │ 2d       │ 14         │ 0.140  │ 0.123     │
│ critical │ 6d       │ 7          │ 0.070  │ 0.456     │
│ warning  │ 12d      │ 2          │ 0.020  │ 0.789     │
│ warning  │ 30d      │ 1          │ 0.010  │ 0.234     │
└──────────┴──────────┴────────────┴────────┴───────────┘

Dynamic SLO - Alerts Table:

┌──────────┬──────────┬────────────┬────────┬───────────┐
│ Severity │ Exhaust  │ Error Bdgt │ Thresh │ Short     │
├──────────┼──────────┼────────────┼────────┼───────────┤
│ critical │ 2d       │ 2.08%      │ 0.0046 │ 0.123     │
│ critical │ 6d       │ 6.25%      │ 0.0137 │ 0.456     │
│ warning  │ 12d      │ 7.14%      │ 0.0156 │ 0.789     │
│ warning  │ 30d      │ 14.29%     │ 0.0313 │ 0.234     │
└──────────┴──────────┴────────────┴────────┴───────────┘

Tooltip Comparison:

Static SLO Tooltip:

Static Burn Rate
Uses fixed multipliers (14x, 7x, 2x, 1x) for alert thresholds.
Threshold = 14 × (1 - 0.99) = 0.140

Dynamic SLO Tooltip:

Dynamic Burn Rate
Adapts thresholds based on traffic patterns.

Error Budget: 2.08% (burns 2.08% of budget per alert window)
Traffic Ratio: 100x (high traffic)
Dynamic Threshold: 0.0208 (≈2%) - Adapts to traffic
Static Threshold: 0.14 (14%) - Same threshold regardless of traffic

Formula: (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Example 3: Alert Rule Comparison

Static Alert Rule (PrometheusRule):

- alert: ApiserverRequestsStaticErrorBudgetBurn
  expr: |
    (
      apiserver_request:burnrate5m{slo="apiserver-requests-static"} > (14 * (1 - 0.99))
      and
      apiserver_request:burnrate1h4m{slo="apiserver-requests-static"} > (14 * (1 - 0.99))
    )
  labels:
    severity: critical
    long: 1h4m
    short: 5m

Dynamic Alert Rule (PrometheusRule):

- alert: ApiserverRequestsDynamicErrorBudgetBurn
  expr: |
    (
      apiserver_request:burnrate5m{slo="apiserver-requests-dynamic"} > 
      scalar((sum(apiserver_request:increase30d{slo="apiserver-requests-dynamic"}) / 
              sum(increase(apiserver_request_total{verb="GET"}[1h4m]))) * 0.020833 * (1 - 0.99))
      and
      apiserver_request:burnrate1h4m{slo="apiserver-requests-dynamic"} > 
      scalar((sum(apiserver_request:increase30d{slo="apiserver-requests-dynamic"}) / 
              sum(increase(apiserver_request_total{verb="GET"}[1h4m]))) * 0.020833 * (1 - 0.99))
    )
  labels:
    severity: critical
    long: 1h4m
    short: 5m

Key Differences:

  1. Static uses fixed multiplier (14)
  2. Dynamic calculates traffic ratio (N_SLO / N_alert)
  3. Dynamic uses E_budget_percent (0.020833 = 1/48)
  4. Dynamic uses recording rules for SLO window (optimized)
  5. Both use same burn rate recording rules for error rate

yairst added 30 commits August 22, 2025 19:10
- Add DynamicBurnRate type and configuration
- Implement GetRemainingErrorBudget calculation
- Clean up redundant code in rules.go
- Improve Windows() function readability
- Add dynamic factor scaling based on remaining error budget
- Add space before inline comments
- Remove trailing whitespace
- Normalize newlines in functions
- Improve code readability
Add dynamic burn rate calculation that uses error budget percentages:
- 1/48 (2.08%) per hour (50% per day)
- 1/16 (6.25%) per 6h (100% per 4 days)
- 1/14 (7.14%) per day
- 1/7 (14.28%) per 4 days

Implements dynamic burn rate calculation formula:
(increase[slo_window] / increase[alert_window]) * error_budget_percent

The implementation preserves existing window periods while adding
proper error budget burn percentages for more accurate alerting.
- Add core dynamic burn rate logic for Ratio indicators
- Implement buildAlertExpr() and buildDynamicAlertExpr() methods
- Add dynamic threshold calculation: (N_SLO/N_alert) × E_budget_percent_threshold × (1-SLO_target)
- Support traffic-aware alerting with proper PromQL generation
- Maintain backward compatibility with static burn rate as default
- Add comprehensive unit tests for both static and dynamic modes
- Update test expectations to reflect 'static' as default BurnRateType

Dynamic burn rate adapts alert thresholds to traffic volume:
- Higher traffic periods get proportionally higher thresholds
- Lower traffic periods get lower thresholds
- Uses increase() functions for event counting over SLO and alert windows
- Properly handles metric selectors and label matchers for error/total metrics

All tests passing. Ready for extension to other indicator types.
… summary

- Create comprehensive sli-indicator-types.md explaining all four indicator types
- Document purpose, use cases, and burn rate calculations for each type
- Explain why different indicator types exist and how they map to different metric formats
- Update FEATURE_IMPLEMENTATION_SUMMARY.md with latest implementation status
- Add clarifications about E_budget_percent_threshold being constants
- Document current capabilities and remaining work priorities
- Explain implementation strategy for indicator type support
- Fix 'Burn Rate Calculation' → 'Error Rate Calculation' in SLI indicator types
- Update success criteria to reflect completed work:
  - API support: ✅ Complete
  - Dynamic alert thresholds: ✅ Complete (for Ratio indicators)
  - Traffic adaptation: ✅ Complete (for Ratio indicators)
  - Backward compatibility: ✅ Complete
  - Documentation: ✅ Complete
  - Performance validation: ✅ Complete (in tests)
- Update status to reflect Priority 1 completion
Core Implementation:
- Extended buildDynamicAlertExpr() to support Latency indicators
- Updated Burnrates() method for Latency case to use dynamic windows
- Added helper methods buildTotalSelector() and buildLatencyTotalSelector()

⚡ Performance Optimization:
- Both Ratio and Latency indicators now use recording rules for efficiency
- Alert expressions use pre-computed burn rates + dynamic threshold calculation
- Significantly reduces Prometheus evaluation load vs inline calculations

Comprehensive Testing:
- Added TestObjective_DynamicBurnRate_Latency() test
- Extended TestObjective_buildAlertExpr() with Latency test cases
- Updated test expectations for optimized recording rule usage
- All tests pass with new implementation

Current Support Status:
- ✅ Ratio Indicators: Full dynamic burn rate support
- ✅ Latency Indicators: Full dynamic burn rate support (NEW)
- ⏳ LatencyNative & BoolGauge: Fall back to static (TODO)

Examples & Documentation:
- Added examples/latency-dynamic-burnrate.yaml with practical configs
- Updated feature implementation summary and SLI documentation
- Documented performance improvements and implementation approach

The implementation is production-ready and maintains full backward compatibility.
- Add CORE_CONCEPTS_AND_TERMINOLOGY.md with authoritative definitions for:
  * Error Rate, Error Budget, Burn Rate concepts
  * Static vs Dynamic burn rate threshold differences
  * Traffic scaling factor (N_SLO / N_alert) explanation
  * False positive/negative prevention mechanisms
  * Mathematical relationships and PromQL patterns

- Update FEATURE_IMPLEMENTATION_SUMMARY.md to reference core concepts doc
- Streamline implementation summary to focus on status and progress
- Establish single source of truth for conceptual understanding

These docs capture the corrected understanding of dynamic burn rate
concepts for future code review sessions and development work.
Core Implementation Fixes:
- Fix multi-window logic to use N_long for both windows (consistent traffic scaling)
- Remove unused dynamicBurnRateExpr() function (code cleanup)
- Fix DynamicWindows() to use scaled periods from Windows(sloWindow)
- Map E_budget_percent_thresholds by static factor hierarchy (14→1/48, etc.)

Key Behavioral Corrections:
- Both short and long windows now use N_long denominator for traffic scaling
- Window periods properly scale with any SLO duration via Windows() function
- E_budget_percent_thresholds remain constant across SLO period choices
- Window.Factor correctly serves as E_budget_percent_threshold in dynamic mode

Documentation Updates:
- Add multi-window logic explanation to CORE_CONCEPTS_AND_TERMINOLOGY.md
- Add Window.Factor dual purpose design documentation
- Add window period scaling details and architectural insights
- Update FEATURE_IMPLEMENTATION_SUMMARY.md with recent fixes
- Correct formula from (N_SLO / N_alert) to (N_SLO / N_long)

All tests pass including TestObjective_DynamicBurnRate and TestObjective_DynamicBurnRate_Latency.
Mathematical implementation now correctly matches the expected dynamic burn rate formula.
✅ Code Review Completed - Production Ready Status
- Updated FEATURE_IMPLEMENTATION_SUMMARY.md with code review completion
- Confirmed production readiness for Ratio & Latency indicators
- Added edge case handling validation results
- Updated PromQL examples to show recording rule implementation
- Documented comprehensive test coverage completion

✅ Session Continuation Updates
- Updated SESSION_CONTINUATION_PROMPT.md with correct status
- Removed non-existent compilation error references
- Added production readiness confirmation
- Updated next priority tasks for remaining indicator types

Status: Dynamic burn rate implementation for Ratio & Latency indicators
is production-ready and fully validated through comprehensive code review.
…ypes

- Extend dynamic burn rate support to LatencyNative and BoolGauge indicators
- Add buildLatencyNativeTotalSelector() and buildBoolGaugeSelector() helper methods
- Implement traffic-aware expressions for native histograms and boolean gauges
- Add dynamic window logic to LatencyNative and BoolGauge cases in Burnrates()
- Replace hardcoded alert expressions with unified buildAlertExpr() method
- Add comprehensive test coverage for all indicator types
- All backend dynamic burn rate logic now complete and production-ready

Backend implementation status:
✅ Ratio indicators - Dynamic burn rate complete
✅ Latency indicators - Dynamic burn rate complete
✅ LatencyNative indicators - Dynamic burn rate complete
✅ BoolGauge indicators - Dynamic burn rate complete

Next: UI integration and Grafana dashboard updates (future sessions)
- Create new prompts/ folder for session organization
- Move existing session prompts from .dev-docs/ to prompts/
- Add NEXT_SESSION_PROMPT.md focused on React UI integration
- Add prompts/README.md documenting session strategy

Next session focus:
- React UI integration for BurnRateType selection
- Update SLO forms to support dynamic burn rate configuration
- Backend implementation complete and ready for frontend work
- NEXT_SESSION_PROMPT.md → UI_INTEGRATION_SESSION_PROMPT.md
- SESSION_CONTINUATION_PROMPT.md → BACKEND_COMPLETION_SESSION_PROMPT.md
- Update README.md with new prompt names and usage guide

Better organization:
- Clear indication of session purpose and focus area
- Active vs completed session status
- Easy identification of which prompt to use next
- Add burn rate type display system with color-coded badges
- Implement burn rate column in SLO list with sorting and visibility controls
- Add burn rate information section to SLO detail pages
- Create TypeScript infrastructure with BurnRateType enum and utilities
- Add dynamic/static icons for visual distinction
- Implement responsive design with tooltips and accessibility
- Create demo SLO configurations for testing
- Add comprehensive UI documentation
- Update feature implementation status
- Prepare next session prompt for API integration

The UI foundation is now complete with mock detection logic.
Next phase: API integration to eliminate mock data and connect
to actual backend burn rate type field.
✅ All 5 core tasks completed:

1. Added Alerting message with burnRateType field to protobuf schema
2. Updated Go conversion functions (ToInternal/FromInternal) in objectives.go
3. Regenerated TypeScript protobuf definitions and implementations
4. Replaced mock detection logic with real API field access in burnrate.tsx
5. Validated end-to-end API integration with comprehensive testing

Technical Implementation:
- Protobuf: Added Alerting message with string burn_rate_type field
- Go: Complete bidirectional conversion between internal structs and protobuf
- TypeScript: Manual updates for Windows compatibility with proper interfaces
- Frontend: Real API field access (objective.alerting?.burnRateType)
- Testing: Round-trip validation for both 'dynamic' and 'static' types

Status: API Integration Complete - Production Ready
Next: Priority 2 Alert Display Updates
yairst added 29 commits October 8, 2025 20:49
…mic burn rates

- Verified all 5 generic rules work identically for static and dynamic SLOs
- Confirmed Grafana dashboards display both SLO types correctly without modifications
- Validated error budget calculations use same formula for both types
- Tested list and detail dashboards with mixed static/dynamic SLOs
- Documented pre-existing Rate graph query bug (unrelated to feature)
- Created comprehensive validation session document
- Result: NO CHANGES NEEDED - dashboards work perfectly with dynamic SLOs
- Analyzed BurnRateThresholdDisplay implementation (uses raw metrics)
- Validated recording rules provide 40x speedup for ratio indicators
- Created validation tools for performance testing
- Documented optimization strategy and performance benchmarks
- Created sub-tasks 7.10.1-7.10.4 for implementation phase

Analysis documents:
- TASK_7.10_UI_QUERY_OPTIMIZATION_ANALYSIS.md - Full analysis
- TASK_7.10_VALIDATION_RESULTS.md - Performance benchmarks
- TASK_7.10_COMPLETION_SUMMARY.md - Phase 1 summary

Validation tools:
- cmd/validate-ui-query-optimization - Performance comparison
- cmd/test-burnrate-threshold-queries - Query validation

Key findings:
- Ratio indicators: 694ms -> 17ms (40x speedup potential)
- Latency indicators: 43ms -> 26ms (1.7x speedup potential)
- Recording rules exist but UI doesn't use them yet
- Implementation will happen in sub-tasks 7.10.1-7.10.4
- Fixed test queries to use only SLO window recording rules (not alert windows)
- Added statistical rigor: 10 iterations per query with min/max/avg analysis
- Added BoolGauge indicator testing (all three types now covered)
- Executed tests and documented real performance measurements:
  * Ratio: 7.17x speedup (48.75ms -> 6.80ms)
  * Latency: 2.20x speedup (6.34ms -> 2.89ms)
  * BoolGauge: No benefit (already fast at 3ms)
- Clarified terminology: SLO window vs alert windows
- Key finding: Only SLO window has increase/count recording rules
- Updated task 7.10.2 with validation findings and implementation guide
- Created comprehensive implementation guide for next task
- Add hybrid query approach: recording rules for SLO window + inline for alert windows
- Implement getBaseMetricName() to strip metric suffixes for recording rule naming
- Implement getTrafficRatioQueryOptimized() for optimized query generation
- Optimize ratio indicators (7.17x query speedup) and latency indicators (2.20x speedup)
- Skip boolGauge optimization (already fast at 3ms)
- Fix performance monitoring bug (was showing 677s due to never-reset timer)
- Primary benefit: Prometheus load reduction, not UI speed (network overhead dominates)
- Maintains backward compatibility with fallback to raw metrics

Task 7.10.2 complete
- Add references to TASK_7.10_VALIDATION_RESULTS.md, TASK_7.10_IMPLEMENTATION.md, and TASK_7.10.1_TEST_IMPROVEMENTS.md
- Include key findings from 7.10.2: network overhead dominates, main benefit is Prometheus load reduction
- Update 7.10.4 to reflect validation already completed in 7.10.2
- Clarify that optimization provides minimal UI benefit but significant infrastructure benefit
- Created comprehensive decision document analyzing backend optimization
- Documented current implementation vs optimized pattern
- Calculated performance benefits: 7x for ratio, 2x for latency indicators
- Production impact: ~1.77M seconds/year saved for ratio indicators at scale
- Decision: IMPLEMENT optimization (primary benefit: Prometheus load reduction)
- Added Task 7.10.5 to implementation plan for backend optimization
- Updated feature implementation summary with Task 7.10.3 completion

Key findings:
- Alert rules evaluated every 30s (different profile than UI on-demand queries)
- Main benefit is infrastructure load reduction, not alert evaluation speed
- Hybrid approach: recording rule for SLO window + inline for alert windows
- Consistent with UI implementation (Task 7.10.2)
- Priority: MEDIUM-HIGH, implement after Task 7.10.4

References:
- .dev-docs/TASK_7.10.3_BACKEND_OPTIMIZATION_DECISION.md
- .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md
- .kiro/specs/dynamic-burn-rate-completion/tasks.md
- Added getBaseMetricName() helper function to strip metric suffixes
- Updated buildDynamicAlertExpr() for ratio indicators to use hybrid approach (recording rules for SLO window)
- Updated buildDynamicAlertExpr() for latency indicators to use hybrid approach
- Skipped boolGauge optimization (already fast, no benefit)
- Fixed UI regression: BurnRateThresholdDisplay now uses actual SLO window instead of hardcoded 30d
- This fixes 'no data available' issue for synthetic SLOs with 1d window
- Backend optimization provides 7x speedup for ratio, 2x for latency indicators
- Primary benefit: Prometheus CPU/memory load reduction at scale
- Fixed critical latency threshold bug (2x traffic counting)
  - Added le="" label selector in UI and backend for latency indicators
  - Prevents summing both total and success recording rules
- Fixed BurnrateGraph to show dynamic thresholds over time
  - Changed from instant to range queries for traffic calculation
- Fixed React console warnings
  - Toggle: Added readOnly attribute
  - Detail.tsx: Fixed duplicate keys
  - AlertsTable: Added Fragment keys
  - DurationGraph: Added null checks
- Removed debug logging from production code
- Validated all indicator types (Ratio, Latency, BoolGauge, LatencyNative)
- Updated documentation and steering standards
- Created SLO generator tool with window variation (7d, 28d, 30d)
- Created performance monitoring tool for metrics collection
- Created automated test script for health checks and validation
- Generated 50 test SLOs ready for scale testing
- Consolidated documentation into TASK_7.11_TESTING_INFRASTRUCTURE.md
- Created TASK_7.12_MANUAL_TESTING_GUIDE.md for interactive testing
- Updated tasks.md with proper task structure and references
- Cleaned up redundant documentation files
- Executed baseline performance test with 16 current SLOs
- Applied and tested 50 additional SLOs (medium scale: 66 total)
- Applied and tested 100 additional SLOs (large scale: 116 total)
- Collected comprehensive performance metrics (API response time, memory usage, Prometheus query performance)
- Created PRODUCTION_PERFORMANCE_BENCHMARKS.md with detailed analysis
- Key findings: Sub-linear API scaling, near-constant memory usage, stable Prometheus performance
- Production readiness assessment: READY
- Updated gitignore to exclude temporary test binaries and JSON metrics files
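The threshold math and the le="" traffic query described above can be sketched as follows. This is a minimal illustration under assumed names (`dynamicThreshold`, `latencyTrafficQuery` are hypothetical helpers, not the actual `slo/rules.go` API); the real `buildDynamicAlertExpr()` emits a full PromQL alert expression and uses recording rules for the SLO-window count.

```go
package main

import "fmt"

// dynamicThreshold implements the PR's formula:
//   threshold = (N_SLO / N_alert) * E_budget_percent * (1 - target)
// nSLO and nAlert are request counts over the SLO window and the alert's
// long window; eBudget is the error budget fraction the alert may consume
// (1/48, 1/16, 1/14, or 1/7, depending on the window pair).
func dynamicThreshold(nSLO, nAlert, eBudget, target float64) float64 {
	return (nSLO / nAlert) * eBudget * (1 - target)
}

// latencyTrafficQuery sketches the traffic query for latency indicators:
// the le="" selector keeps only the histogram's base _count series, so the
// bucket-filtered success recording rule is not summed in a second time
// (the 2x traffic-counting bug fixed above).
func latencyTrafficQuery(metric, window string) string {
	return fmt.Sprintf(`sum(increase(%s_count{le=""}[%s]))`, metric, window)
}

func main() {
	// 99% target, fast-burn window pair (E_budget = 1/48), with roughly
	// 667x more traffic in the SLO window than in the 1h alert window.
	fmt.Println(dynamicThreshold(1_000_000, 1_500, 1.0/48, 0.99))
	fmt.Println(latencyTrafficQuery("http_request_duration_seconds", "1h"))
}
```

Note how the traffic ratio scales the threshold: more traffic in the alert window (larger `nAlert`) lowers the threshold, fewer requests raise it, so the alert always fires at the same absolute error count.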
…ul degradation

- Tested Chrome and Firefox (both PASS - identical behavior)
- Tested graceful degradation: network throttling, API failures, Prometheus unavailability (all PASS)
- Tested migration: static to dynamic, rollback, backward compatibility (all PASS)
- Created browser compatibility matrix with test results and recommendations
- Created comprehensive migration guide (validated during testing)
- Discovered and documented 3 issues (1 HIGH severity, 2 LOW severity)
- Created Task 7.12.1 for critical bug fix (white page crash for missing metrics)

Deliverables:
- .dev-docs/BROWSER_COMPATIBILITY_MATRIX.md - Complete test results
- .dev-docs/MIGRATION_GUIDE.md - Migration instructions and best practices
- .dev-docs/TASK_7.12_TESTING_COMPLETION_SUMMARY.md - Testing summary

Production readiness: Ready for environments with reliable metrics. Fix Task 7.12.1 before deploying to environments with potentially missing metrics.
- Fix BurnrateGraph white page crash for dynamic SLOs with missing metrics
  - Add comprehensive null/undefined checks before Array.from() calls
  - Wrap dynamic threshold calculation in try-catch for graceful error handling
  - Fallback to static threshold when traffic data is missing/broken
  - Add console warnings for debugging

- Fix Detail page showing 100% instead of 'No data' for missing metrics
  - Change default values from errors=0, total=1 to undefined
  - Tiles now correctly display 'No data' (consistent with main page)

- Update documentation with fix details and testing coverage
- Performed systematic regression testing against upstream-comparison branch
- Validated production build with all recent fixes (Task 7.12.1)
- Zero regressions found - all original Pyrra functionality preserved
- All 4 production build tests passed successfully
- Tested 16 SLOs (4 static, 12 dynamic) in mixed environment

Regression Testing Results:
- Static SLO behavior identical to baseline (except intentional enhancements)
- 6 intentional new features successfully integrated
- No visual glitches, layout issues, or console errors
- Auto-reload confirmed as original Pyrra behavior (not a regression)

Production Build Validation:
- Critical Task 7.12.1 fixes working perfectly (no white page crash)
- All indicator types working correctly (ratio, latency, latencyNative, boolGauge)
- Graceful error handling for missing/broken metrics
- Performance acceptable (< 3 seconds page load)
- 1 minor cosmetic issue found (false console warning - not blocking)

Key Findings:
- Backend service required for proper burn rate type detection
- Mixed static/dynamic environment stable and working correctly
- Feature is production ready for upstream contribution

Documentation:
- Created comprehensive test results document
- Created step-by-step testing procedure guide
- Created quick reference checklist
- Updated feature implementation summary

Status: PRODUCTION READY - Zero blockers, ready for upstream contribution
- Restructured Task 8 to focus on upstream integration (fetch/merge, file organization, production docs, PR description)
- Streamlined Task 9 to reference existing validation work (Tasks 1-7 already complete)
- Added UPSTREAM_CONTRIBUTION_PLAN.md with file organization strategy and PR preparation guide
- Emphasized keeping production documentation updates concise and proportional
- Removed duplicate testing/documentation tasks already completed in Tasks 1-7
- Added Task 8.0 as mandatory pre-merge cleanup step (must do before Task 8.1)
- Created comprehensive cleanup checklist in TASK_8.0_PRE_MERGE_CLEANUP_CHECKLIST.md
- Addresses manual code review findings:
  - Revert unintended changes (CONTRIBUTING.md, deployment manifests, index.html, etc.)
  - Move examples from .dev/ to examples/
  - Backend code cleanup (remove duplicates in slo/rules.go, unused code in slo/slo.go)
  - CRD cleanup (remove redundant variables)
  - Test file review and decisions
  - UI code review (Toggle.tsx, old docs)
  - Investigate filesystem.go changes and determine testing needs
  - Investigate proto changes
- Updated Task 9.3 to reference filesystem mode testing decision from Task 8.0
- Updated UPSTREAM_CONTRIBUTION_PLAN.md timeline to include Task 8.0
- Reverted unintended changes (pyrra-kubernetesDeployment.yaml, ui/public/index.html, filesystem.go)
- Removed unused code (~47 lines from slo/slo.go and CRD types)
- Updated comment format in slo/rules.go
- Moved ui/DYNAMIC_BURN_RATE_UI.md to .dev-docs/HISTORICAL_UI_DESIGN.md
- Updated CONTRIBUTING.md with ui/README.md reference
- Clarified architecture: test metric only needed in API server (main.go)
- Created comprehensive cleanup documentation

All tests passing, code compiles successfully.
- Add 4 dynamic burn rate examples (ratio, latency, latencyNative, boolGauge)
- Create concise examples/README.md (~70 lines, comparable to upstream)
- Use real metrics from actual services (apiserver, prometheus, pyrra)
- Minimal comments consistent with existing examples
- All examples verified showing actual data in Pyrra UI
- Delete redundant latency-dynamic-burnrate.yaml and simple-demo.yaml
- Add Task 8.5 for regex label selector investigation
- Update documentation to reflect 4 examples

Task 8.2 complete
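A hedged sketch of what one of these ratio examples might look like. The field name and placement of the new burn rate option (`burnRateType` directly under `spec`) are assumptions based on this PR's CRD description, and the metric selectors are illustrative:

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: apiserver-read-requests
  namespace: monitoring
spec:
  target: "99"
  window: 28d
  # Assumed field from this PR's CRD change; defaults to "static".
  burnRateType: dynamic
  indicator:
    ratio:
      errors:
        metric: apiserver_request_total{verb=~"LIST|GET",code=~"5.."}
      total:
        metric: apiserver_request_total{verb=~"LIST|GET"}
```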
- Task 8.4.1: Comprehensive upstream comparison testing
  - Tested regex selectors on upstream-comparison branch
  - Created test SLO configurations for validation
  - Confirmed no regressions from feature branch

- Task 8.4.2: Root cause analysis
  - Identified grouping creates multiple SLOs (upstream behavior)
  - Identified NaN display issue (upstream cosmetic bug)
  - Documented technical architecture and design

- Task 8.4.3: Solution implementation
  - Chose documentation approach (no code changes needed)
  - Updated KNOWN_LIMITATIONS.md with user guidance
  - Provided best practices and workarounds

Key findings:
- Regex selectors work correctly in both upstream and feature branch
- Multiple SLO behavior with grouping is existing upstream design
- NaN issue affects all SLOs universally (not regex-specific)
- No regressions introduced by dynamic burn rate feature
- Feature ready for upstream contribution

Documentation created:
- .dev-docs/UPSTREAM_COMPARISON_REGEX_SELECTORS.md (complete test results)
- .dev-docs/KNOWN_LIMITATIONS.md (user-facing guidance)
- .dev-docs/TASK_8.4.{1,2,3}_*.md (sub-task documentation)
- .dev-docs/TASK_8.4_COMPLETE_SUMMARY.md (overall summary)
- Fixed duplicate task 8.3 (renamed second one to 8.5)
- Marked task 8.4 as complete [x]
- Marked task 8.4.3 as complete [x] (was [-])
- Renumbered 'Create pull request' task from 8.5 to 8.6

Task order now:
- 8.3: Organize files for PR vs fork separation
- 8.4: Investigate regex label selector (COMPLETE)
- 8.5: Update production documentation
- 8.6: Create pull request description
- Add concise dynamic burn rate section to README.md with Dev.to article reference
- Enhance examples/README.md with usage guidance and migration notes
- Add inline comments to all 4 dynamic burn rate example files
- Correct mathematical explanations (high traffic = lower threshold)
- Document task completion and mathematical correction
- Follow 'concise and proportional' principle - dynamic burn rate is ONE feature

Files updated:
- README.md: Added Dynamic Burn Rate Alerting section
- examples/README.md: Enhanced dynamic burn rate examples section
- examples/dynamic-burn-rate-*.yaml: Added header and inline comments (4 files)
- .dev-docs/: Added task documentation and math correction notes
- .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md: Updated with task 8.5 completion
- Created comprehensive file categorization document
- Documented 10 categories: PR files vs fork files
- Defined preservation strategy for dev artifacts
- Provided 7-step action plan for file organization
- Added verification checklist with 14 items
- Updated feature implementation summary
- Remove all development-only files (.dev-docs, .kiro, cmd/, scripts/, prompts/, testing/)
- Remove custom Docker files (Dockerfile.custom, Dockerfile.dev)
- Update test expectations for le='' label on latency recording rules
- Update test expectations for errors recording rules on ratio indicators
- Add default Alerting field values to test objectives
- Changes reflect query optimization work (task 7.10) and file organization (task 8.3)

All tests passing, builds successful (backend + UI)
- Remove debug console.log statements from List.tsx and AlertsTable.tsx
- Improve error logging (console.log -> console.error with context)
- Part of code quality and standards review for upstream contribution

Cherry-picked UI changes from dev-tools-and-docs branch (d13b8fc)
@metalmatze
Member

Wow!
This is incredible!

First thing, I'll have to read your blog post and fully understand how things work.
Then, if we decide to add the feature to Pyrra, I'd like to know whether we should split it into several PRs. As it stands, it's going to be quite hard to land as one gigantic PR, and reviewing every file in a single PR probably won't be a good experience.
