@yairst (Contributor) commented Oct 18, 2025

Overview

This PR implements dynamic burn rate alerting that adapts alert thresholds to actual traffic patterns, preventing false positives during low-traffic periods and false negatives during high-traffic periods.

Motivation

Traditional static burn rate multipliers (14x, 7x, 2x, 1x) don't account for traffic variations, leading to:

  • False positives during low traffic (a handful of errors can push the error rate over the threshold because of small sample sizes)
  • False negatives during high traffic (a large absolute number of errors can stay below the rate threshold and go undetected)

Dynamic burn rates solve this by calculating thresholds that maintain consistent absolute error budget consumption regardless of traffic volume:

dynamic_threshold = (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Key Insight: This formula ensures alerts fire at the same absolute number of errors regardless of traffic. The threshold percentage adapts to traffic: lower during high traffic, higher during low traffic, but always requiring the same absolute error count.
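This insight can be sanity-checked in a few lines of plain Python (an illustrative sketch only; the actual implementation lives in the Go backend):

```python
def dynamic_threshold(n_slo, n_alert, e_budget_percent, slo_target):
    """(N_SLO / N_alert) * E_budget_percent * (1 - SLO_target)"""
    return (n_slo / n_alert) * e_budget_percent * (1 - slo_target)

# The error-rate threshold adapts to traffic over the alert window...
t_high = dynamic_threshold(1_000_000, 10_000, 0.02, 0.99)  # ~2% at high traffic
t_low = dynamic_threshold(1_000_000, 1_000, 0.02, 0.99)    # ~20% at low traffic

# ...but the absolute error count required to fire stays constant:
assert abs(10_000 * t_high - 1_000 * t_low) < 1e-6  # both ~200 errors
```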

This methodology is based on my blog post, "Error Budget Is All You Need - Part 2".

Implementation Summary

Backend Changes

Core Implementation (slo/rules.go):

  • Added buildDynamicAlertExpr() method implementing traffic-aware threshold calculation
  • Enhanced Burnrates() method to route between static and dynamic expressions
  • Integrated dynamic window logic with proper E_budget_percent mapping (1/48, 1/16, 1/14, 1/7)
  • Multi-window consistency: Both short and long windows use N_long for traffic scaling

CRD Changes (kubernetes/api/v1alpha1/servicelevelobjective_types.go):

  • Added BurnRateType field to SLO spec (values: "static", "dynamic")
  • Default: "static" (preserves existing behavior)
  • Backward compatible: Existing SLOs continue working unchanged

Indicator Type Support:

  • Ratio: Uses increase() for traffic calculation
  • Latency: Uses histogram _count metrics with le="" label selector
  • LatencyNative: Uses histogram_count(sum(increase(...))) for native histograms
  • BoolGauge: Uses count_over_time() for boolean gauge observations
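Illustrative PromQL shapes for the traffic (N_alert) term per indicator type — metric names, windows, and selectors here are placeholders, not the exact expressions Pyrra generates:

```promql
# Ratio: total request count over the alert window
sum(increase(http_requests_total[1h]))

# Latency: classic histogram _count; le="" selects only the total-traffic series
sum(increase(http_request_duration_seconds_count{le=""}[1h]))

# LatencyNative: observation count of a native histogram
histogram_count(sum(increase(http_request_duration_seconds[1h])))

# BoolGauge: number of boolean observations in the window
sum(count_over_time(probe_success[1h]))
```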

API Changes

Protobuf (proto/objectives/v1alpha1/objectives.proto):

  • Added burn_rate_type field to Objective message
  • Values: "static" (default), "dynamic"
  • Full end-to-end transmission from CRD → Backend → API → UI

UI Changes

Core Components:

  • List Page (ui/src/List.tsx): Added "Burn Rate" column with sortable badges
  • Detail Page (ui/src/Detail.tsx): Added burn rate type badge with traffic context
  • Alerts Table (ui/src/AlertsTable.tsx): Added "Error Budget Consumption" column
  • Threshold Display (ui/src/components/BurnRateThresholdDisplay.tsx): Real-time dynamic threshold calculation
  • Burn Rate Graph (ui/src/components/BurnrateGraph.tsx): Dynamic threshold visualization

User Experience Enhancements:

  • Visual Indicators: Green "Dynamic" badges vs gray "Static" badges with appropriate icons
  • Enhanced Tooltips: Context-aware explanations showing traffic impact on alert sensitivity
  • Real-Time Calculations: Live threshold values instead of placeholder text
  • Traffic Context: Shows current traffic ratio and above/below average status
  • Error Handling: Graceful degradation for missing metrics with meaningful error messages

Backend Alert Rules:

  • Alert rules optimized to use recording rules for SLO window calculation
  • Reduces Prometheus evaluation load (rules evaluated every 30s)
  • Maintains accuracy while improving performance

Testing Evidence

Mathematical Validation

Core Concept: Dynamic burn rates maintain consistent absolute error budget consumption regardless of traffic volume.

Mathematical Proof:

Alert fires when: error_rate > (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Since error_rate = errors / N_alert, we can substitute:

errors / N_alert > (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Multiply both sides by N_alert:

errors > N_SLO × E_budget_percent × (1 - SLO_target)

Since N_SLO × (1 - SLO_target) = E_budget (absolute error budget for SLO period):

errors > E_budget_percent × E_budget

Result: The N_alert terms cancel out! Alerts fire at the same absolute error count regardless of traffic.

Example Validation:

Given:

  • SLO target: 99% (so 1 - SLO_target = 0.01)
  • E_budget_percent: 0.02 (2% of error budget per alert window)
  • N_SLO (30d): 1,000,000 requests
  • E_budget (absolute): 1,000,000 × 0.01 = 10,000 errors allowed in 30d

High Traffic Scenario:

  • N_alert (1h): 10,000 requests
  • Traffic Ratio: 1,000,000 / 10,000 = 100x
  • Dynamic Threshold: 100 × 0.02 × 0.01 = 0.02 (2%)
  • Absolute errors needed: 10,000 × 0.02 = 200 errors

Low Traffic Scenario:

  • N_alert (1h): 1,000 requests
  • Traffic Ratio: 1,000,000 / 1,000 = 1,000x
  • Dynamic Threshold: 1,000 × 0.02 × 0.01 = 0.2 (20%)
  • Absolute errors needed: 1,000 × 0.2 = 200 errors

Same absolute threshold (200 errors), vastly different error rate thresholds (2% vs 20%)!

Benefits:

  • Prevents false positives: During low traffic, 20 errors out of 1,000 (2%) won't alert because threshold is 20%
  • Maintains sensitivity: During high traffic, 200 errors out of 10,000 (2%) will alert because threshold is 2%
  • Consistent behavior: Always alerts when 200 errors occur (2% of the 10,000 error budget)

Validation Results:

  • ✅ Window scaling correctly adapts to different SLO periods (28d → 30d)
  • ✅ Recording rules use appropriate PromQL functions (rate, increase)
  • ✅ Alert thresholds correctly implement dynamic formula
  • ✅ E_budget_percent thresholds correctly map from static factors (14→1/48, 7→1/16, 2→1/14, 1→1/7)
  • ✅ Multi-window logic uses consistent traffic scaling
  • ✅ All indicator types (ratio, latency, latencyNative, boolGauge) validated
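The factor-to-budget mapping above can be spot-checked directly (a sketch, not project code; the constants match those that appear in the generated rules):

```python
from fractions import Fraction

# Static factor -> E_budget_percent mapping (constants, independent of SLO period)
FACTOR_TO_BUDGET_PERCENT = {
    14: Fraction(1, 48),
    7: Fraction(1, 16),
    2: Fraction(1, 14),
    1: Fraction(1, 7),
}

# The 0.020833 constant seen in generated alert expressions is 1/48 rounded:
assert round(float(FACTOR_TO_BUDGET_PERCENT[14]), 6) == 0.020833
assert round(float(FACTOR_TO_BUDGET_PERCENT[7]) * 100, 2) == 6.25   # 6.25% per window
assert round(float(FACTOR_TO_BUDGET_PERCENT[2]) * 100, 2) == 7.14   # 7.14%
assert round(float(FACTOR_TO_BUDGET_PERCENT[1]) * 100, 2) == 14.29  # 14.29%
```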

UI Regression Testing

Test Environment:

  • 16 SLOs total (4 static, 12 dynamic)
  • Multiple indicator types tested
  • Both working and broken metrics scenarios
  • Minikube cluster with kube-prometheus stack

Regression Testing Results:

  • Zero regressions found - All original Pyrra functionality preserved
  • ✅ Static SLO behavior identical to baseline (except intentional enhancements)
  • ✅ 6 intentional new features successfully integrated
  • ✅ No visual glitches, layout issues, or console errors
  • ✅ Mixed static/dynamic environment stable

Production Build Validation:

  • ✅ All production build tests passed
  • ✅ Critical missing metrics fixes working perfectly (no white page crash)
  • ✅ All indicator types working correctly
  • ✅ Graceful error handling for missing/broken metrics
  • ✅ Performance acceptable (< 3 seconds page load)

Alert Firing Validation

Validated Results:

  • ✅ Synthetic traffic generation working (20 req/sec with configurable error rate)
  • ✅ Alert state transitions detected: inactive → pending → firing
  • ✅ Both static and dynamic alerts fire correctly
  • ✅ Dynamic alerts demonstrate improved sensitivity

Browser Compatibility Testing

Browsers Tested:

  • ✅ Chrome (primary development browser)
  • ✅ Firefox (full compatibility confirmed)
  • ⚠️ Edge (not tested - assumed compatible as Chromium-based)

Graceful Degradation Testing:

  • ✅ Network throttling: Proper loading states and retry logic
  • ✅ API failures: Meaningful error messages
  • ✅ Prometheus unavailability: Graceful fallback displays
  • ✅ Missing metrics: No crashes, appropriate error states

Breaking Changes

None. This feature is completely opt-in and backward compatible:

  • Default behavior: burnRateType: static (existing behavior)
  • Existing SLOs: Continue working unchanged
  • Migration: Add burnRateType: dynamic to enable new behavior
  • No schema changes that break existing deployments

Migration Guide

Enabling Dynamic Burn Rates

For new SLOs, add burnRateType: dynamic to the alerting section:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: my-service-slo
spec:
  target: "99"
  window: 28d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{code=~"5.."}
      total:
        metric: http_requests_total
  alerting:
    name: MyServiceErrorBudgetBurn
    burnRateType: dynamic  # Add this line
    burnrates: true

For existing SLOs, edit the SLO YAML and add burnRateType: dynamic:

kubectl edit slo my-service-slo -n monitoring

Validation

After enabling dynamic burn rates:

  1. Check Prometheus Rules: Verify dynamic expressions generated

    kubectl get prometheusrule -n monitoring
  2. Check UI: Verify green "Dynamic" badge appears on SLO list page

  3. Check Thresholds: Verify calculated threshold values display in alerts table

  4. Monitor Alerts: Observe alert behavior with traffic variations

Rollback

To revert to static burn rates:

alerting:
  burnRateType: static  # Change back to static

Or remove the field entirely (defaults to static).

Examples

Four comprehensive examples are provided in the examples/ directory:

  1. examples/dynamic-burn-rate-ratio.yaml - Ratio indicator (API success rate)
  2. examples/dynamic-burn-rate-latency.yaml - Latency indicator (histogram-based)
  3. examples/dynamic-burn-rate-latency-native.yaml - Native histogram latency
  4. examples/dynamic-burn-rate-bool-gauge.yaml - Boolean gauge (availability)

Each example includes:

  • Clear comments explaining use cases
  • Proper metric selectors
  • Recommended configuration values
  • Traffic-aware alerting benefits

Design Decisions

1. Opt-In Feature (Not Default)

Decision: Dynamic burn rates require explicit burnRateType: dynamic configuration

Rationale:

  • Preserves existing behavior for current users
  • Allows gradual adoption and testing
  • Reduces risk of unexpected alert behavior changes
  • Users can evaluate feature before full deployment

2. Latency Indicator Label Selector

Decision: Always add le="" label selector when querying latency recording rules

Rationale:

  • Latency indicators create TWO recording rules: total (le="") and success (le="0.1")
  • Without le="", sum() aggregation includes BOTH rules (2x traffic)
  • Explicit le="" selector ensures only total traffic is counted
  • Critical for accurate dynamic threshold calculation
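A hypothetical concrete case (the recording-rule name below is illustrative):

```promql
# Suppose a latency SLO produces two recording-rule series:
#   http_request_duration_seconds:increase4w{slo="my-slo", le=""}     <- total requests
#   http_request_duration_seconds:increase4w{slo="my-slo", le="0.1"}  <- requests under the latency target

# Without le="": sums BOTH series, double-counting traffic
sum(http_request_duration_seconds:increase4w{slo="my-slo"})

# With le="": counts only total traffic, as the dynamic threshold requires
sum(http_request_duration_seconds:increase4w{slo="my-slo", le=""})
```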

3. Error Handling Strategy

Decision: Graceful degradation with fallback displays instead of crashes

Rationale:

  • Production environments may have missing or misconfigured metrics
  • Users need visibility into SLO configuration even with data issues
  • Fallback to "Traffic-Aware" or "No data" better than white page crash
  • Console warnings help debugging without breaking UI

Documentation

User-Facing Documentation

Updated Files:

  • README.md - Added dynamic burn rate feature section
  • examples/README.md - Added dynamic SLO examples with explanations
  • examples/*.yaml - Four comprehensive example configurations

Development Documentation (Fork Only)

Comprehensive Internal Documentation (40+ documents in .dev-docs/):

  • Implementation summaries and session notes
  • Testing procedures and validation reports
  • Mathematical correctness validation
  • Performance benchmarks and optimization analysis
  • Browser compatibility matrices
  • Migration guides and troubleshooting
  • Development workflow and standards

Development Tools (Fork Only - cmd/ directory):

  • Query performance validation tools
  • Threshold calculation testing tools
  • Alert rule validation tools
  • Recording rule validation tools
  • Synthetic metric generation for testing
  • Performance monitoring tools

References

Methodology:

  • "Error Budget Is All You Need - Part 2" (author's blog post describing this dynamic burn rate approach; see Overview)

Before/After Examples

Example 1: Static vs Dynamic Threshold Comparison

Scenario: API service with 99% SLO target, 30d window

Static Burn Rate (Factor 14):

Threshold = 14 × (1 - 0.99) = 0.14 (14% error rate)
  • Same threshold regardless of traffic
  • 14% error rate required to trigger alert
  • Does not adapt to traffic patterns

Dynamic Burn Rate (High Traffic):

N_SLO (30d): 1,000,000 requests
N_alert (1h): 10,000 requests
Traffic Ratio: 100x
E_budget_percent: 0.02 (2% of error budget)

Threshold = 100 × 0.02 × 0.01 = 0.02 (2% error rate)
Absolute errors needed = 10,000 × 0.02 = 200 errors
  • Lower threshold percentage (2%)
  • Same absolute errors needed (200 errors)

Dynamic Burn Rate (Low Traffic):

N_SLO (30d): 1,000,000 requests
N_alert (1h): 1,000 requests
Traffic Ratio: 1,000x
E_budget_percent: 0.02 (2% of error budget)

Threshold = 1,000 × 0.02 × 0.01 = 0.2 (20% error rate)
Absolute errors needed = 1,000 × 0.2 = 200 errors
  • Higher threshold percentage (20%)
  • Same absolute errors needed (200 errors)
  • Prevents false positives from small sample sizes

Key Insight: Both scenarios require the same absolute number of errors (200), but the error rate thresholds differ dramatically (2% vs 20%).
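The comparison can be reproduced with a short sketch (illustrative Python, not the project's Go implementation):

```python
def static_threshold(factor, slo_target):
    # Fixed multiplier, independent of traffic
    return factor * (1 - slo_target)

def dynamic_threshold(n_slo, n_alert, e_budget_percent, slo_target):
    # Traffic-aware: (N_SLO / N_alert) * E_budget_percent * (1 - SLO_target)
    return (n_slo / n_alert) * e_budget_percent * (1 - slo_target)

TARGET = 0.99
assert round(static_threshold(14, TARGET), 2) == 0.14  # 14% regardless of traffic

high = dynamic_threshold(1_000_000, 10_000, 0.02, TARGET)  # high-traffic hour
low = dynamic_threshold(1_000_000, 1_000, 0.02, TARGET)    # low-traffic hour
assert round(high, 2) == 0.02 and round(low, 2) == 0.2     # 2% vs 20%
assert round(10_000 * high) == round(1_000 * low) == 200   # same 200 absolute errors
```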

Example 2: UI Display Comparison

Static SLO - List Page:

┌─────────────────────────────────────────────────┐
│ Name: apiserver-requests-static                 │
│ Burn Rate: [Static] 🔒                          │
│ Availability: 99.95%                            │
│ Budget: 95.2%                                   │
└─────────────────────────────────────────────────┘

Dynamic SLO - List Page:

┌─────────────────────────────────────────────────┐
│ Name: apiserver-requests-dynamic                │
│ Burn Rate: [Dynamic] 👁                         │
│ Availability: 99.95%                            │
│ Budget: 95.2%                                   │
└─────────────────────────────────────────────────┘

Static SLO - Alerts Table:

┌──────────┬──────────┬────────────┬────────┬───────────┐
│ Severity │ Exhaust  │ Factor     │ Thresh │ Short     │
├──────────┼──────────┼────────────┼────────┼───────────┤
│ critical │ 2d       │ 14         │ 0.140  │ 0.123     │
│ critical │ 6d       │ 7          │ 0.070  │ 0.456     │
│ warning  │ 12d      │ 2          │ 0.020  │ 0.789     │
│ warning  │ 30d      │ 1          │ 0.010  │ 0.234     │
└──────────┴──────────┴────────────┴────────┴───────────┘

Dynamic SLO - Alerts Table:

┌──────────┬──────────┬────────────┬────────┬───────────┐
│ Severity │ Exhaust  │ Error Bdgt │ Thresh │ Short     │
├──────────┼──────────┼────────────┼────────┼───────────┤
│ critical │ 2d       │ 2.08%      │ 0.0046 │ 0.123     │
│ critical │ 6d       │ 6.25%      │ 0.0137 │ 0.456     │
│ warning  │ 12d      │ 7.14%      │ 0.0156 │ 0.789     │
│ warning  │ 30d      │ 14.29%     │ 0.0313 │ 0.234     │
└──────────┴──────────┴────────────┴────────┴───────────┘

Tooltip Comparison:

Static SLO Tooltip:

Static Burn Rate
Uses fixed multipliers (14x, 7x, 2x, 1x) for alert thresholds.
Threshold = 14 × (1 - 0.99) = 0.140

Dynamic SLO Tooltip:

Dynamic Burn Rate
Adapts thresholds based on traffic patterns.

Error Budget: 2.08% (burns 2.08% of budget per alert window)
Traffic Ratio: 100x (high traffic)
Dynamic Threshold: 0.0208 (≈2%) - Adapts to traffic
Static Threshold: 0.14 (14%) - Same threshold regardless of traffic

Formula: (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)

Example 3: Alert Rule Comparison

Static Alert Rule (PrometheusRule):

- alert: ApiserverRequestsStaticErrorBudgetBurn
  expr: |
    (
      apiserver_request:burnrate5m{slo="apiserver-requests-static"} > (14 * (1 - 0.99))
      and
      apiserver_request:burnrate1h4m{slo="apiserver-requests-static"} > (14 * (1 - 0.99))
    )
  labels:
    severity: critical
    long: 1h4m
    short: 5m

Dynamic Alert Rule (PrometheusRule):

- alert: ApiserverRequestsDynamicErrorBudgetBurn
  expr: |
    (
      apiserver_request:burnrate5m{slo="apiserver-requests-dynamic"} > 
      scalar((sum(apiserver_request:increase30d{slo="apiserver-requests-dynamic"}) / 
              sum(increase(apiserver_request_total{verb="GET"}[1h4m]))) * 0.020833 * (1 - 0.99))
      and
      apiserver_request:burnrate1h4m{slo="apiserver-requests-dynamic"} > 
      scalar((sum(apiserver_request:increase30d{slo="apiserver-requests-dynamic"}) / 
              sum(increase(apiserver_request_total{verb="GET"}[1h4m]))) * 0.020833 * (1 - 0.99))
    )
  labels:
    severity: critical
    long: 1h4m
    short: 5m

Key Differences:

  1. Static uses fixed multiplier (14)
  2. Dynamic calculates traffic ratio (N_SLO / N_alert)
  3. Dynamic uses E_budget_percent (0.020833 = 1/48)
  4. Dynamic uses recording rules for SLO window (optimized)
  5. Both use same burn rate recording rules for error rate

yairst added 30 commits August 22, 2025 19:10
- Add DynamicBurnRate type and configuration
- Implement GetRemainingErrorBudget calculation
- Clean up redundant code in rules.go
- Improve Windows() function readability
- Add dynamic factor scaling based on remaining error budget
- Add space before inline comments
- Remove trailing whitespace
- Normalize newlines in functions
- Improve code readability
Add dynamic burn rate calculation that uses error budget percentages:
- 1/48 (2.08%) per hour (50% per day)
- 1/16 (6.25%) per 6h (100% per 4 days)
- 1/14 (7.14%) per day
- 1/7 (14.28%) per 4 days

Implements dynamic burn rate calculation formula:
(increase[slo_window] / increase[alert_window]) * error_budget_percent

The implementation preserves existing window periods while adding
proper error budget burn percentages for more accurate alerting.
- Add core dynamic burn rate logic for Ratio indicators
- Implement buildAlertExpr() and buildDynamicAlertExpr() methods
- Add dynamic threshold calculation: (N_SLO/N_alert) × E_budget_percent_threshold × (1-SLO_target)
- Support traffic-aware alerting with proper PromQL generation
- Maintain backward compatibility with static burn rate as default
- Add comprehensive unit tests for both static and dynamic modes
- Update test expectations to reflect 'static' as default BurnRateType

Dynamic burn rate adapts alert thresholds to traffic volume:
- Higher traffic periods get proportionally higher thresholds
- Lower traffic periods get lower thresholds
- Uses increase() functions for event counting over SLO and alert windows
- Properly handles metric selectors and label matchers for error/total metrics

All tests passing. Ready for extension to other indicator types.
… summary

- Create comprehensive sli-indicator-types.md explaining all four indicator types
- Document purpose, use cases, and burn rate calculations for each type
- Explain why different indicator types exist and how they map to different metric formats
- Update FEATURE_IMPLEMENTATION_SUMMARY.md with latest implementation status
- Add clarifications about E_budget_percent_threshold being constants
- Document current capabilities and remaining work priorities
- Explain implementation strategy for indicator type support
- Fix 'Burn Rate Calculation' → 'Error Rate Calculation' in SLI indicator types
- Update success criteria to reflect completed work:
  - API support: ✅ Complete
  - Dynamic alert thresholds: ✅ Complete (for Ratio indicators)
  - Traffic adaptation: ✅ Complete (for Ratio indicators)
  - Backward compatibility: ✅ Complete
  - Documentation: ✅ Complete
  - Performance validation: ✅ Complete (in tests)
- Update status to reflect Priority 1 completion
Core Implementation:
- Extended buildDynamicAlertExpr() to support Latency indicators
- Updated Burnrates() method for Latency case to use dynamic windows
- Added helper methods buildTotalSelector() and buildLatencyTotalSelector()

⚡ Performance Optimization:
- Both Ratio and Latency indicators now use recording rules for efficiency
- Alert expressions use pre-computed burn rates + dynamic threshold calculation
- Significantly reduces Prometheus evaluation load vs inline calculations

Comprehensive Testing:
- Added TestObjective_DynamicBurnRate_Latency() test
- Extended TestObjective_buildAlertExpr() with Latency test cases
- Updated test expectations for optimized recording rule usage
- All tests pass with new implementation

Current Support Status:
- ✅ Ratio Indicators: Full dynamic burn rate support
- ✅ Latency Indicators: Full dynamic burn rate support (NEW)
- ⏳ LatencyNative & BoolGauge: Fall back to static (TODO)

Examples & Documentation:
- Added examples/latency-dynamic-burnrate.yaml with practical configs
- Updated feature implementation summary and SLI documentation
- Documented performance improvements and implementation approach

The implementation is production-ready and maintains full backward compatibility.
- Add CORE_CONCEPTS_AND_TERMINOLOGY.md with authoritative definitions for:
  * Error Rate, Error Budget, Burn Rate concepts
  * Static vs Dynamic burn rate threshold differences
  * Traffic scaling factor (N_SLO / N_alert) explanation
  * False positive/negative prevention mechanisms
  * Mathematical relationships and PromQL patterns

- Update FEATURE_IMPLEMENTATION_SUMMARY.md to reference core concepts doc
- Streamline implementation summary to focus on status and progress
- Establish single source of truth for conceptual understanding

These docs capture the corrected understanding of dynamic burn rate
concepts for future code review sessions and development work.
Core Implementation Fixes:
- Fix multi-window logic to use N_long for both windows (consistent traffic scaling)
- Remove unused dynamicBurnRateExpr() function (code cleanup)
- Fix DynamicWindows() to use scaled periods from Windows(sloWindow)
- Map E_budget_percent_thresholds by static factor hierarchy (14→1/48, etc.)

Key Behavioral Corrections:
- Both short and long windows now use N_long denominator for traffic scaling
- Window periods properly scale with any SLO duration via Windows() function
- E_budget_percent_thresholds remain constant across SLO period choices
- Window.Factor correctly serves as E_budget_percent_threshold in dynamic mode

Documentation Updates:
- Add multi-window logic explanation to CORE_CONCEPTS_AND_TERMINOLOGY.md
- Add Window.Factor dual purpose design documentation
- Add window period scaling details and architectural insights
- Update FEATURE_IMPLEMENTATION_SUMMARY.md with recent fixes
- Correct formula from (N_SLO / N_alert) to (N_SLO / N_long)

All tests pass including TestObjective_DynamicBurnRate and TestObjective_DynamicBurnRate_Latency.
Mathematical implementation now correctly matches the expected dynamic burn rate formula.
✅ Code Review Completed - Production Ready Status
- Updated FEATURE_IMPLEMENTATION_SUMMARY.md with code review completion
- Confirmed production readiness for Ratio & Latency indicators
- Added edge case handling validation results
- Updated PromQL examples to show recording rule implementation
- Documented comprehensive test coverage completion

✅ Session Continuation Updates
- Updated SESSION_CONTINUATION_PROMPT.md with correct status
- Removed non-existent compilation error references
- Added production readiness confirmation
- Updated next priority tasks for remaining indicator types

Status: Dynamic burn rate implementation for Ratio & Latency indicators
is production-ready and fully validated through comprehensive code review.
…ypes

- Extend dynamic burn rate support to LatencyNative and BoolGauge indicators
- Add buildLatencyNativeTotalSelector() and buildBoolGaugeSelector() helper methods
- Implement traffic-aware expressions for native histograms and boolean gauges
- Add dynamic window logic to LatencyNative and BoolGauge cases in Burnrates()
- Replace hardcoded alert expressions with unified buildAlertExpr() method
- Add comprehensive test coverage for all indicator types
- All backend dynamic burn rate logic now complete and production-ready

Backend implementation status:
✅ Ratio indicators - Dynamic burn rate complete
✅ Latency indicators - Dynamic burn rate complete
✅ LatencyNative indicators - Dynamic burn rate complete
✅ BoolGauge indicators - Dynamic burn rate complete

Next: UI integration and Grafana dashboard updates (future sessions)
- Create new prompts/ folder for session organization
- Move existing session prompts from .dev-docs/ to prompts/
- Add NEXT_SESSION_PROMPT.md focused on React UI integration
- Add prompts/README.md documenting session strategy

Next session focus:
- React UI integration for BurnRateType selection
- Update SLO forms to support dynamic burn rate configuration
- Backend implementation complete and ready for frontend work
- NEXT_SESSION_PROMPT.md → UI_INTEGRATION_SESSION_PROMPT.md
- SESSION_CONTINUATION_PROMPT.md → BACKEND_COMPLETION_SESSION_PROMPT.md
- Update README.md with new prompt names and usage guide

Better organization:
- Clear indication of session purpose and focus area
- Active vs completed session status
- Easy identification of which prompt to use next
- Add burn rate type display system with color-coded badges
- Implement burn rate column in SLO list with sorting and visibility controls
- Add burn rate information section to SLO detail pages
- Create TypeScript infrastructure with BurnRateType enum and utilities
- Add dynamic/static icons for visual distinction
- Implement responsive design with tooltips and accessibility
- Create demo SLO configurations for testing
- Add comprehensive UI documentation
- Update feature implementation status
- Prepare next session prompt for API integration

The UI foundation is now complete with mock detection logic.
Next phase: API integration to eliminate mock data and connect
to actual backend burn rate type field.
✅ All 5 core tasks completed:

1. Added Alerting message with burnRateType field to protobuf schema
2. Updated Go conversion functions (ToInternal/FromInternal) in objectives.go
3. Regenerated TypeScript protobuf definitions and implementations
4. Replaced mock detection logic with real API field access in burnrate.tsx
5. Validated end-to-end API integration with comprehensive testing

Technical Implementation:
- Protobuf: Added Alerting message with string burn_rate_type field
- Go: Complete bidirectional conversion between internal structs and protobuf
- TypeScript: Manual updates for Windows compatibility with proper interfaces
- Frontend: Real API field access (objective.alerting?.burnRateType)
- Testing: Round-trip validation for both 'dynamic' and 'static' types

Status: API Integration Complete - Production Ready
Next: Priority 2 Alert Display Updates
yairst added 29 commits October 8, 2025 20:49
…mic burn rates

- Verified all 5 generic rules work identically for static and dynamic SLOs
- Confirmed Grafana dashboards display both SLO types correctly without modifications
- Validated error budget calculations use same formula for both types
- Tested list and detail dashboards with mixed static/dynamic SLOs
- Documented pre-existing Rate graph query bug (unrelated to feature)
- Created comprehensive validation session document
- Result: NO CHANGES NEEDED - dashboards work perfectly with dynamic SLOs
- Analyzed BurnRateThresholdDisplay implementation (uses raw metrics)
- Validated recording rules provide 40x speedup for ratio indicators
- Created validation tools for performance testing
- Documented optimization strategy and performance benchmarks
- Created sub-tasks 7.10.1-7.10.4 for implementation phase

Analysis documents:
- TASK_7.10_UI_QUERY_OPTIMIZATION_ANALYSIS.md - Full analysis
- TASK_7.10_VALIDATION_RESULTS.md - Performance benchmarks
- TASK_7.10_COMPLETION_SUMMARY.md - Phase 1 summary

Validation tools:
- cmd/validate-ui-query-optimization - Performance comparison
- cmd/test-burnrate-threshold-queries - Query validation

Key findings:
- Ratio indicators: 694ms -> 17ms (40x speedup potential)
- Latency indicators: 43ms -> 26ms (1.7x speedup potential)
- Recording rules exist but UI doesn't use them yet
- Implementation will happen in sub-tasks 7.10.1-7.10.4
- Fixed test queries to use only SLO window recording rules (not alert windows)
- Added statistical rigor: 10 iterations per query with min/max/avg analysis
- Added BoolGauge indicator testing (all three types now covered)
- Executed tests and documented real performance measurements:
  * Ratio: 7.17x speedup (48.75ms -> 6.80ms)
  * Latency: 2.20x speedup (6.34ms -> 2.89ms)
  * BoolGauge: No benefit (already fast at 3ms)
- Clarified terminology: SLO window vs alert windows
- Key finding: Only SLO window has increase/count recording rules
- Updated task 7.10.2 with validation findings and implementation guide
- Created comprehensive implementation guide for next task
- Add hybrid query approach: recording rules for SLO window + inline for alert windows
- Implement getBaseMetricName() to strip metric suffixes for recording rule naming
- Implement getTrafficRatioQueryOptimized() for optimized query generation
- Optimize ratio indicators (7.17x query speedup) and latency indicators (2.20x speedup)
- Skip boolGauge optimization (already fast at 3ms)
- Fix performance monitoring bug (was showing 677s due to never-reset timer)
- Primary benefit: Prometheus load reduction, not UI speed (network overhead dominates)
- Maintains backward compatibility with fallback to raw metrics

Task 7.10.2 complete
- Add references to TASK_7.10_VALIDATION_RESULTS.md, TASK_7.10_IMPLEMENTATION.md, and TASK_7.10.1_TEST_IMPROVEMENTS.md
- Include key findings from 7.10.2: network overhead dominates, main benefit is Prometheus load reduction
- Update 7.10.4 to reflect validation already completed in 7.10.2
- Clarify that optimization provides minimal UI benefit but significant infrastructure benefit
- Created comprehensive decision document analyzing backend optimization
- Documented current implementation vs optimized pattern
- Calculated performance benefits: 7x for ratio, 2x for latency indicators
- Production impact: ~1.77M seconds/year saved for ratio indicators at scale
- Decision: IMPLEMENT optimization (primary benefit: Prometheus load reduction)
- Added Task 7.10.5 to implementation plan for backend optimization
- Updated feature implementation summary with Task 7.10.3 completion

Key findings:
- Alert rules evaluated every 30s (different profile than UI on-demand queries)
- Main benefit is infrastructure load reduction, not alert evaluation speed
- Hybrid approach: recording rule for SLO window + inline for alert windows
- Consistent with UI implementation (Task 7.10.2)
- Priority: MEDIUM-HIGH, implement after Task 7.10.4

References:
- .dev-docs/TASK_7.10.3_BACKEND_OPTIMIZATION_DECISION.md
- .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md
- .kiro/specs/dynamic-burn-rate-completion/tasks.md
- Added getBaseMetricName() helper function to strip metric suffixes
- Updated buildDynamicAlertExpr() for ratio indicators to use hybrid approach (recording rules for SLO window)
- Updated buildDynamicAlertExpr() for latency indicators to use hybrid approach
- Skipped boolGauge optimization (already fast, no benefit)
- Fixed UI regression: BurnRateThresholdDisplay now uses actual SLO window instead of hardcoded 30d
- This fixes 'no data available' issue for synthetic SLOs with 1d window
- Backend optimization provides 7x speedup for ratio, 2x for latency indicators
- Primary benefit: Prometheus CPU/memory load reduction at scale
- Fixed critical latency threshold bug (2x traffic counting)
  - Added le="" label selector in UI and backend for latency indicators
  - Prevents summing both total and success recording rules
- Fixed BurnrateGraph to show dynamic thresholds over time
  - Changed from instant to range queries for traffic calculation
- Fixed React console warnings
  - Toggle: Added readOnly attribute
  - Detail.tsx: Fixed duplicate keys
  - AlertsTable: Added Fragment keys
  - DurationGraph: Added null checks
- Removed debug logging from production code
- Validated all indicator types (Ratio, Latency, BoolGauge, LatencyNative)
- Updated documentation and steering standards
- Created SLO generator tool with window variation (7d, 28d, 30d)
- Created performance monitoring tool for metrics collection
- Created automated test script for health checks and validation
- Generated 50 test SLOs ready for scale testing
- Consolidated documentation into TASK_7.11_TESTING_INFRASTRUCTURE.md
- Created TASK_7.12_MANUAL_TESTING_GUIDE.md for interactive testing
- Updated tasks.md with proper task structure and references
- Cleaned up redundant documentation files
- Executed baseline performance test with 16 current SLOs
- Applied and tested 50 additional SLOs (medium scale: 66 total)
- Applied and tested 100 additional SLOs (large scale: 116 total)
- Collected comprehensive performance metrics (API response time, memory usage, Prometheus query performance)
- Created PRODUCTION_PERFORMANCE_BENCHMARKS.md with detailed analysis
- Key findings: Sub-linear API scaling, near-constant memory usage, stable Prometheus performance
- Production readiness assessment: READY
- Updated gitignore to exclude temporary test binaries and JSON metrics files
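The threshold math and the le="" traffic query described above can be sketched as follows. This is a minimal illustration under assumed names (`dynamicThreshold`, `latencyTrafficQuery` are hypothetical helpers, not the actual `slo/rules.go` API); the real `buildDynamicAlertExpr()` emits a full PromQL alert expression and uses recording rules for the SLO-window count.

```go
package main

import "fmt"

// dynamicThreshold implements the PR's formula:
//   threshold = (N_SLO / N_alert) * E_budget_percent * (1 - target)
// nSLO and nAlert are request counts over the SLO window and the alert's
// long window; eBudget is the error budget fraction the alert may consume
// (1/48, 1/16, 1/14, or 1/7, depending on the window pair).
func dynamicThreshold(nSLO, nAlert, eBudget, target float64) float64 {
	return (nSLO / nAlert) * eBudget * (1 - target)
}

// latencyTrafficQuery sketches the traffic query for latency indicators:
// the le="" selector keeps only the histogram's base _count series, so the
// bucket-filtered success recording rule is not summed in a second time
// (the 2x traffic-counting bug fixed above).
func latencyTrafficQuery(metric, window string) string {
	return fmt.Sprintf(`sum(increase(%s_count{le=""}[%s]))`, metric, window)
}

func main() {
	// 99% target, fast-burn window pair (E_budget = 1/48), with roughly
	// 667x more traffic in the SLO window than in the 1h alert window.
	fmt.Println(dynamicThreshold(1_000_000, 1_500, 1.0/48, 0.99))
	fmt.Println(latencyTrafficQuery("http_request_duration_seconds", "1h"))
}
```

Note how the traffic ratio scales the threshold: more traffic in the alert window (larger `nAlert`) lowers the threshold, fewer requests raise it, so the alert always fires at the same absolute error count.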
…ul degradation

- Tested Chrome and Firefox (both PASS - identical behavior)
- Tested graceful degradation: network throttling, API failures, Prometheus unavailability (all PASS)
- Tested migration: static to dynamic, rollback, backward compatibility (all PASS)
- Created browser compatibility matrix with test results and recommendations
- Created comprehensive migration guide (validated during testing)
- Discovered and documented 3 issues (1 HIGH severity, 2 LOW severity)
- Created Task 7.12.1 for critical bug fix (white page crash for missing metrics)

Deliverables:
- .dev-docs/BROWSER_COMPATIBILITY_MATRIX.md - Complete test results
- .dev-docs/MIGRATION_GUIDE.md - Migration instructions and best practices
- .dev-docs/TASK_7.12_TESTING_COMPLETION_SUMMARY.md - Testing summary

Production readiness: Ready for environments with reliable metrics. Fix Task 7.12.1 before deploying to environments with potentially missing metrics.
- Fix BurnrateGraph white page crash for dynamic SLOs with missing metrics
  - Add comprehensive null/undefined checks before Array.from() calls
  - Wrap dynamic threshold calculation in try-catch for graceful error handling
  - Fallback to static threshold when traffic data is missing/broken
  - Add console warnings for debugging

- Fix Detail page showing 100% instead of 'No data' for missing metrics
  - Change default values from errors=0, total=1 to undefined
  - Tiles now correctly display 'No data' (consistent with main page)

- Update documentation with fix details and testing coverage
- Performed systematic regression testing against upstream-comparison branch
- Validated production build with all recent fixes (Task 7.12.1)
- Zero regressions found - all original Pyrra functionality preserved
- All 4 production build tests passed successfully
- Tested 16 SLOs (4 static, 12 dynamic) in mixed environment

Regression Testing Results:
- Static SLO behavior identical to baseline (except intentional enhancements)
- 6 intentional new features successfully integrated
- No visual glitches, layout issues, or console errors
- Auto-reload confirmed as original Pyrra behavior (not a regression)

Production Build Validation:
- Critical Task 7.12.1 fixes working perfectly (no white page crash)
- All indicator types working correctly (ratio, latency, latencyNative, boolGauge)
- Graceful error handling for missing/broken metrics
- Performance acceptable (< 3 seconds page load)
- 1 minor cosmetic issue found (false console warning - not blocking)

Key Findings:
- Backend service required for proper burn rate type detection
- Mixed static/dynamic environment stable and working correctly
- Feature is production ready for upstream contribution

Documentation:
- Created comprehensive test results document
- Created step-by-step testing procedure guide
- Created quick reference checklist
- Updated feature implementation summary

Status: PRODUCTION READY - Zero blockers, ready for upstream contribution
- Restructured Task 8 to focus on upstream integration (fetch/merge, file organization, production docs, PR description)
- Streamlined Task 9 to reference existing validation work (Tasks 1-7 already complete)
- Added UPSTREAM_CONTRIBUTION_PLAN.md with file organization strategy and PR preparation guide
- Emphasized keeping production documentation updates concise and proportional
- Removed duplicate testing/documentation tasks already completed in Tasks 1-7
- Added Task 8.0 as mandatory pre-merge cleanup step (must do before Task 8.1)
- Created comprehensive cleanup checklist in TASK_8.0_PRE_MERGE_CLEANUP_CHECKLIST.md
- Addresses manual code review findings:
  - Revert unintended changes (CONTRIBUTING.md, deployment manifests, index.html, etc.)
  - Move examples from .dev/ to examples/
  - Backend code cleanup (remove duplicates in slo/rules.go, unused code in slo/slo.go)
  - CRD cleanup (remove redundant variables)
  - Test file review and decisions
  - UI code review (Toggle.tsx, old docs)
  - Investigate filesystem.go changes and determine testing needs
  - Investigate proto changes
- Updated Task 9.3 to reference filesystem mode testing decision from Task 8.0
- Updated UPSTREAM_CONTRIBUTION_PLAN.md timeline to include Task 8.0
- Reverted unintended changes (pyrra-kubernetesDeployment.yaml, ui/public/index.html, filesystem.go)
- Removed unused code (~47 lines from slo/slo.go and CRD types)
- Updated comment format in slo/rules.go
- Moved ui/DYNAMIC_BURN_RATE_UI.md to .dev-docs/HISTORICAL_UI_DESIGN.md
- Updated CONTRIBUTING.md with ui/README.md reference
- Clarified architecture: test metric only needed in API server (main.go)
- Created comprehensive cleanup documentation

All tests passing, code compiles successfully.
- Add 4 dynamic burn rate examples (ratio, latency, latencyNative, boolGauge)
- Create concise examples/README.md (~70 lines, comparable to upstream)
- Use real metrics from actual services (apiserver, prometheus, pyrra)
- Minimal comments consistent with existing examples
- All examples verified showing actual data in Pyrra UI
- Delete redundant latency-dynamic-burnrate.yaml and simple-demo.yaml
- Add Task 8.5 for regex label selector investigation
- Update documentation to reflect 4 examples

Task 8.2 complete
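A hedged sketch of what one of these ratio examples might look like. The field name and placement of the new burn rate option (`burnRateType` directly under `spec`) are assumptions based on this PR's CRD description, and the metric selectors are illustrative:

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: apiserver-read-requests
  namespace: monitoring
spec:
  target: "99"
  window: 28d
  # Assumed field from this PR's CRD change; defaults to "static".
  burnRateType: dynamic
  indicator:
    ratio:
      errors:
        metric: apiserver_request_total{verb=~"LIST|GET",code=~"5.."}
      total:
        metric: apiserver_request_total{verb=~"LIST|GET"}
```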
- Task 8.4.1: Comprehensive upstream comparison testing
  - Tested regex selectors on upstream-comparison branch
  - Created test SLO configurations for validation
  - Confirmed no regressions from feature branch

- Task 8.4.2: Root cause analysis
  - Identified grouping creates multiple SLOs (upstream behavior)
  - Identified NaN display issue (upstream cosmetic bug)
  - Documented technical architecture and design

- Task 8.4.3: Solution implementation
  - Chose documentation approach (no code changes needed)
  - Updated KNOWN_LIMITATIONS.md with user guidance
  - Provided best practices and workarounds

Key findings:
- Regex selectors work correctly in both upstream and feature branch
- Multiple SLO behavior with grouping is existing upstream design
- NaN issue affects all SLOs universally (not regex-specific)
- No regressions introduced by dynamic burn rate feature
- Feature ready for upstream contribution

Documentation created:
- .dev-docs/UPSTREAM_COMPARISON_REGEX_SELECTORS.md (complete test results)
- .dev-docs/KNOWN_LIMITATIONS.md (user-facing guidance)
- .dev-docs/TASK_8.4.{1,2,3}_*.md (sub-task documentation)
- .dev-docs/TASK_8.4_COMPLETE_SUMMARY.md (overall summary)
- Fixed duplicate task 8.3 (renamed second one to 8.5)
- Marked task 8.4 as complete [x]
- Marked task 8.4.3 as complete [x] (was [-])
- Renumbered 'Create pull request' task from 8.5 to 8.6

Task order now:
- 8.3: Organize files for PR vs fork separation
- 8.4: Investigate regex label selector (COMPLETE)
- 8.5: Update production documentation
- 8.6: Create pull request description
- Add concise dynamic burn rate section to README.md with Dev.to article reference
- Enhance examples/README.md with usage guidance and migration notes
- Add inline comments to all 4 dynamic burn rate example files
- Correct mathematical explanations (high traffic = lower threshold)
- Document task completion and mathematical correction
- Follow 'concise and proportional' principle - dynamic burn rate is ONE feature

Files updated:
- README.md: Added Dynamic Burn Rate Alerting section
- examples/README.md: Enhanced dynamic burn rate examples section
- examples/dynamic-burn-rate-*.yaml: Added header and inline comments (4 files)
- .dev-docs/: Added task documentation and math correction notes
- .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md: Updated with task 8.5 completion
- Created comprehensive file categorization document
- Documented 10 categories: PR files vs fork files
- Defined preservation strategy for dev artifacts
- Provided 7-step action plan for file organization
- Added verification checklist with 14 items
- Updated feature implementation summary
- Remove all development-only files (.dev-docs, .kiro, cmd/, scripts/, prompts/, testing/)
- Remove custom Docker files (Dockerfile.custom, Dockerfile.dev)
- Update test expectations for le='' label on latency recording rules
- Update test expectations for errors recording rules on ratio indicators
- Add default Alerting field values to test objectives
- Changes reflect query optimization work (task 7.10) and file organization (task 8.3)

All tests passing, builds successful (backend + UI)
- Remove debug console.log statements from List.tsx and AlertsTable.tsx
- Improve error logging (console.log -> console.error with context)
- Part of code quality and standards review for upstream contribution

Cherry-picked UI changes from dev-tools-and-docs branch (d13b8fc)
@metalmatze
Member

Wow!
This is incredible!

First thing, I'll have to read your blog post and fully understand how things work.
Then, if we decide to add the feature to Pyrra, I'd like to know whether we should split it into several PRs. As it stands, it's going to be quite hard to land as one gigantic PR, and reviewing every file in a single PR probably won't be a good experience.
