-
Notifications
You must be signed in to change notification settings - Fork 134
feat: Add dynamic burn rate alerting for traffic-aware SLO thresholds #1602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
yairst
wants to merge
95
commits into
pyrra-dev:main
Choose a base branch
from
yairst:add-dynamic-burn-rate
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- Add DynamicBurnRate type and configuration - Implement GetRemainingErrorBudget calculation - Clean up redundant code in rules.go - Improve Windows() function readability - Add dynamic factor scaling based on remaining error budget
- Add space before inline comments - Remove trailing whitespace - Normalize newlines in functions - Improve code readability
Add dynamic burn rate calculation that uses error budget percentages: - 1/48 (2.08%) per hour (50% per day) - 1/16 (6.25%) per 6h (100% per 4 days) - 1/14 (7.14%) per day - 1/7 (14.28%) per 4 days Implements dynamic burn rate calculation formula: (increase[slo_window] / increase[alert_window]) * error_budget_percent The implementation preserves existing window periods while adding proper error budget burn percentages for more accurate alerting.
- Add core dynamic burn rate logic for Ratio indicators - Implement buildAlertExpr() and buildDynamicAlertExpr() methods - Add dynamic threshold calculation: (N_SLO/N_alert) × E_budget_percent_threshold × (1-SLO_target) - Support traffic-aware alerting with proper PromQL generation - Maintain backward compatibility with static burn rate as default - Add comprehensive unit tests for both static and dynamic modes - Update test expectations to reflect 'static' as default BurnRateType Dynamic burn rate adapts alert thresholds to traffic volume: - Higher traffic periods get proportionally higher thresholds - Lower traffic periods get lower thresholds - Uses increase() functions for event counting over SLO and alert windows - Properly handles metric selectors and label matchers for error/total metrics All tests passing. Ready for extension to other indicator types.
… summary - Create comprehensive sli-indicator-types.md explaining all four indicator types - Document purpose, use cases, and burn rate calculations for each type - Explain why different indicator types exist and how they map to different metric formats - Update FEATURE_IMPLEMENTATION_SUMMARY.md with latest implementation status - Add clarifications about E_budget_percent_threshold being constants - Document current capabilities and remaining work priorities - Explain implementation strategy for indicator type support
- Fix 'Burn Rate Calculation' → 'Error Rate Calculation' in SLI indicator types - Update success criteria to reflect completed work: - API support: ✅ Complete - Dynamic alert thresholds: ✅ Complete (for Ratio indicators) - Traffic adaptation: ✅ Complete (for Ratio indicators) - Backward compatibility: ✅ Complete - Documentation: ✅ Complete - Performance validation: ✅ Complete (in tests) - Update status to reflect Priority 1 completion
� Core Implementation: - Extended buildDynamicAlertExpr() to support Latency indicators - Updated Burnrates() method for Latency case to use dynamic windows - Added helper methods buildTotalSelector() and buildLatencyTotalSelector() ⚡ Performance Optimization: - Both Ratio and Latency indicators now use recording rules for efficiency - Alert expressions use pre-computed burn rates + dynamic threshold calculation - Significantly reduces Prometheus evaluation load vs inline calculations � Comprehensive Testing: - Added TestObjective_DynamicBurnRate_Latency() test - Extended TestObjective_buildAlertExpr() with Latency test cases - Updated test expectations for optimized recording rule usage - All tests pass with new implementation � Current Support Status: - ✅ Ratio Indicators: Full dynamic burn rate support - ✅ Latency Indicators: Full dynamic burn rate support (NEW) - ⏳ LatencyNative & BoolGauge: Fall back to static (TODO) � Examples & Documentation: - Added examples/latency-dynamic-burnrate.yaml with practical configs - Updated feature implementation summary and SLI documentation - Documented performance improvements and implementation approach The implementation is production-ready and maintains full backward compatibility.
- Add CORE_CONCEPTS_AND_TERMINOLOGY.md with authoritative definitions for: * Error Rate, Error Budget, Burn Rate concepts * Static vs Dynamic burn rate threshold differences * Traffic scaling factor (N_SLO / N_alert) explanation * False positive/negative prevention mechanisms * Mathematical relationships and PromQL patterns - Update FEATURE_IMPLEMENTATION_SUMMARY.md to reference core concepts doc - Streamline implementation summary to focus on status and progress - Establish single source of truth for conceptual understanding These docs capture the corrected understanding of dynamic burn rate concepts for future code review sessions and development work.
Core Implementation Fixes: - Fix multi-window logic to use N_long for both windows (consistent traffic scaling) - Remove unused dynamicBurnRateExpr() function (code cleanup) - Fix DynamicWindows() to use scaled periods from Windows(sloWindow) - Map E_budget_percent_thresholds by static factor hierarchy (14→1/48, etc.) Key Behavioral Corrections: - Both short and long windows now use N_long denominator for traffic scaling - Window periods properly scale with any SLO duration via Windows() function - E_budget_percent_thresholds remain constant across SLO period choices - Window.Factor correctly serves as E_budget_percent_threshold in dynamic mode Documentation Updates: - Add multi-window logic explanation to CORE_CONCEPTS_AND_TERMINOLOGY.md - Add Window.Factor dual purpose design documentation - Add window period scaling details and architectural insights - Update FEATURE_IMPLEMENTATION_SUMMARY.md with recent fixes - Correct formula from (N_SLO / N_alert) to (N_SLO / N_long) All tests pass including TestObjective_DynamicBurnRate and TestObjective_DynamicBurnRate_Latency. Mathematical implementation now correctly matches the expected dynamic burn rate formula.
✅ Code Review Completed - Production Ready Status - Updated FEATURE_IMPLEMENTATION_SUMMARY.md with code review completion - Confirmed production readiness for Ratio & Latency indicators - Added edge case handling validation results - Updated PromQL examples to show recording rule implementation - Documented comprehensive test coverage completion ✅ Session Continuation Updates - Updated SESSION_CONTINUATION_PROMPT.md with correct status - Removed non-existent compilation error references - Added production readiness confirmation - Updated next priority tasks for remaining indicator types Status: Dynamic burn rate implementation for Ratio & Latency indicators is production-ready and fully validated through comprehensive code review.
…ypes - Extend dynamic burn rate support to LatencyNative and BoolGauge indicators - Add buildLatencyNativeTotalSelector() and buildBoolGaugeSelector() helper methods - Implement traffic-aware expressions for native histograms and boolean gauges - Add dynamic window logic to LatencyNative and BoolGauge cases in Burnrates() - Replace hardcoded alert expressions with unified buildAlertExpr() method - Add comprehensive test coverage for all indicator types - All backend dynamic burn rate logic now complete and production-ready Backend implementation status: ✅ Ratio indicators - Dynamic burn rate complete ✅ Latency indicators - Dynamic burn rate complete ✅ LatencyNative indicators - Dynamic burn rate complete ✅ BoolGauge indicators - Dynamic burn rate complete Next: UI integration and Grafana dashboard updates (future sessions)
- Create new prompts/ folder for session organization - Move existing session prompts from .dev-docs/ to prompts/ - Add NEXT_SESSION_PROMPT.md focused on React UI integration - Add prompts/README.md documenting session strategy Next session focus: - React UI integration for BurnRateType selection - Update SLO forms to support dynamic burn rate configuration - Backend implementation complete and ready for frontend work
- NEXT_SESSION_PROMPT.md → UI_INTEGRATION_SESSION_PROMPT.md - SESSION_CONTINUATION_PROMPT.md → BACKEND_COMPLETION_SESSION_PROMPT.md - Update README.md with new prompt names and usage guide Better organization: - Clear indication of session purpose and focus area - Active vs completed session status - Easy identification of which prompt to use next
- Add burn rate type display system with color-coded badges - Implement burn rate column in SLO list with sorting and visibility controls - Add burn rate information section to SLO detail pages - Create TypeScript infrastructure with BurnRateType enum and utilities - Add dynamic/static icons for visual distinction - Implement responsive design with tooltips and accessibility - Create demo SLO configurations for testing - Add comprehensive UI documentation - Update feature implementation status - Prepare next session prompt for API integration The UI foundation is now complete with mock detection logic. Next phase: API integration to eliminate mock data and connect to actual backend burn rate type field.
✅ All 5 core tasks completed: 1. Added Alerting message with burnRateType field to protobuf schema 2. Updated Go conversion functions (ToInternal/FromInternal) in objectives.go 3. Regenerated TypeScript protobuf definitions and implementations 4. Replaced mock detection logic with real API field access in burnrate.tsx 5. Validated end-to-end API integration with comprehensive testing � Technical Implementation: - Protobuf: Added Alerting message with string burn_rate_type field - Go: Complete bidirectional conversion between internal structs and protobuf - TypeScript: Manual updates for Windows compatibility with proper interfaces - Frontend: Real API field access (objective.alerting?.burnRateType) - Testing: Round-trip validation for both 'dynamic' and 'static' types � Status: API Integration Complete - Production Ready � Next: Priority 2 Alert Display Updates
…mic burn rates - Verified all 5 generic rules work identically for static and dynamic SLOs - Confirmed Grafana dashboards display both SLO types correctly without modifications - Validated error budget calculations use same formula for both types - Tested list and detail dashboards with mixed static/dynamic SLOs - Documented pre-existing Rate graph query bug (unrelated to feature) - Created comprehensive validation session document - Result: NO CHANGES NEEDED - dashboards work perfectly with dynamic SLOs
- Analyzed BurnRateThresholdDisplay implementation (uses raw metrics) - Validated recording rules provide 40x speedup for ratio indicators - Created validation tools for performance testing - Documented optimization strategy and performance benchmarks - Created sub-tasks 7.10.1-7.10.4 for implementation phase Analysis documents: - TASK_7.10_UI_QUERY_OPTIMIZATION_ANALYSIS.md - Full analysis - TASK_7.10_VALIDATION_RESULTS.md - Performance benchmarks - TASK_7.10_COMPLETION_SUMMARY.md - Phase 1 summary Validation tools: - cmd/validate-ui-query-optimization - Performance comparison - cmd/test-burnrate-threshold-queries - Query validation Key findings: - Ratio indicators: 694ms -> 17ms (40x speedup potential) - Latency indicators: 43ms -> 26ms (1.7x speedup potential) - Recording rules exist but UI doesn't use them yet - Implementation will happen in sub-tasks 7.10.1-7.10.4
- Fixed test queries to use only SLO window recording rules (not alert windows) - Added statistical rigor: 10 iterations per query with min/max/avg analysis - Added BoolGauge indicator testing (all three types now covered) - Executed tests and documented real performance measurements: * Ratio: 7.17x speedup (48.75ms -> 6.80ms) * Latency: 2.20x speedup (6.34ms -> 2.89ms) * BoolGauge: No benefit (already fast at 3ms) - Clarified terminology: SLO window vs alert windows - Key finding: Only SLO window has increase/count recording rules - Updated task 7.10.2 with validation findings and implementation guide - Created comprehensive implementation guide for next task
- Add hybrid query approach: recording rules for SLO window + inline for alert windows - Implement getBaseMetricName() to strip metric suffixes for recording rule naming - Implement getTrafficRatioQueryOptimized() for optimized query generation - Optimize ratio indicators (7.17x query speedup) and latency indicators (2.20x speedup) - Skip boolGauge optimization (already fast at 3ms) - Fix performance monitoring bug (was showing 677s due to never-reset timer) - Primary benefit: Prometheus load reduction, not UI speed (network overhead dominates) - Maintains backward compatibility with fallback to raw metrics Task 7.10.2 complete
- Add references to TASK_7.10_VALIDATION_RESULTS.md, TASK_7.10_IMPLEMENTATION.md, and TASK_7.10.1_TEST_IMPROVEMENTS.md - Include key findings from 7.10.2: network overhead dominates, main benefit is Prometheus load reduction - Update 7.10.4 to reflect validation already completed in 7.10.2 - Clarify that optimization provides minimal UI benefit but significant infrastructure benefit
- Created comprehensive decision document analyzing backend optimization - Documented current implementation vs optimized pattern - Calculated performance benefits: 7x for ratio, 2x for latency indicators - Production impact: ~1.77M seconds/year saved for ratio indicators at scale - Decision: IMPLEMENT optimization (primary benefit: Prometheus load reduction) - Added Task 7.10.5 to implementation plan for backend optimization - Updated feature implementation summary with Task 7.10.3 completion Key findings: - Alert rules evaluated every 30s (different profile than UI on-demand queries) - Main benefit is infrastructure load reduction, not alert evaluation speed - Hybrid approach: recording rule for SLO window + inline for alert windows - Consistent with UI implementation (Task 7.10.2) - Priority: MEDIUM-HIGH, implement after Task 7.10.4 References: - .dev-docs/TASK_7.10.3_BACKEND_OPTIMIZATION_DECISION.md - .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md - .kiro/specs/dynamic-burn-rate-completion/tasks.md
- Added getBaseMetricName() helper function to strip metric suffixes - Updated buildDynamicAlertExpr() for ratio indicators to use hybrid approach (recording rules for SLO window) - Updated buildDynamicAlertExpr() for latency indicators to use hybrid approach - Skipped boolGauge optimization (already fast, no benefit) - Fixed UI regression: BurnRateThresholdDisplay now uses actual SLO window instead of hardcoded 30d - This fixes 'no data available' issue for synthetic SLOs with 1d window - Backend optimization provides 7x speedup for ratio, 2x for latency indicators - Primary benefit: Prometheus CPU/memory load reduction at scale
…ll 7.10 sub-tasks
- Fixed critical latency threshold bug (2x traffic counting) - Added le="" label selector in UI and backend for latency indicators - Prevents summing both total and success recording rules - Fixed BurnrateGraph to show dynamic thresholds over time - Changed from instant to range queries for traffic calculation - Fixed React console warnings - Toggle: Added readOnly attribute - Detail.tsx: Fixed duplicate keys - AlertsTable: Added Fragment keys - DurationGraph: Added null checks - Removed debug logging from production code - Validated all indicator types (Ratio, Latency, BoolGauge, LatencyNative) - Updated documentation and steering standards
- Created SLO generator tool with window variation (7d, 28d, 30d) - Created performance monitoring tool for metrics collection - Created automated test script for health checks and validation - Generated 50 test SLOs ready for scale testing - Consolidated documentation into TASK_7.11_TESTING_INFRASTRUCTURE.md - Created TASK_7.12_MANUAL_TESTING_GUIDE.md for interactive testing - Updated tasks.md with proper task structure and references - Cleaned up redundant documentation files
- Executed baseline performance test with 16 current SLOs - Applied and tested 50 additional SLOs (medium scale: 66 total) - Applied and tested 100 additional SLOs (large scale: 116 total) - Collected comprehensive performance metrics (API response time, memory usage, Prometheus query performance) - Created PRODUCTION_PERFORMANCE_BENCHMARKS.md with detailed analysis - Key findings: Sub-linear API scaling, near-constant memory usage, stable Prometheus performance - Production readiness assessment: READY - Updated gitignore to exclude temporary test binaries and JSON metrics files
…ul degradation - Tested Chrome and Firefox (both PASS - identical behavior) - Tested graceful degradation: network throttling, API failures, Prometheus unavailability (all PASS) - Tested migration: static to dynamic, rollback, backward compatibility (all PASS) - Created browser compatibility matrix with test results and recommendations - Created comprehensive migration guide (validated during testing) - Discovered and documented 3 issues (1 HIGH severity, 2 LOW severity) - Created Task 7.12.1 for critical bug fix (white page crash for missing metrics) Deliverables: - .dev-docs/BROWSER_COMPATIBILITY_MATRIX.md - Complete test results - .dev-docs/MIGRATION_GUIDE.md - Migration instructions and best practices - .dev-docs/TASK_7.12_TESTING_COMPLETION_SUMMARY.md - Testing summary Production readiness: Ready for environments with reliable metrics. Fix Task 7.12.1 before deploying to environments with potentially missing metrics.
- Fix BurnrateGraph white page crash for dynamic SLOs with missing metrics - Add comprehensive null/undefined checks before Array.from() calls - Wrap dynamic threshold calculation in try-catch for graceful error handling - Fallback to static threshold when traffic data is missing/broken - Add console warnings for debugging - Fix Detail page showing 100% instead of 'No data' for missing metrics - Change default values from errors=0, total=1 to undefined - Tiles now correctly display 'No data' (consistent with main page) - Update documentation with fix details and testing coverage
- Performed systematic regression testing against upstream-comparison branch - Validated production build with all recent fixes (Task 7.12.1) - Zero regressions found - all original Pyrra functionality preserved - All 4 production build tests passed successfully - Tested 16 SLOs (4 static, 12 dynamic) in mixed environment Regression Testing Results: - Static SLO behavior identical to baseline (except intentional enhancements) - 6 intentional new features successfully integrated - No visual glitches, layout issues, or console errors - Auto-reload confirmed as original Pyrra behavior (not a regression) Production Build Validation: - Critical Task 7.12.1 fixes working perfectly (no white page crash) - All indicator types working correctly (ratio, latency, latencyNative, boolGauge) - Graceful error handling for missing/broken metrics - Performance acceptable (< 3 seconds page load) - 1 minor cosmetic issue found (false console warning - not blocking) Key Findings: - Backend service required for proper burn rate type detection - Mixed static/dynamic environment stable and working correctly - Feature is production ready for upstream contribution Documentation: - Created comprehensive test results document - Created step-by-step testing procedure guide - Created quick reference checklist - Updated feature implementation summary Status: PRODUCTION READY - Zero blockers, ready for upstream contribution
- Restructured Task 8 to focus on upstream integration (fetch/merge, file organization, production docs, PR description) - Streamlined Task 9 to reference existing validation work (Tasks 1-7 already complete) - Added UPSTREAM_CONTRIBUTION_PLAN.md with file organization strategy and PR preparation guide - Emphasized keeping production documentation updates concise and proportional - Removed duplicate testing/documentation tasks already completed in Tasks 1-7
- Added Task 8.0 as mandatory pre-merge cleanup step (must do before Task 8.1) - Created comprehensive cleanup checklist in TASK_8.0_PRE_MERGE_CLEANUP_CHECKLIST.md - Addresses manual code review findings: - Revert unintended changes (CONTRIBUTING.md, deployment manifests, index.html, etc.) - Move examples from .dev/ to examples/ - Backend code cleanup (remove duplicates in slo/rules.go, unused code in slo/slo.go) - CRD cleanup (remove redundant variables) - Test file review and decisions - UI code review (Toggle.tsx, old docs) - Investigate filesystem.go changes and determine testing needs - Investigate proto changes - Updated Task 9.3 to reference filesystem mode testing decision from Task 8.0 - Updated UPSTREAM_CONTRIBUTION_PLAN.md timeline to include Task 8.0
- Reverted unintended changes (pyrra-kubernetesDeployment.yaml, ui/public/index.html, filesystem.go) - Removed unused code (~47 lines from slo/slo.go and CRD types) - Updated comment format in slo/rules.go - Moved ui/DYNAMIC_BURN_RATE_UI.md to .dev-docs/HISTORICAL_UI_DESIGN.md - Updated CONTRIBUTING.md with ui/README.md reference - Clarified architecture: test metric only needed in API server (main.go) - Created comprehensive cleanup documentation All tests passing, code compiles successfully.
- Add 4 dynamic burn rate examples (ratio, latency, latencyNative, boolGauge) - Create concise examples/README.md (~70 lines, comparable to upstream) - Use real metrics from actual services (apiserver, prometheus, pyrra) - Minimal comments consistent with existing examples - All examples verified showing actual data in Pyrra UI - Delete redundant latency-dynamic-burnrate.yaml and simple-demo.yaml - Add Task 8.5 for regex label selector investigation - Update documentation to reflect 4 examples Task 8.2 complete
- Task 8.4.1: Comprehensive upstream comparison testing
- Tested regex selectors on upstream-comparison branch
- Created test SLO configurations for validation
- Confirmed no regressions from feature branch
- Task 8.4.2: Root cause analysis
- Identified grouping creates multiple SLOs (upstream behavior)
- Identified NaN display issue (upstream cosmetic bug)
- Documented technical architecture and design
- Task 8.4.3: Solution implementation
- Chose documentation approach (no code changes needed)
- Updated KNOWN_LIMITATIONS.md with user guidance
- Provided best practices and workarounds
Key findings:
- Regex selectors work correctly in both upstream and feature branch
- Multiple SLO behavior with grouping is existing upstream design
- NaN issue affects all SLOs universally (not regex-specific)
- No regressions introduced by dynamic burn rate feature
- Feature ready for upstream contribution
Documentation created:
- .dev-docs/UPSTREAM_COMPARISON_REGEX_SELECTORS.md (complete test results)
- .dev-docs/KNOWN_LIMITATIONS.md (user-facing guidance)
- .dev-docs/TASK_8.4.{1,2,3}_*.md (sub-task documentation)
- .dev-docs/TASK_8.4_COMPLETE_SUMMARY.md (overall summary)
- Fixed duplicate task 8.3 (renamed second one to 8.5) - Marked task 8.4 as complete [x] - Marked task 8.4.3 as complete [x] (was [-]) - Renumbered 'Create pull request' task from 8.5 to 8.6 Task order now: - 8.3: Organize files for PR vs fork separation - 8.4: Investigate regex label selector (COMPLETE) - 8.5: Update production documentation - 8.6: Create pull request description
- Add concise dynamic burn rate section to README.md with Dev.to article reference - Enhance examples/README.md with usage guidance and migration notes - Add inline comments to all 4 dynamic burn rate example files - Correct mathematical explanations (high traffic = lower threshold) - Document task completion and mathematical correction - Follow 'concise and proportional' principle - dynamic burn rate is ONE feature Files updated: - README.md: Added Dynamic Burn Rate Alerting section - examples/README.md: Enhanced dynamic burn rate examples section - examples/dynamic-burn-rate-*.yaml: Added header and inline comments (4 files) - .dev-docs/: Added task documentation and math correction notes - .dev-docs/FEATURE_IMPLEMENTATION_SUMMARY.md: Updated with task 8.5 completion
- Created comprehensive file categorization document - Documented 10 categories: PR files vs fork files - Defined preservation strategy for dev artifacts - Provided 7-step action plan for file organization - Added verification checklist with 14 items - Updated feature implementation summary
- Remove all development-only files (.dev-docs, .kiro, cmd/, scripts/, prompts/, testing/) - Remove custom Docker files (Dockerfile.custom, Dockerfile.dev) - Update test expectations for le='' label on latency recording rules - Update test expectations for errors recording rules on ratio indicators - Add default Alerting field values to test objectives - Changes reflect query optimization work (task 7.10) and file organization (task 8.3) All tests passing, builds successful (backend + UI)
- Remove debug console.log statements from List.tsx and AlertsTable.tsx - Improve error logging (console.log -> console.error with context) - Part of code quality and standards review for upstream contribution Cherry-picked UI changes from dev-tools-and-docs branch (d13b8fc)
|
Wow! First thing, I'll have to read your blog post and fully understand how things are working. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
This PR implements dynamic burn rate alerting that adapts alert thresholds based on actual traffic patterns, preventing false positives during low traffic and false negatives during high traffic periods.
Motivation
Traditional static burn rate multipliers (14x, 7x, 2x, 1x) don't account for traffic variations, leading to:
Dynamic burn rates solve this by calculating thresholds that maintain consistent absolute error budget consumption regardless of traffic volume:
Key Insight: This formula ensures alerts fire at the same absolute number of errors regardless of traffic. The threshold percentage adapts to traffic: lower during high traffic, higher during low traffic, but always requiring the same absolute error count.
This methodology is based on my blog post, "Error Budget Is All You Need - Part 2".
Implementation Summary
Backend Changes
Core Implementation (
slo/rules.go):buildDynamicAlertExpr()method implementing traffic-aware threshold calculationBurnrates()method to route between static and dynamic expressionsCRD Changes (
kubernetes/api/v1alpha1/servicelevelobjective_types.go):BurnRateTypefield to SLO spec (values: "static", "dynamic")Indicator Type Support:
increase()for traffic calculation_countmetrics withle=""label selectorhistogram_count(sum(increase(...)))for native histogramscount_over_time()for boolean gauge observationsAPI Changes
Protobuf (
proto/objectives/v1alpha1/objectives.proto):burn_rate_typefield to Objective messageUI Changes
Core Components:
ui/src/List.tsx): Added "Burn Rate" column with sortable badgesui/src/Detail.tsx): Added burn rate type badge with traffic contextui/src/AlertsTable.tsx): Added "Error Budget Consumption" columnui/src/components/BurnRateThresholdDisplay.tsx): Real-time dynamic threshold calculationui/src/components/BurnrateGraph.tsx): Dynamic threshold visualizationUser Experience Enhancements:
Backend Alert Rules:
Testing Evidence
Mathematical Validation
Core Concept: Dynamic burn rates maintain consistent absolute error budget consumption regardless of traffic volume.
Mathematical Proof:
Alert fires when:
error_rate > (N_SLO / N_alert) × E_budget_percent × (1 - SLO_target)Since
error_rate = errors / N_alert, we can substitute:Multiply both sides by N_alert:
Since
N_SLO × (1 - SLO_target) = E_budget(absolute error budget for SLO period):Result: The N_alert terms cancel out! Alerts fire at the same absolute error count regardless of traffic.
Example Validation:
Given:
High Traffic Scenario:
Low Traffic Scenario:
Same absolute threshold (200 errors), vastly different error rate thresholds (2% vs 20%)!
Benefits:
Validation Results:
UI Regression Testing
Test Environment:
Regression Testing Results:
Production Build Validation:
Alert Firing Validation
Validated Results:
Browser Compatibility Testing
Browsers Tested:
Graceful Degradation Testing:
Breaking Changes
None. This feature is completely opt-in and backward compatible:
burnRateType: static(existing behavior)burnRateType: dynamicto enable new behaviorMigration Guide
Enabling Dynamic Burn Rates
For new SLOs, add
burnRateType: dynamicto the alerting section:For existing SLOs, edit the SLO YAML and add
burnRateType: dynamic:Validation
After enabling dynamic burn rates:
Check Prometheus Rules: Verify dynamic expressions generated
Check UI: Verify green "Dynamic" badge appears on SLO list page
Check Thresholds: Verify calculated threshold values display in alerts table
Monitor Alerts: Observe alert behavior with traffic variations
Rollback
To revert to static burn rates:
Or remove the field entirely (defaults to static).
Examples
Four comprehensive examples are provided in the
examples/directory:examples/dynamic-burn-rate-ratio.yaml- Ratio indicator (API success rate)examples/dynamic-burn-rate-latency.yaml- Latency indicator (histogram-based)examples/dynamic-burn-rate-latency-native.yaml- Native histogram latencyexamples/dynamic-burn-rate-bool-gauge.yaml- Boolean gauge (availability)Each example includes:
Design Decisions
1. Opt-In Feature (Not Default)
Decision: Dynamic burn rates require explicit
burnRateType: dynamicconfigurationRationale:
2. Latency Indicator Label Selector
Decision: Always add
le=""label selector when querying latency recording rulesRationale:
le="",sum()aggregation includes BOTH rules (2x traffic)le=""selector ensures only total traffic is counted3. Error Handling Strategy
Decision: Graceful degradation with fallback displays instead of crashes
Rationale:
Documentation
User-Facing Documentation
Updated Files:
README.md- Added dynamic burn rate feature sectionexamples/README.md- Added dynamic SLO examples with explanationsexamples/*.yaml- Four comprehensive example configurationsDevelopment Documentation (Fork Only)
Comprehensive Internal Documentation (40+ documents in
.dev-docs/):Development Tools (Fork Only -
cmd/directory):References
Methodology:
Before/After Examples
Example 1: Static vs Dynamic Threshold Comparison
Scenario: API service with 99% SLO target, 30d window
Static Burn Rate (Factor 14):
Dynamic Burn Rate (High Traffic):
Dynamic Burn Rate (Low Traffic):
Key Insight: Both scenarios require the same absolute number of errors (200), but the error rate thresholds differ dramatically (2% vs 20%).
Example 2: UI Display Comparison
Static SLO - List Page:
Dynamic SLO - List Page:
Static SLO - Alerts Table:
Dynamic SLO - Alerts Table:
Tooltip Comparison:
Static SLO Tooltip:
Dynamic SLO Tooltip:
Example 3: Alert Rule Comparison
Static Alert Rule (PrometheusRule):
Dynamic Alert Rule (PrometheusRule):
Key Differences: