Feature Description
Before enabling an alert rule, show users how many times it would have triggered in the past 7 days (or a custom time window). This "alert preview" feature helps users tune thresholds, avoid alert fatigue, and build confidence that the alert will actually be useful.
Problem/Use Case
Current problem:
- Users create alert rules blindly, hoping the threshold is "about right"
- Alert goes live and either:
  - Fires constantly (alert fatigue) → gets disabled
  - Never fires (threshold too high) → misses real issues
- No way to know if an alert is tuned correctly without trial-and-error
- Takes weeks to realize an alert is poorly configured
- Teams lose trust in alerting systems due to false positives
Real-world scenario:
DevOps creates alert: "Trigger when error rate > 100/min"
Possibilities:
❌ Too sensitive: Fires 50 times/day → ignored → real issue missed
❌ Too loose: Never fires → critical outage goes unnoticed for hours
✓ Just right: Fires 2-3 times/week for real issues
Problem: No way to know which scenario you're in until it's live!
User frustration:
"I set up an alert for high error rates, but I have no idea if 100 errors/min is a good threshold. Should it be 50? 500? I'm just guessing."
Proposed Solution
Add "Alert Preview" feature that analyzes historical data:
UI/UX Flow:
Step 1: Create alert rule (as usual)
Alert Name: High Error Rate
Condition: level:error
Threshold: rate > 100/min for 5 minutes
Step 2: Click "Preview Alert" button
Step 3: See analysis
┌─────────────────────────────────────────────────────┐
│ 📊 Alert Preview (Last 7 Days) │
├─────────────────────────────────────────────────────┤
│ │
│ This alert would have fired 23 times │
│ │
│ Breakdown: │
│ • 15 times on weekdays (during business hours) │
│ • 8 times on weekends │
│ │
│ Average duration: 3.2 minutes │
│ Longest incident: 47 minutes (Jan 12, 14:32) │
│ │
│ Most recent triggers: │
│ • Yesterday at 14:32 (347 errors/min, 12min) │
│ • Jan 13 at 09:15 (156 errors/min, 4min) │
│ • Jan 12 at 14:32 (523 errors/min, 47min) ← worst │
│ │
│ ⚠️ Suggestion: │
│ This alert may be too sensitive. Consider: │
│ • Increasing threshold to 150/min │
│ • Adding time-of-day filters (weekdays only) │
│ • Requiring 10min duration instead of 5min │
│ │
│ [Adjust Threshold] [Enable Alert] [Cancel] │
└─────────────────────────────────────────────────────┘
Step 4: Adjust and re-preview
User changes threshold: 100/min → 150/min
Clicks "Preview" again
New result: "Would have fired 7 times" ← much better!
Step 5: Enable with confidence
User clicks "Enable Alert"
→ Alert goes live with tuned threshold
→ Minimal false positives
→ Team trusts the alert system
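For illustration, the "Preview Alert" button could map to a single backend call. The sketch below shows what that request/response might look like; the field names are assumptions for this proposal (the response numbers mirror the mockup above), not an existing Logtide API.
// Sketch only: hypothetical preview request/response
const previewRequest = {
  rule: {
    name: 'High Error Rate',
    query: 'level:error',
    aggregation: 'rate',   // errors per minute
    threshold: 100,        // fire when rate > 100/min
    duration: 5,           // ...sustained for 5 minutes
  },
  timeWindow: '7d',        // or a custom { start, end } range
};

// The response carries everything the preview panel needs to render:
// total trigger count, per-incident details, aggregate stats, suggestions.
const previewResponse = {
  totalTriggers: 23,
  statistics: { avgDuration: 3.2, maxDuration: 47 },
  incidents: [ /* start/end time, duration, peak value, sample logs */ ],
  suggestions: [{ type: 'threshold_too_low', suggestedValue: 150 }],
};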
Alternatives Considered
- Manual backtesting
  - User must manually query logs and count matches
  - ✗ Time-consuming, error-prone
  - ✗ Doesn't show timeline or suggestions
- "Dry run" mode for alerts
  - Alert runs but doesn't notify, logs what would have fired
  - ✗ Must wait days/weeks to gather data
  - ✗ Still trial-and-error
  - ✓ Could complement preview feature
- AI-suggested thresholds
  - ML analyzes patterns and suggests optimal values
  - ✗ Black box, users don't understand why
  - ✗ Requires ML infrastructure
  - ✗ Overkill for most cases
- Show only aggregated stats (no preview)
  - Display "avg errors/min: 45" in UI
  - ✗ User still has to mentally calculate
  - ✗ Doesn't show actual trigger events
Chosen approach: Historical simulation with visual timeline + actionable suggestions
Implementation Details (Optional)
Technical approach:
1. Backend: Alert simulation engine
interface AlertPreview {
  totalTriggers: number;
  incidents: AlertIncident[];
  suggestions: AlertSuggestion[];
  statistics: {
    avgDuration: number;
    maxDuration: number;
    byDayOfWeek: Record<string, number>;
    byHourOfDay: Record<number, number>;
  };
}

interface AlertIncident {
  startTime: Date;
  endTime: Date;
  duration: number; // minutes
  peakValue: number;
  sampleLogs: LogEntry[];
}
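The AlertRule and AlertSuggestion types referenced in this sketch aren't spelled out above; one possible shape, inferred from how they're used below (field names are assumptions for this proposal, not an existing schema):
// Sketch: assumed shapes for the rule and suggestion types used below
interface AlertRule {
  name: string;
  query: string;                   // e.g. "level:error"
  aggregation: 'count' | 'rate';   // how to aggregate each window
  threshold: number;               // e.g. 100 (per minute)
  duration: number;                // minutes the condition must hold, e.g. 5
}

interface AlertSuggestion {
  type: 'threshold_too_low' | 'threshold_too_high' | 'time_filter';
  message: string;
  action: {
    type: 'adjust_threshold' | 'add_time_filter';
    currentValue?: number;
    suggestedValue?: number;
    suggestedFilter?: string;
    reason: string;
  };
}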
async function previewAlert(
  rule: AlertRule,
  timeWindow: { start: Date; end: Date }
): Promise<AlertPreview> {
  // 1. Execute alert query against historical logs
  const results = await queryHistoricalLogs(rule.query, timeWindow);

  // 2. Apply threshold logic with a sliding window
  const incidents: AlertIncident[] = [];
  let currentIncident: AlertIncident | null = null;

  const closeIncident = (incident: AlertIncident) => {
    incident.duration =
      (incident.endTime.getTime() - incident.startTime.getTime()) / 60000; // minutes
    incidents.push(incident);
  };

  for (const window of slidingWindows(results, rule.duration)) {
    const value = aggregateWindow(window, rule.aggregation); // count, rate, etc.

    if (evaluateThreshold(value, rule.threshold)) {
      if (!currentIncident) {
        // Threshold crossed: open a new incident
        currentIncident = {
          startTime: window.start,
          endTime: window.end,
          duration: 0, // filled in when the incident closes
          peakValue: value,
          sampleLogs: window.logs.slice(0, 5),
        };
      } else {
        // Still above threshold: extend the current incident
        currentIncident.endTime = window.end;
        currentIncident.peakValue = Math.max(currentIncident.peakValue, value);
      }
    } else if (currentIncident) {
      // Dropped below threshold: incident ended
      closeIncident(currentIncident);
      currentIncident = null;
    }
  }

  // Don't lose an incident that is still open at the end of the window
  if (currentIncident) {
    closeIncident(currentIncident);
  }

  // 3. Generate statistics
  const statistics = calculateStatistics(incidents);

  // 4. Generate suggestions
  const suggestions = generateSuggestions(rule, incidents, statistics);

  return {
    totalTriggers: incidents.length,
    incidents,
    suggestions,
    statistics,
  };
}

2. Suggestion engine
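The generateSuggestions function below leans on a calculateOptimalThreshold helper. One simple interpretation, purely as a sketch (not a settled algorithm), is to pick a value at a given percentile of the peak values observed during the preview window:
// Sketch: suggest a threshold at a given percentile of observed peak values
function calculateOptimalThreshold(
  incidents: AlertIncident[],
  percentile: number // 0..1, e.g. 0.3 or 0.95
): number {
  // With no incidents there are no peaks to rank; callers should fall back
  // to the raw per-window aggregates in that case.
  if (incidents.length === 0) return 0;

  const peaks = incidents.map((i) => i.peakValue).sort((a, b) => a - b);
  const index = Math.min(
    peaks.length - 1,
    Math.floor(percentile * (peaks.length - 1))
  );
  return peaks[index];
}
How the chosen percentile maps to a target trigger rate would need tuning against real data.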
function generateSuggestions(
  rule: AlertRule,
  incidents: AlertIncident[],
  stats: AlertStatistics
): AlertSuggestion[] {
  const suggestions: AlertSuggestion[] = [];

  // Too many triggers?
  if (incidents.length > 20) {
    suggestions.push({
      type: 'threshold_too_low',
      message: `Alert may be too sensitive (${incidents.length} triggers in the preview window)`,
      action: {
        type: 'adjust_threshold',
        currentValue: rule.threshold,
        suggestedValue: calculateOptimalThreshold(incidents, 0.3), // 30th percentile
        reason: 'Would reduce triggers to ~7/week',
      },
    });
  }

  // Too few triggers?
  if (incidents.length === 0) {
    suggestions.push({
      type: 'threshold_too_high',
      message: 'Alert would never have fired',
      action: {
        type: 'adjust_threshold',
        currentValue: rule.threshold,
        // With zero incidents there are no peaks to rank; in practice this
        // would be derived from the raw per-window values instead.
        suggestedValue: calculateOptimalThreshold(incidents, 0.95),
        reason: 'Would catch 95th percentile spikes',
      },
    });
  }

  // Noisy during specific times? (e.g. nightly batch jobs around 2am)
  const nightShare =
    incidents.length > 0 ? (stats.byHourOfDay[2] ?? 0) / incidents.length : 0;
  if (nightShare > 0.3) {
    suggestions.push({
      type: 'time_filter',
      message: `${Math.round(nightShare * 100)}% of triggers happen at 2am (likely batch jobs)`,
      action: {
        type: 'add_time_filter',
        suggestedFilter: 'hour >= 6 AND hour <= 22', // only 6am-10pm
        reason: 'Exclude scheduled maintenance windows',
      },
    });
  }

  return suggestions;
}

3. Frontend UI
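The AlertPreviewModal component below assumes a small API client, called as api.previewAlert. A minimal sketch of that wrapper, with a hypothetical endpoint path:
// Sketch: client wrapper behind api.previewAlert (endpoint path is assumed)
async function previewAlertRequest(
  rule: AlertRule,
  timeWindow: string // e.g. '7d'
): Promise<AlertPreview> {
  const response = await fetch('/api/alerts/preview', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ rule, timeWindow }),
  });
  if (!response.ok) {
    throw new Error(`Alert preview request failed: ${response.status}`);
  }
  return response.json();
}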
// Alert preview component (React)
import { useEffect, useState } from 'react';

interface AlertPreviewModalProps {
  rule: AlertRule;
  onClose: () => void;
  onApply: (rule: AlertRule) => void;
}

function AlertPreviewModal({ rule, onClose, onApply }: AlertPreviewModalProps) {
  const [preview, setPreview] = useState<AlertPreview | null>(null);
  const [loading, setLoading] = useState(true);
  const [timeWindow, setTimeWindow] = useState('7d');

  useEffect(() => {
    loadPreview();
  }, [rule, timeWindow]);

  async function loadPreview() {
    setLoading(true);
    const result = await api.previewAlert(rule, timeWindow);
    setPreview(result);
    setLoading(false);
  }

  function applySuggestion(suggestion: AlertSuggestion) {
    // Update rule with suggested changes
    // Re-run preview with new values
  }

  return (
    <Modal onClose={onClose}>
      <h2>Alert Preview: {rule.name}</h2>
      {loading || !preview ? <Spinner /> : (
        <>
          <StatsSummary preview={preview} />
          <IncidentTimeline incidents={preview.incidents} />
          <Suggestions
            suggestions={preview.suggestions}
            onApply={applySuggestion}
          />
          <Button onClick={() => onApply(rule)}>
            Enable Alert
          </Button>
        </>
      )}
    </Modal>
  );
}

4. Database optimization
-- Preview queries need to be fast.
-- Ensure indexes support common alert patterns.
CREATE INDEX idx_logs_preview
ON logs (source_id, timestamp, level);
-- Note: a partial index predicate like "WHERE timestamp > NOW() - INTERVAL '30 days'"
-- is not allowed in PostgreSQL (partial-index predicates must use immutable
-- functions), so either index the full table as above or recreate a partial
-- index with a fixed cutoff on a schedule.

-- For rate-based alerts (time_bucket requires TimescaleDB)
CREATE INDEX idx_logs_time_bucket
ON logs (source_id, time_bucket('1 minute', timestamp));

Performance considerations:
- Cache preview results, invalidated when new logs arrive (see the sketch after this list)
- Limit preview window to max 30 days
- Sample data for very high-volume sources
- Run preview queries asynchronously (show progress bar)
- Pre-aggregate common metrics (errors/min, etc.)
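As a rough sketch of the caching idea above (the TTL, key derivation, and in-memory store are assumptions; a shared cache such as Redis would work the same way):
import { createHash } from 'node:crypto';

// Sketch: cache preview results keyed by the rule definition and time window.
// A simple TTL stands in for "invalidate on new logs"; in practice invalidation
// could also be triggered by ingestion for the relevant source.
const previewCache = new Map<string, { result: AlertPreview; expiresAt: number }>();
const PREVIEW_CACHE_TTL_MS = 5 * 60 * 1000; // assumed: 5 minutes

function previewCacheKey(rule: AlertRule, timeWindow: { start: Date; end: Date }): string {
  // Round the window to whole minutes so repeated "last 7 days" previews share a key
  const bucket = (d: Date) => Math.floor(d.getTime() / 60000);
  return createHash('sha256')
    .update(JSON.stringify({ rule, start: bucket(timeWindow.start), end: bucket(timeWindow.end) }))
    .digest('hex');
}

async function cachedPreviewAlert(
  rule: AlertRule,
  timeWindow: { start: Date; end: Date }
): Promise<AlertPreview> {
  const key = previewCacheKey(rule, timeWindow);
  const hit = previewCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.result;

  const result = await previewAlert(rule, timeWindow);
  previewCache.set(key, { result, expiresAt: Date.now() + PREVIEW_CACHE_TTL_MS });
  return result;
}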
Priority
- Critical - Blocking my usage of LogTide
- High - Would significantly improve my workflow
- Medium - Nice to have
- Low - Minor enhancement
Rationale: This feature dramatically reduces alert fatigue and makes Logtide's alerting system actually usable for production teams. It's the difference between "alerts I trust" and "alerts I ignore."
Target Users
- DevOps Engineers (primary: responsible for alerting)
- Developers (configure alerts for their services)
- Security/SIEM Users (tune security alerts)
- System Administrators
- All Users
Primary benefit: Anyone who creates alerts and wants them to be useful, not noisy.
Additional Context
Why this is important:
1. Alert fatigue is a real problem:
Gartner study: "50% of alerts are ignored due to false positives"
PagerDuty: "Average team receives 200+ alerts/week, only 30 are actionable"
With preview:
→ User sees "would fire 200 times/week"
→ Adjusts threshold
→ New preview: "would fire 8 times/week"
→ Enables alert with confidence
2. Competitive differentiation:
- Datadog: No preview feature (just trial-and-error)
- PagerDuty: Has "alert testing" but requires live traffic
- Grafana: No built-in preview
- Splunk: Has backtesting but it's complex
- Logtide advantage: Built-in, visual, actionable
3. Trust-building:
Users trust Logtide more when it helps them avoid mistakes before making them.
Real user scenario:
Without preview:
Day 1: Create alert "error rate > 50/min"
Day 2: Alert fires 30 times
Day 3: Increase to 100/min, still fires 20 times
Day 4: Increase to 200/min, never fires
Day 5: Miss critical outage because threshold too high
Week 2: Disable alert entirely, go back to manual monitoring
With preview:
Day 1: Create alert "error rate > 50/min"
Day 1: Preview shows "would fire 89 times in last week"
Day 1: Adjust to 120/min, preview shows "would fire 6 times"
Day 1: Enable alert with confidence
Week 2: Alert fires twice for real issues, team responds
Marketing angles:
"Stop guessing. Start knowing. Preview exactly how your alerts will behave before enabling them."
"Logtide's Alert Preview helps you tune thresholds in seconds, not weeks."
Future enhancements:
- Compare multiple threshold values side-by-side
- Export preview report for team review
- "Seasonal" preview (compare same day last month)
- Integration with detection packs (preview entire pack)
- A/B testing for alert rules
Educational content opportunity:
Blog post: "The Alert Tuning Problem (And How We Solved It)"
- Explain alert fatigue
- Show preview feature
- Include best practices
- Position Logtide as thoughtfully designed
Implementation phases:
MVP (v1):
- Basic preview: trigger count, last 7 days
- Simple timeline of incidents
- No suggestions (yet)
v2:
- Add suggestions engine
- Time-of-day analysis
- Duration statistics
v3:
- Multiple threshold comparison
- Seasonal analysis
- Team sharing
Contribution
- I would like to work on implementing this feature