[Feature] Alert Preview & "Would Have Fired" Analysis #91

@Polliog

Description

Feature Description

Before enabling an alert rule, show users how many times it would have triggered over the past 7 days (or a custom time window). This "alert preview" feature helps users tune thresholds, avoid alert fatigue, and build confidence that the alert will actually be useful.

Problem/Use Case

Current problem:

  • Users create alert rules blindly, hoping the threshold is "about right"
  • Alert goes live and either:
    • Fires constantly (alert fatigue) → gets disabled
    • Never fires (threshold too high) → misses real issues
  • No way to know if an alert is tuned correctly without trial-and-error
  • Takes weeks to realize an alert is poorly configured
  • Teams lose trust in alerting systems due to false positives

Real-world scenario:

A DevOps engineer creates an alert: "Trigger when error rate > 100/min"

Possibilities:
❌ Too sensitive: Fires 50 times/day → ignored → real issue missed
❌ Too loose: Never fires → critical outage goes unnoticed for hours
✓ Just right: Fires 2-3 times/week for real issues

Problem: No way to know which scenario you're in until it's live!

User frustration:

"I set up an alert for high error rates, but I have no idea if 100 errors/min is a good threshold. Should it be 50? 500? I'm just guessing."

Proposed Solution

Add "Alert Preview" feature that analyzes historical data:

UI/UX Flow:

Step 1: Create alert rule (as usual)

Alert Name: High Error Rate
Condition: level:error
Threshold: rate > 100/min for 5 minutes

Step 2: Click "Preview Alert" button

Step 3: See analysis

┌─────────────────────────────────────────────────────┐
│ 📊 Alert Preview (Last 7 Days)                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│ This alert would have fired 23 times                │
│                                                     │
│ Breakdown:                                          │
│ • 15 times on weekdays (during business hours)     │
│ • 8 times on weekends                              │
│                                                     │
│ Average duration: 3.2 minutes                       │
│ Longest incident: 47 minutes (Jan 12, 14:32)       │
│                                                     │
│ Most recent trigger:                                │
│ • Yesterday at 14:32 (347 errors/min, 12min)       │
│ • Jan 13 at 09:15 (156 errors/min, 4min)           │
│ • Jan 12 at 14:32 (523 errors/min, 47min) ← worst  │
│                                                     │
│ ⚠️ Suggestion:                                      │
│ This alert may be too sensitive. Consider:         │
│ • Increasing threshold to 150/min                  │
│ • Adding time-of-day filters (weekdays only)       │
│ • Requiring 10min duration instead of 5min         │
│                                                     │
│ [Adjust Threshold] [Enable Alert] [Cancel]         │
└─────────────────────────────────────────────────────┘

Step 4: Adjust and re-preview

User changes threshold: 100/min → 150/min
Clicks "Preview" again
New result: "Would have fired 7 times" ← much better!

Step 5: Enable with confidence

User clicks "Enable Alert"
→ Alert goes live with tuned threshold
→ Minimal false positives
→ Team trusts the alert system

Alternatives Considered

  1. Manual backtesting

    • User must manually query logs and count matches
    • ✗ Time-consuming, error-prone
    • ✗ Doesn't show timeline or suggestions
  2. "Dry run" mode for alerts

    • Alert runs but doesn't notify, logs what would have fired
    • ✗ Must wait days/weeks to gather data
    • ✗ Still trial-and-error
    • ✓ Could complement preview feature
  3. AI-suggested thresholds

    • ML analyzes patterns and suggests optimal values
    • ✗ Black box, users don't understand why
    • ✗ Requires ML infrastructure
    • ✗ Overkill for most cases
  4. Show only aggregated stats (no preview)

    • Display "avg errors/min: 45" in UI
    • ✗ User still has to mentally calculate
    • ✗ Doesn't show actual trigger events

Chosen approach: Historical simulation with visual timeline + actionable suggestions

Implementation Details (Optional)

Technical approach:

1. Backend: Alert simulation engine

interface AlertPreview {
  totalTriggers: number;
  incidents: AlertIncident[];
  suggestions: AlertSuggestion[];
  statistics: {
    avgDuration: number;
    maxDuration: number;
    byDayOfWeek: Record<string, number>;
    byHourOfDay: Record<number, number>;
  };
}

interface AlertIncident {
  startTime: Date;
  endTime: Date;
  duration: number; // minutes
  peakValue: number;
  sampleLogs: LogEntry[];
}
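
The AlertRule, AlertSuggestion, AlertStatistics, and LogEntry types are referenced but not spelled out in this issue. Hypothetical shapes, with field names chosen only to match how they are used in the code below (the inline statistics object in AlertPreview corresponds to AlertStatistics):

// Hypothetical supporting types; field names are assumptions based on usage below.
interface AlertRule {
  name: string;                   // e.g. "High Error Rate"
  query: string;                  // log filter, e.g. "level:error"
  aggregation: 'count' | 'rate';  // how matching logs are reduced per window
  threshold: number;              // e.g. 100 (errors/min)
  duration: number;               // minutes the condition must hold, e.g. 5
}

interface LogEntry {
  timestamp: Date;
  level: string;
  message: string;
}

interface AlertStatistics {
  avgDuration: number;
  maxDuration: number;
  byDayOfWeek: Record<string, number>;
  byHourOfDay: Record<number, number>;
}

interface AlertSuggestion {
  type: string;
  message: string;
  action: {
    type: string;
    currentValue?: number;
    suggestedValue?: number;
    suggestedFilter?: string;
    reason: string;
  };
}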

async function previewAlert(
  rule: AlertRule,
  timeWindow: { start: Date; end: Date }
): Promise<AlertPreview> {
  // 1. Execute alert query against historical logs
  const results = await queryHistoricalLogs(rule.query, timeWindow);
  
  // 2. Apply threshold logic with sliding window
  const incidents: AlertIncident[] = [];
  let currentIncident: AlertIncident | null = null;

  const closeIncident = (incident: AlertIncident) => {
    incident.duration =
      (incident.endTime.getTime() - incident.startTime.getTime()) / 60000;
    incidents.push(incident);
  };

  for (const window of slidingWindows(results, rule.duration)) {
    const value = aggregateWindow(window, rule.aggregation); // count, rate, etc.

    if (evaluateThreshold(value, rule.threshold)) {
      if (!currentIncident) {
        currentIncident = {
          startTime: window.start,
          endTime: window.end,
          duration: 0, // filled in when the incident closes
          peakValue: value,
          sampleLogs: window.logs.slice(0, 5),
        };
      } else {
        // Extend current incident
        currentIncident.endTime = window.end;
        currentIncident.peakValue = Math.max(currentIncident.peakValue, value);
      }
    } else if (currentIncident) {
      // Incident ended
      closeIncident(currentIncident);
      currentIncident = null;
    }
  }

  // Close an incident that is still open at the end of the preview window
  if (currentIncident) {
    closeIncident(currentIncident);
  }
  
  // 3. Generate statistics
  const statistics = calculateStatistics(incidents);
  
  // 4. Generate suggestions
  const suggestions = generateSuggestions(rule, incidents, statistics);
  
  return {
    totalTriggers: incidents.length,
    incidents,
    suggestions,
    statistics,
  };
}
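
The helpers referenced above (slidingWindows, aggregateWindow, evaluateThreshold) are left undefined in this sketch. One possible shape for them, assuming queryHistoricalLogs returns per-minute buckets; the bucket and window types are assumptions:

// Minimal sketches of the helpers used above.
interface MinuteBucket {
  minute: Date;        // start of the 1-minute bucket
  count: number;       // number of matching log lines in that minute
  logs?: LogEntry[];   // optional sample of matching logs
}

interface SlidingWindow {
  start: Date;
  end: Date;
  buckets: MinuteBucket[];
  logs: LogEntry[];
}

// Slide a window of `durationMinutes` consecutive buckets across the series.
function* slidingWindows(buckets: MinuteBucket[], durationMinutes: number): Generator<SlidingWindow> {
  for (let i = 0; i + durationMinutes <= buckets.length; i++) {
    const slice = buckets.slice(i, i + durationMinutes);
    yield {
      start: slice[0].minute,
      end: new Date(slice[slice.length - 1].minute.getTime() + 60_000),
      buckets: slice,
      logs: slice.flatMap((b) => b.logs ?? []),
    };
  }
}

// Reduce a window to a single value that the threshold is compared against.
function aggregateWindow(window: SlidingWindow, aggregation: 'count' | 'rate'): number {
  const total = window.buckets.reduce((sum, b) => sum + b.count, 0);
  return aggregation === 'count' ? total : total / window.buckets.length; // rate = average per minute
}

// Only the "greater than" comparison from the examples is sketched here.
function evaluateThreshold(value: number, threshold: number): boolean {
  return value > threshold;
}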

2. Suggestion engine

function generateSuggestions(
  rule: AlertRule,
  incidents: AlertIncident[],
  stats: AlertStatistics
): AlertSuggestion[] {
  const suggestions: AlertSuggestion[] = [];
  
  // Too many triggers?
  if (incidents.length > 20) {
    suggestions.push({
      type: 'threshold_too_low',
      message: `Alert may be too sensitive (${incidents.length} triggers in the preview window)`,
      action: {
        type: 'adjust_threshold',
        currentValue: rule.threshold,
        suggestedValue: calculateOptimalThreshold(incidents, 0.3), // 30th percentile
        reason: 'Would reduce triggers to ~7/week',
      },
    });
  }
  
  // Too few triggers?
  if (incidents.length === 0) {
    suggestions.push({
      type: 'threshold_too_high',
      message: 'Alert would never have fired',
      action: {
        type: 'adjust_threshold',
        currentValue: rule.threshold,
        // NOTE: with zero incidents this would need the raw historical values, not the empty incidents list
        suggestedValue: calculateOptimalThreshold(incidents, 0.95),
        reason: 'Would catch 95th percentile spikes',
      },
    });
  }
  
  // Noisy during specific times? (example check for the 2am batch-job case;
  // a full implementation would scan every hour of the day)
  if (incidents.length > 0 && (stats.byHourOfDay[2] ?? 0) > incidents.length * 0.3) {
    const share = Math.round(((stats.byHourOfDay[2] ?? 0) / incidents.length) * 100);
    suggestions.push({
      type: 'time_filter',
      message: `${share}% of triggers happen at 2am (likely batch jobs)`,
      action: {
        type: 'add_time_filter',
        suggestedFilter: 'hour >= 6 AND hour <= 22', // Only 6am-10pm
        reason: 'Exclude scheduled maintenance windows',
      },
    });
  }
  
  return suggestions;
}
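
calculateOptimalThreshold is also left undefined in this issue. A hypothetical interpretation that picks a percentile of the observed peak values:

// Hypothetical: suggest the threshold at the given percentile of observed peak values.
function calculateOptimalThreshold(incidents: AlertIncident[], percentile: number): number {
  // With zero incidents there is nothing to rank; a real implementation would
  // fall back to percentiles of the raw per-minute values instead.
  if (incidents.length === 0) return 0;
  const peaks = incidents.map((i) => i.peakValue).sort((a, b) => a - b);
  const index = Math.min(peaks.length - 1, Math.floor(percentile * peaks.length));
  return Math.round(peaks[index]);
}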

3. Frontend UI

import { useState, useEffect } from 'react';

// Alert preview component
function AlertPreviewModal({
  rule,
  onClose,
  onApply,
}: {
  rule: AlertRule;
  onClose: () => void;
  onApply: (rule: AlertRule) => void;
}) {
  const [preview, setPreview] = useState<AlertPreview | null>(null);
  const [loading, setLoading] = useState(true);
  const [timeWindow, setTimeWindow] = useState('7d');

  useEffect(() => {
    loadPreview();
  }, [rule, timeWindow]);

  async function loadPreview() {
    setLoading(true);
    const result = await api.previewAlert(rule, timeWindow); // api: thin client over the preview endpoint
    setPreview(result);
    setLoading(false);
  }

  function applySuggestion(suggestion: AlertSuggestion) {
    // Update rule with suggested changes
    // Re-run preview with new values
  }

  return (
    <Modal>
      <h2>Alert Preview: {rule.name}</h2>

      {loading || !preview ? <Spinner /> : (
        <>
          <StatsSummary preview={preview} />
          <IncidentTimeline incidents={preview.incidents} />
          <Suggestions
            suggestions={preview.suggestions}
            onApply={applySuggestion}
          />

          <Button onClick={() => onApply(rule)}>
            Enable Alert
          </Button>
        </>
      )}
    </Modal>
  );
}
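
The api.previewAlert call in the component implies a preview endpoint on the backend. A minimal sketch, assuming Express; the route path and the parseTimeWindow helper are illustrative assumptions, not part of the proposal:

import express from 'express';

const app = express();
app.use(express.json());

// Convert shorthand like "7d" into the { start, end } window that previewAlert expects.
function parseTimeWindow(shorthand: string): { start: Date; end: Date } {
  const days = parseInt(shorthand, 10) || 7;
  const end = new Date();
  const start = new Date(end.getTime() - days * 24 * 60 * 60 * 1000);
  return { start, end };
}

app.post('/api/alerts/preview', async (req, res) => {
  try {
    const { rule, timeWindow } = req.body as { rule: AlertRule; timeWindow: string };
    const preview = await previewAlert(rule, parseTimeWindow(timeWindow));
    res.json(preview);
  } catch (err) {
    res.status(500).json({ error: 'Preview failed' });
  }
});

The frontend's api.previewAlert would then be a thin wrapper that POSTs the rule and the selected window to this route.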

4. Database optimization

-- Preview queries need to be fast
-- Ensure indexes support common alert patterns

-- Note: partial-index predicates must be immutable, so NOW() cannot be used here.
-- Either index the full table, or recreate the index periodically with a literal cutoff.
CREATE INDEX idx_logs_preview
ON logs (source_id, timestamp, level);

-- For rate-based alerts (time_bucket requires TimescaleDB)
CREATE INDEX idx_logs_time_bucket
ON logs (source_id, time_bucket('1 minute', timestamp));
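
With those indexes, queryHistoricalLogs (used by previewAlert above) can let the database pre-aggregate logs into per-minute buckets instead of streaming raw rows. A sketch assuming node-postgres and TimescaleDB's time_bucket; the table and column names match the index definitions above, everything else is an assumption:

import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// Aggregate matching logs into per-minute buckets.
// Simplification: the rule query is assumed to be a "level:<value>" filter;
// sample logs for each incident would be fetched in a separate query.
async function queryHistoricalLogs(
  query: string,
  timeWindow: { start: Date; end: Date }
): Promise<{ minute: Date; count: number }[]> {
  const level = query.replace(/^level:/, '');
  const { rows } = await pool.query(
    `SELECT time_bucket('1 minute', timestamp) AS minute, COUNT(*) AS count
       FROM logs
      WHERE level = $1
        AND timestamp BETWEEN $2 AND $3
      GROUP BY minute
      ORDER BY minute`,
    [level, timeWindow.start, timeWindow.end]
  );
  return rows.map((r) => ({ minute: r.minute, count: Number(r.count) }));
}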

Performance considerations:

  • Cache preview results (invalidate on new logs; see the caching sketch after this list)
  • Limit preview window to max 30 days
  • Sample data for very high-volume sources
  • Run preview queries asynchronously (show progress bar)
  • Pre-aggregate common metrics (errors/min, etc.)
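
A minimal sketch of the caching bullet above, using an in-process map keyed by a hash of the rule plus window. The TTL stands in for invalidation on new log ingest, and a shared cache (e.g. Redis) would be needed across instances; names and values are assumptions:

import { createHash } from 'crypto';

const previewCache = new Map<string, { result: AlertPreview; computedAt: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // assumption: a few minutes of staleness is fine while tuning

function cacheKey(rule: AlertRule, timeWindow: string): string {
  return createHash('sha256').update(JSON.stringify({ rule, timeWindow })).digest('hex');
}

async function previewAlertCached(rule: AlertRule, timeWindow: string): Promise<AlertPreview> {
  const key = cacheKey(rule, timeWindow);
  const hit = previewCache.get(key);
  if (hit && Date.now() - hit.computedAt < CACHE_TTL_MS) {
    return hit.result; // served from cache
  }
  const result = await previewAlert(rule, parseTimeWindow(timeWindow)); // parseTimeWindow from the endpoint sketch above
  previewCache.set(key, { result, computedAt: Date.now() });
  return result;
}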

Priority

  • Critical - Blocking my usage of LogTide
  • High - Would significantly improve my workflow
  • Medium - Nice to have
  • Low - Minor enhancement

Rationale: This feature dramatically reduces alert fatigue and makes Logtide's alerting system actually usable for production teams. It's the difference between "alerts I trust" and "alerts I ignore."

Target Users

  • DevOps Engineers (primary: responsible for alerting)
  • Developers (configure alerts for their services)
  • Security/SIEM Users (tune security alerts)
  • System Administrators
  • All Users

Primary benefit: Anyone who creates alerts and wants them to be useful, not noisy.

Additional Context

Why this is important:

1. Alert fatigue is a real problem:

Gartner study: "50% of alerts are ignored due to false positives"
PagerDuty: "Average team receives 200+ alerts/week, only 30 are actionable"

With preview:
→ User sees "would fire 200 times/week"
→ Adjusts threshold
→ New preview: "would fire 8 times/week"
→ Enables alert with confidence

2. Competitive differentiation:

  • Datadog: No preview feature (just trial-and-error)
  • PagerDuty: Has "alert testing" but requires live traffic
  • Grafana: No built-in preview
  • Splunk: Has backtesting but it's complex
  • Logtide advantage: Built-in, visual, actionable

3. Trust-building:
Users trust Logtide more when it helps them avoid mistakes before making them.

Real user scenario:

Without preview:

Day 1: Create alert "error rate > 50/min"
Day 2: Alert fires 30 times
Day 3: Increase to 100/min, still fires 20 times
Day 4: Increase to 200/min, never fires
Day 5: Miss critical outage because threshold too high
Week 2: Disable alert entirely, go back to manual monitoring

With preview:

Day 1: Create alert "error rate > 50/min"
Day 1: Preview shows "would fire 89 times in last week"
Day 1: Adjust to 120/min, preview shows "would fire 6 times"
Day 1: Enable alert with confidence
Week 2: Alert fires twice for real issues, team responds

Marketing angles:

"Stop guessing. Start knowing. Preview exactly how your alerts will behave before enabling them."

"Logtide's Alert Preview helps you tune thresholds in seconds, not weeks."

Future enhancements:

  • Compare multiple threshold values side-by-side
  • Export preview report for team review
  • "Seasonal" preview (compare same day last month)
  • Integration with detection packs (preview entire pack)
  • A/B testing for alert rules

Educational content opportunity:

Blog post: "The Alert Tuning Problem (And How We Solved It)"
- Explain alert fatigue
- Show preview feature
- Include best practices
- Position Logtide as thoughtfully designed

Implementation phases:

MVP (v1):

  • Basic preview: trigger count, last 7 days
  • Simple timeline of incidents
  • No suggestions (yet)

v2:

  • Add suggestions engine
  • Time-of-day analysis
  • Duration statistics

v3:

  • Multiple threshold comparison
  • Seasonal analysis
  • Team sharing

Contribution

  • I would like to work on implementing this feature
