[Feature] Event Correlation by ID (request_id, trace_id, user_id) #89

@Polliog

Feature Description

Enable users to click any identifier (request_id, trace_id, user_id, order_id, etc.) in a log entry and instantly see all related logs across services, time ranges, and log sources. This creates a narrative timeline of events related to a specific transaction, user action, or request.

Problem/Use Case

Current problem:

  • Debugging distributed systems requires manually searching for request IDs across multiple log sources
  • Users must copy-paste IDs and run multiple searches to piece together what happened
  • No visual timeline showing the sequence of events across services
  • Difficult to understand causal relationships between log entries
  • Time-consuming to debug incidents involving multiple microservices

Real-world scenario:

1. User reports: "My payment failed at 14:32"
2. DevOps finds error log with request_id: req_abc123
3. Needs to search manually:
   - API gateway logs (initial request)
   - Auth service logs (token validation)
   - Payment service logs (charge attempt)
   - Database logs (transaction records)
   - Queue logs (webhook processing)
4. Must manually correlate timestamps and piece together the story
5. Takes 20-30 minutes to reconstruct what happened

With event correlation:

1. Click request_id: req_abc123
2. See complete timeline:
   14:32:01.234 [API Gateway] Request received
   14:32:01.456 [Auth] Token validated
   14:32:02.123 [Payment] Stripe API called
   14:32:03.789 [Payment] ERROR: Card declined ← root cause
   14:32:04.012 [Database] Transaction rolled back
   14:32:04.156 [Queue] Webhook retry scheduled
3. Problem identified in 30 seconds

Proposed Solution

Core feature: Auto-detect and link common identifiers

Phase 1: ID Detection & Linking

  • Auto-detect common ID patterns in logs:
    • UUID format (8-4-4-4-12)
    • Request IDs (req_, request-, etc.)
    • Trace IDs (trace_, span_, etc.)
    • User IDs (user_, uid_, etc.)
    • Transaction IDs (txn_, order_, etc.)
    • Correlation IDs (correlation_*, x-correlation-id)
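A minimal sketch of how the prefix-based detection above might look. The names `detectIds`, `DETECTORS`, and `IdKind` are hypothetical, and the lookaheads that skip key names such as `user_id:` are an assumption about how to avoid false positives, not part of the proposal. Header-style names like x-correlation-id are not covered here because they identify a field rather than a value prefix.

```typescript
// Hypothetical detector table; the prefixes mirror the Phase 1 list.
type IdKind = "uuid" | "request" | "trace" | "user" | "transaction" | "correlation";

interface DetectedId {
  kind: IdKind;
  value: string;
  index: number; // offset of the match within the message
}

// Value characters, then two guards: the match must end at a non-ID character,
// and it must not be a key name followed by ":" or "=" (e.g. "user_id: ...").
const TAIL = "[A-Za-z0-9_-]+(?![A-Za-z0-9_-])(?!\\s*[:=])";

const DETECTORS: Array<{ kind: IdKind; pattern: RegExp }> = [
  {
    kind: "uuid", // 8-4-4-4-12 hexadecimal groups
    pattern: /\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b/g,
  },
  { kind: "request", pattern: new RegExp("\\b(?:req_|request-)" + TAIL, "g") },
  { kind: "trace", pattern: new RegExp("\\b(?:trace_|span_)" + TAIL, "g") },
  { kind: "user", pattern: new RegExp("\\b(?:user_|uid_)" + TAIL, "g") },
  { kind: "transaction", pattern: new RegExp("\\b(?:txn_|order_)" + TAIL, "g") },
  { kind: "correlation", pattern: new RegExp("\\bcorrelation_" + TAIL, "g") },
];

function detectIds(message: string): DetectedId[] {
  const found: DetectedId[] = [];
  for (const { kind, pattern } of DETECTORS) {
    pattern.lastIndex = 0; // global regexes keep state between calls
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(message)) !== null) {
      found.push({ kind, value: match[0], index: match.index });
    }
  }
  // Return matches in the order they appear in the message
  return found.sort((a, b) => a.index - b.index);
}
```

Calling `detectIds("user_id: user_789 request_id: req_abc123")` would report `user_789` and `req_abc123` but not the key names themselves.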

Phase 2: UI/UX

Log entry view:
┌─────────────────────────────────────────────────────┐
│ 2025-01-15 14:32:03 ERROR Payment service          │
│ Card declined for user_789                         │
│                                                     │
│ request_id: req_abc123  ← clickable, highlighted   │
│ user_id: user_789      ← clickable, highlighted    │
│ transaction_id: txn_xyz ← clickable, highlighted   │
└─────────────────────────────────────────────────────┘

On click → Opens correlation view:
┌─────────────────────────────────────────────────────┐
│ Timeline for request_id: req_abc123                 │
│                                                     │
│ ▼ 14:32:01.234 [API Gateway]                       │
│   POST /api/payment received                        │
│                                                     │
│ ▼ 14:32:01.456 [Auth Service]                      │
│   Token validated for user_789                      │
│                                                     │
│ ▼ 14:32:02.123 [Payment Service]                   │
│   Stripe API called                                 │
│                                                     │
│ ⚠ 14:32:03.789 [Payment Service] ERROR             │
│   Card declined: insufficient_funds                 │
│                                                     │
│ ▼ 14:32:04.012 [Database]                          │
│   Transaction rolled back                           │
└─────────────────────────────────────────────────────┘
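The correlation view above could be produced from correlated entries roughly as follows. This is a sketch only: `TimelineEntry` and `formatTimeline` are hypothetical names, and it assumes timestamps arrive as UTC ISO 8601 strings with millisecond precision.

```typescript
// Hypothetical shape of one correlated log entry.
interface TimelineEntry {
  timestamp: string; // assumed ISO 8601 in UTC, e.g. "2025-01-15T14:32:03.789Z"
  service: string;
  level: "INFO" | "WARN" | "ERROR";
  message: string;
}

// Sort entries chronologically and render one text line per entry,
// flagging errors with the same marker used in the mockup.
function formatTimeline(entries: TimelineEntry[]): string[] {
  return entries
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp))
    .map((entry) => {
      const isError = entry.level === "ERROR";
      const marker = isError ? "⚠" : "▼";
      const time = entry.timestamp.slice(11, 23); // HH:MM:SS.mmm
      const suffix = isError ? " ERROR" : "";
      return `${marker} ${time} [${entry.service}]${suffix} ${entry.message}`;
    });
}
```

Sorting on the parsed timestamp rather than on arrival order matters because entries from different services may be ingested out of sequence.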

Phase 3: Advanced Correlation

  • Correlation across multiple IDs (e.g., same user_id + different request_ids)
  • Waterfall view showing service dependencies
  • Automatic time range expansion (search ±5 minutes from clicked log)
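As an illustration of correlating across multiple IDs, the sketch below takes every log for one user_id and groups it by request_id, so each request becomes its own sub-timeline. `CorrelatedLog` and `groupByRequest` are hypothetical names, and the in-memory grouping is an assumption; a real implementation would more likely push this into the query.

```typescript
// Hypothetical shape: a log entry with its extracted identifiers.
interface CorrelatedLog {
  timestamp: number; // epoch milliseconds
  service: string;
  ids: Record<string, string | undefined>; // e.g. { user_id: "user_789", request_id: "req_abc123" }
}

// Collect all logs for one user and split them into per-request timelines.
function groupByRequest(logs: CorrelatedLog[], userId: string): Map<string, CorrelatedLog[]> {
  const groups = new Map<string, CorrelatedLog[]>();
  for (const log of logs) {
    if (log.ids["user_id"] !== userId) continue;
    // Logs without a request_id are kept together under a placeholder key
    const requestId = log.ids["request_id"] ?? "(no request_id)";
    const group = groups.get(requestId) ?? [];
    group.push(log);
    groups.set(requestId, group);
  }
  // Order each request's entries chronologically
  groups.forEach((group) => group.sort((a, b) => a.timestamp - b.timestamp));
  return groups;
}
```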

Alternatives Considered

  1. Manual search only - Current state, too time-consuming
  2. OpenTelemetry traces required - Too heavyweight, not all apps use OTLP
  3. Pre-defined correlation rules - Too rigid, doesn't handle custom IDs
  4. Graph database for relationships - Over-engineered, adds complexity
  5. Regex-based correlation - Part of solution, but needs smart defaults

Chosen approach: Smart auto-detection with user-configurable patterns

Implementation Details (Optional)

Technical approach:

1. ID Extraction at Ingestion

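// During log ingestion, extract structured IDs
interface ExtractedIDs {
  request_id?: string;
  trace_id?: string;
  span_id?: string;
  user_id?: string;
  transaction_id?: string;
  custom_ids: Record<string, string>;
}

// Assumed minimal LogEntry shape; the real type may carry more fields
interface LogEntry {
  message: string;
  fields?: Record<string, unknown>; // parsed JSON fields or OTLP attributes
}

const ID_PATTERNS = {
  request_id: /(?:request_id|req_id|requestId)[:\s=]+([a-zA-Z0-9_-]+)/,
  trace_id: /(?:trace_id|traceId|x-trace-id)[:\s=]+([a-f0-9-]+)/,
  user_id: /(?:user_id|userId|uid)[:\s=]+([a-zA-Z0-9_-]+)/,
  // ... more patterns
} as const;

function extractIDs(logEntry: LogEntry): ExtractedIDs {
  const extractedIDs: ExtractedIDs = { custom_ids: {} };

  for (const [type, pattern] of Object.entries(ID_PATTERNS)) {
    // Prefer structured fields (JSON logs, OTLP attributes) over regex matches
    const field = logEntry.fields?.[type];
    const value =
      typeof field === "string" ? field : logEntry.message.match(pattern)?.[1];
    if (value) extractedIDs[type as keyof typeof ID_PATTERNS] = value;
  }

  return extractedIDs;
}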

2. Database Schema

-- Store extracted IDs for fast correlation queries
CREATE TABLE log_identifiers (
  log_id UUID REFERENCES logs(id),
  identifier_type VARCHAR(50), -- 'request_id', 'user_id', etc.
  identifier_value VARCHAR(255),
  timestamp TIMESTAMPTZ,

  PRIMARY KEY (log_id, identifier_type)
);

-- UUID and TIMESTAMPTZ imply PostgreSQL, which has no inline INDEX clause
-- in CREATE TABLE, so the lookup index is created separately
CREATE INDEX idx_identifier_lookup
  ON log_identifiers (identifier_type, identifier_value, timestamp);

-- Query for correlation:
SELECT l.* 
FROM logs l
JOIN log_identifiers li ON l.id = li.log_id
WHERE li.identifier_type = 'request_id' 
  AND li.identifier_value = 'req_abc123'
ORDER BY l.timestamp;
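To connect step 1 to the schema, extracted IDs have to be flattened into one row per identifier. The sketch below shows one way to do that; `IdentifierRow` and `toIdentifierRows` are hypothetical names, and letting a named ID win over a custom ID of the same type is an assumption made to respect the (log_id, identifier_type) primary key.

```typescript
interface ExtractedIDs {
  request_id?: string;
  trace_id?: string;
  span_id?: string;
  user_id?: string;
  transaction_id?: string;
  custom_ids: Record<string, string>;
}

// Hypothetical row shape mirroring the log_identifiers columns.
interface IdentifierRow {
  log_id: string;
  identifier_type: string;
  identifier_value: string;
  timestamp: string;
}

const MAX_VALUE_LENGTH = 255; // identifier_value is VARCHAR(255)

function toIdentifierRows(logId: string, timestamp: string, ids: ExtractedIDs): IdentifierRow[] {
  const { custom_ids, ...named } = ids;
  const byType = new Map<string, string>();

  // Named IDs first; a custom ID with the same type is ignored so that
  // each (log_id, identifier_type) pair appears only once.
  for (const [type, value] of Object.entries(named)) {
    if (typeof value === "string" && value !== "") byType.set(type, value);
  }
  for (const [type, value] of Object.entries(custom_ids)) {
    if (value !== "" && !byType.has(type)) byType.set(type, value);
  }

  const rows: IdentifierRow[] = [];
  byType.forEach((value, type) => {
    // Skip values that would not fit the column instead of truncating them
    if (value.length > MAX_VALUE_LENGTH) return;
    rows.push({ log_id: logId, identifier_type: type, identifier_value: value, timestamp });
  });
  return rows;
}
```

Skipping oversized values rather than truncating them avoids two different IDs collapsing into the same stored prefix.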

3. UI Implementation

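// Make IDs clickable in log viewer
const escapeHtml = (s: string): string =>
  s.replace(/[&<>"']/g, (c) => `&#${c.charCodeAt(0)};`);

function renderLogMessage(message: string, extractedIDs: ExtractedIDs): string {
  // Flatten the named IDs and custom_ids into a value → type lookup
  const { custom_ids, ...named } = extractedIDs;
  const typeByValue = new Map<string, string>();
  for (const [type, value] of Object.entries({ ...custom_ids, ...named })) {
    if (value) typeByValue.set(value, type);
  }
  if (typeByValue.size === 0) return escapeHtml(message);

  // Match longer values first so "req_abc123" wins over "req_abc"
  const pattern = [...typeByValue.keys()]
    .sort((a, b) => b.length - a.length)
    .map((v) => v.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");

  // Single pass: escape plain text and wrap every ID occurrence in a link.
  // Escaping matters because log messages are untrusted input.
  let rendered = "";
  let last = 0;
  for (const match of message.matchAll(new RegExp(pattern, "g"))) {
    const start = match.index ?? 0;
    const value = escapeHtml(match[0]);
    const type = escapeHtml(typeByValue.get(match[0]) ?? "");
    rendered += escapeHtml(message.slice(last, start));
    rendered += `<a class="correlation-link" data-type="${type}" data-value="${value}">${value}</a>`;
    last = start + match[0].length;
  }
  return rendered + escapeHtml(message.slice(last));
}

// Handle click
function onCorrelationLinkClick(type: string, value: string) {
  // Open correlation view modal or sidebar
  showCorrelationTimeline(type, value);
}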

4. Configuration UI

# User-configurable correlation patterns
# Patterns are single-quoted: "\s" is not a valid escape in a double-quoted YAML string
correlation_patterns:
  - name: "Request ID"
    pattern: '(?:request_id|req)[:\s=]+([a-zA-Z0-9_-]+)'
    enabled: true
  
  - name: "Order ID"
    pattern: '(?:order_id|order)[:\s=]+([a-zA-Z0-9_-]+)'
    enabled: true
    
  - name: "Custom Job ID"
    pattern: 'job_([a-f0-9-]+)'
    enabled: true
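Because these patterns come from users, they would need validating before use at ingestion. The sketch below assumes the YAML has already been parsed into plain objects; `compilePatterns`, `matchPatterns`, and the rule that every pattern must contain a capture group for the ID value are assumptions for illustration.

```typescript
// Hypothetical shape of one parsed correlation_patterns entry.
interface CorrelationPatternConfig {
  name: string;
  pattern: string;
  enabled: boolean;
}

interface CompiledPattern {
  name: string;
  regex: RegExp;
}

// Appending an empty alternative makes the regex match "", so exec() always
// returns an array whose length is 1 + the number of capture groups.
function countCaptureGroups(pattern: string): number {
  const match = new RegExp(pattern + "|").exec("");
  return match ? match.length - 1 : 0;
}

// Compile enabled patterns and collect human-readable errors for the rest.
function compilePatterns(configs: CorrelationPatternConfig[]): {
  compiled: CompiledPattern[];
  errors: string[];
} {
  const compiled: CompiledPattern[] = [];
  const errors: string[] = [];
  for (const config of configs) {
    if (!config.enabled) continue;
    try {
      const regex = new RegExp(config.pattern);
      if (countCaptureGroups(config.pattern) < 1) {
        errors.push(`${config.name}: pattern needs a capture group for the ID value`);
        continue;
      }
      compiled.push({ name: config.name, regex });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${config.name}: invalid regex (${reason})`);
    }
  }
  return { compiled, errors };
}

// Apply every compiled pattern and return the first captured value per name.
function matchPatterns(message: string, patterns: CompiledPattern[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (const { name, regex } of patterns) {
    const value = regex.exec(message)?.[1];
    if (value !== undefined) result[name] = value;
  }
  return result;
}
```

Reporting errors per pattern instead of throwing lets the configuration UI show which entry is broken while the valid ones keep working.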

Performance considerations:

  • Index the log_identifiers table on (identifier_type, identifier_value, timestamp) so correlation lookups do not scan the whole table
  • Limit correlation queries to reasonable time windows (default ±5min, max ±1 hour)
  • Cache common correlation queries
  • Lazy-load timeline entries (load more as user scrolls)
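The time-window limit and query cache could be sketched as below. `correlationWindow`, `CorrelationCache`, and the TTL-based eviction are hypothetical; only the ±5 minute default and ±1 hour cap come from the list above.

```typescript
const DEFAULT_WINDOW_MS = 5 * 60 * 1000; // ±5 minutes
const MAX_WINDOW_MS = 60 * 60 * 1000; // ±1 hour

// Compute the search range around the clicked log, clamped to the maximum.
function correlationWindow(
  clickedAtMs: number,
  requestedMs: number = DEFAULT_WINDOW_MS
): { from: number; to: number } {
  const half = Math.min(Math.max(requestedMs, 0), MAX_WINDOW_MS);
  return { from: clickedAtMs - half, to: clickedAtMs + half };
}

// Minimal time-to-live cache for correlation results (hypothetical design).
class CorrelationCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop stale results on access
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Cache key covering the identifier and the window it was queried with.
function cacheKey(type: string, value: string, range: { from: number; to: number }): string {
  return `${type}:${value}:${range.from}-${range.to}`;
}
```

Including the window in the cache key keeps a widened search from being answered with results cached for a narrower one.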

Priority

  • Critical - Blocking my usage of LogTide
  • High - Would significantly improve my workflow
  • Medium - Nice to have
  • Low - Minor enhancement

Rationale: This is a core differentiator that turns Logtide from a "log search" tool into an "incident investigation tool". It is hard to replicate well, and in the scenario above it cuts the time to reconstruct an incident from 20-30 minutes to about 30 seconds.

Target Users

  • DevOps Engineers (primary: incident response)
  • Developers (primary: debugging distributed systems)
  • Security/SIEM Users (secondary benefit)
  • System Administrators
  • All Users

Primary audience: Teams running microservices, distributed systems, or any architecture where a single user action involves multiple services/components.

Additional Context

Why this is a moat:

  • Grep can't do this across services
  • Basic log search tools don't provide timeline visualization
  • Implementing it well requires a deep understanding of distributed tracing concepts
  • The UX matters enormously (auto-detection vs manual configuration)

Competitive analysis:

  • Datadog APM: Has this via distributed tracing, but requires APM agent installation ($$$)
  • Elastic APM: Similar to Datadog, heavyweight setup
  • Grafana Loki: No built-in correlation, requires LogQL wizardry
  • Splunk: Has transaction correlation, but enterprise-only and complex
  • Logtide advantage: Works with plain logs, no agents required

User testimonial (hypothetical):

"Before Logtide correlation: 20 minutes to debug a payment failure across 5 services. After: 30 seconds. I click the request_id and see the entire story."

Marketing message:

"Click any ID, see the whole story. Logtide automatically correlates logs across your entire system. No agents, no distributed tracing setup required."

Future enhancements:

  • Service dependency graph (automatically discover which services talk to each other)
  • Anomaly highlighting (automatically flag unusual patterns in correlated logs)
  • Export timeline as shareable link for incident reports
  • Integration with alerting (when alert fires, show correlated timeline)

Example workflows:

Debugging production incident:

1. Alert fires: "Payment service errors spiking"
2. Click alert → See recent error logs
3. Click request_id on any error
4. Timeline shows:
   - Frontend made request
   - API validated input
   - Payment service called Stripe
   - Stripe returned timeout
   - Database transaction rolled back
   - Queue scheduled retry
5. Root cause identified: Stripe API degradation

Understanding user journey:

1. Support ticket: "User says checkout is broken"
2. Search for user_id: user_12345
3. See all actions for that user:
   - Viewed product page
   - Added to cart
   - Started checkout
   - ERROR: Tax calculation failed
   - Abandoned cart
4. Fix tax calculation bug

Contribution

  • I would like to work on implementing this feature
