fix: comprehensive crash prevention initiative #11

@sadnow

Summary

Systematic elimination of crash sources and silent error swallowing across the codebase.

Status: IN PROGRESS

Started: 2026-02-03
Commits: 21 (as of edb3624)
Silent catches fixed: 27 of 42 (64% complete)


Completed Fixes

Phase 1: Critical Crash Vectors (Commits 1-17)

Upstream Cherry-Picks

  • 6863960 - Zombie process prevention

    • LSP client: 5s timeout + SIGKILL escalation
    • interactive-bash: Await proc.exited after kill
    • tmux-utils: Process cleanup improvements
  • 91bdef8 - Tmux orphan process fix

    • Send Ctrl+C before kill-pane to prevent orphans

Our Critical Fixes

  • bec332e - Unhandled Promise rejection in processKey()

    • Added .catch() to fire-and-forget processKey() call
    • Added try/catch around acquire() for cancellation handling
    • File: src/features/background-agent/manager.ts
  • 7f3eaca - findTmuxPath() timeout bug

    • Added 5s timeout using Promise.race()
    • Kills process on timeout to prevent zombie
    • File: src/tools/interactive-bash/utils.ts
  • c319946 - LSP client startup timeout

    • Added 30s STARTUP_TIMEOUT constant
    • Wraps client.start() and initialize() with Promise.race()
    • File: src/tools/lsp/client.ts
  • ff17db2 - execSync blocks entire thread

    • Replaced execSync("gh auth token") with async spawn + 5s timeout
    • File: src/features/copilot-usage/index.ts
  • 2da3fe9 - UsageTracker unbounded memory growth

    • Added lastPruneTime timestamp
    • Call pruneOldRecords() in flush() every hour
    • File: src/features/usage-tracker/tracker.ts
  • c3bc23b - BackgroundManager timer accumulation leak

    • Added removalTimers Map to track setTimeout IDs
    • Clear all timers in shutdown()
    • File: src/features/background-agent/manager.ts
  • 5854c30 - Global error handlers (CRITICAL)

    • Added process.on('unhandledRejection') and process.on('uncaughtException')
    • Flushes session state before crash
    • File: src/index.ts
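
For reference, the timer-tracking fix (c3bc23b) can be sketched roughly as below. This is a minimal illustration, not the code in manager.ts: only the `removalTimers` Map and `shutdown()` come from the fix description, while `TaskRegistry`, `add()`, `scheduleRemoval()`, and `pendingTimerCount` are hypothetical names.

```typescript
// Sketch of tracking setTimeout IDs so they can be cleared on shutdown.
// Only removalTimers and shutdown() are named in the fix; the rest is assumed.
class TaskRegistry {
  private tasks = new Map<string, string>();
  // One pending removal timer per task ID.
  private removalTimers = new Map<string, ReturnType<typeof setTimeout>>();

  add(taskId: string, description: string): void {
    this.tasks.set(taskId, description);
  }

  // Schedule removal of a finished task after delayMs.
  scheduleRemoval(taskId: string, delayMs: number): void {
    // Replace any existing timer for this task so timers never pile up.
    const existing = this.removalTimers.get(taskId);
    if (existing !== undefined) clearTimeout(existing);

    const timer = setTimeout(() => {
      this.removalTimers.delete(taskId);
      this.tasks.delete(taskId);
    }, delayMs);
    this.removalTimers.set(taskId, timer);
  }

  get pendingTimerCount(): number {
    return this.removalTimers.size;
  }

  // Clear every outstanding timer so nothing keeps the process alive.
  shutdown(): void {
    for (const timer of this.removalTimers.values()) clearTimeout(timer);
    this.removalTimers.clear();
    this.tasks.clear();
  }
}
```

Keying the Map by task ID means a second `scheduleRemoval()` for the same task replaces the earlier timer instead of adding another one.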

Budget Orchestrator Fixes (Commits 10-12)

  • ✅ Fixed stale model references (glm-4.7-flash, qwen3-coder-flash don't exist)
  • ✅ Fixed OPENCODE_FREE_MODELS set
  • ✅ Fixed gpt-5-nano BYOK bypass
  • Files: global-override.ts, tiers.ts, presets.ts

Phase 2: Silent Error Elimination (Commits 18-21)

Pattern: Replace .catch(() => {}) with .catch((err: unknown) => { log(...) })

  • ded239f - BackgroundManager & SkillMcpManager (5 catches)
  • 412aa43 - auto-update-checker (8 catches)
  • f9a74a8 - task-toast-manager (2 catches)
  • abb684d - claude-code-hooks (2 catches)
  • edb3624 - anthropic-context-window-limit-recovery (10 catches)

Total fixed: 27 of 42 silent catches (64%)
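
A rough before/after of the pattern above. The signature of the project's `log(...)` helper is not shown in this issue, so the `log()` below is a stand-in, and `describeError()`, `sendNotification()`, and `notifySafely()` are hypothetical names used only for illustration.

```typescript
// Stand-in logger; the real log(...) helper may take different arguments.
function log(message: string, detail?: unknown): void {
  console.error(`[silent-catch] ${message}`, detail);
}

// Turn an unknown rejection value into a readable string.
function describeError(err: unknown): string {
  return err instanceof Error ? `${err.name}: ${err.message}` : String(err);
}

// Hypothetical best-effort operation that may reject.
async function sendNotification(fail: boolean): Promise<void> {
  if (fail) throw new Error("notification backend unavailable");
}

// Before: the failure disappears without a trace.
//   sendNotification(true).catch(() => {});

// After: the failure is still non-fatal, but it is recorded.
function notifySafely(fail: boolean): Promise<void> {
  return sendNotification(fail).catch((err: unknown) => {
    log("sendNotification failed", describeError(err));
  });
}
```

Typing the parameter as `unknown` forces an explicit check before reading `.message`, which matters because a rejection value is not guaranteed to be an `Error`.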


In Progress

Remaining Silent Catches (~15)

Next targets:

  1. ralph-loop (3 catches)
  2. session-notification-utils (6 catches)
  3. session-notification (8 catches)
  4. session-recovery (2 catches)
  5. Others (scattered)

Full list available via:

grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts"

Not Yet Started

1. MCP Connection Timeout Protection

Priority: High
Pattern: Similar to LSP client fix (commit c319946)

  • File: src/features/skill-mcp-manager/manager.ts
  • Function: getOrCreateClient()
  • Issue: No timeout on MCP client connections
  • Fix: Add 30s timeout wrapper using Promise.race()
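
One possible shape for this wrapper, as a sketch only. The real MCP client type and the body of getOrCreateClient() are not shown in this issue, so `ConnectableClient`, `withTimeout()`, `connectWithTimeout()`, and `CONNECT_TIMEOUT_MS` are assumptions; the 30s value mirrors the LSP STARTUP_TIMEOUT.

```typescript
// Assumed default, mirroring the 30s STARTUP_TIMEOUT used for the LSP client.
const CONNECT_TIMEOUT_MS = 30_000;

// Reject if `operation` has not settled within `ms`. The timer is always cleared.
async function withTimeout<T>(operation: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([operation, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical stand-in for an MCP client.
interface ConnectableClient {
  connect(): Promise<void>;
  close(): Promise<void>;
}

async function connectWithTimeout(
  client: ConnectableClient,
  ms: number = CONNECT_TIMEOUT_MS,
): Promise<void> {
  try {
    await withTimeout(client.connect(), ms, "MCP connect");
  } catch (err: unknown) {
    // Promise.race() does not cancel the losing promise, so clean up explicitly.
    await client.close().catch((closeErr: unknown) => {
      console.error("close() after failed connect also failed", closeErr);
    });
    throw err;
  }
}
```

Because Promise.race() only stops waiting and does not abort the underlying connection attempt, the explicit `close()` on failure is what prevents a hung connection from leaking.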

2. Session State Race Conditions

Priority: High
Files: src/features/claude-code-session-state/state.ts

  • Audit Map/Set operations for atomicity
  • Check concurrent read/write patterns
  • Consider adding mutex for critical sections
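
Note for the audit: individual Map/Set calls cannot interleave in single-threaded JavaScript, so the likely risk is a read-modify-write sequence that spans an `await`. If a mutex turns out to be needed, a minimal promise-chain version could look like the sketch below. `Mutex`, `sessionState`, `stateLock`, and `incrementCounter()` are hypothetical; the actual shape of state.ts is not shown here.

```typescript
// Minimal promise-chain mutex: critical sections run one at a time, in call order.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(criticalSection: () => Promise<T> | T): Promise<T> {
    const result = this.tail.then(() => criticalSection());
    // Keep the chain usable even if a critical section throws.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Hypothetical state and lock, for illustration only.
const sessionState = new Map<string, number>();
const stateLock = new Mutex();

// Read-modify-write with an await in the middle: unsafe without the lock.
async function incrementCounter(key: string): Promise<void> {
  await stateLock.runExclusive(async () => {
    const current = sessionState.get(key) ?? 0;
    await Promise.resolve(); // simulated async gap (e.g. disk or network I/O)
    sessionState.set(key, current + 1);
  });
}
```

Without the lock, concurrent callers would all read the same `current` value before any of them wrote back, and updates would be lost.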

3. Event Listener Leaks

Priority: Medium

Search for:

  • addEventListener / .on() calls without corresponding cleanup
  • setInterval timers not cleared in shutdown()
  • Starting point: grep -rn "addEventListener\|\.on(" src/ --include="*.ts"
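
One way to make cleanup hard to forget is to register listeners and intervals through a small registry that can undo all of them at shutdown. The sketch below uses Node's EventEmitter; `ListenerRegistry` and its methods are hypothetical and not part of the codebase.

```typescript
import { EventEmitter } from "node:events";

type Listener = (...args: unknown[]) => void;

// Records every subscription so a single dispose() call can undo all of them.
class ListenerRegistry {
  private disposers: Array<() => void> = [];

  // Subscribe and remember how to unsubscribe.
  on(emitter: EventEmitter, event: string, listener: Listener): void {
    emitter.on(event, listener);
    this.disposers.push(() => {
      emitter.off(event, listener);
    });
  }

  // Start an interval and remember how to stop it.
  every(intervalMs: number, callback: () => void): void {
    const id = setInterval(callback, intervalMs);
    this.disposers.push(() => clearInterval(id));
  }

  // Intended to be called from shutdown(); safe to call more than once.
  dispose(): void {
    for (const dispose of this.disposers.splice(0)) dispose();
  }
}
```

During the audit, `emitter.listenerCount(eventName)` before and after shutdown gives a quick check that nothing was left attached.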

4. Stress Testing

Priority: Low (after fixes complete)

  • Spawn 100 background tasks simultaneously
  • Test hung LSP server scenarios
  • Test hung MCP server scenarios
  • Memory leak profiling under load
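
A possible starting point for the "100 simultaneous tasks" case, sketched with a fake task since the real background-agent API is not shown in this issue. `stressSpawn()`, `StressResult`, and `fakeBackgroundTask()` are hypothetical names.

```typescript
interface StressResult {
  fulfilled: number;
  rejected: number;
  elapsedMs: number;
}

// Launch `count` tasks at once and wait for every one of them to settle.
async function stressSpawn(
  count: number,
  task: (index: number) => Promise<unknown>,
): Promise<StressResult> {
  const start = Date.now();
  const settled = await Promise.allSettled(
    Array.from({ length: count }, (_, index) => task(index)),
  );
  const rejected = settled.filter((r) => r.status === "rejected").length;
  return { fulfilled: count - rejected, rejected, elapsedMs: Date.now() - start };
}

// Fake task: every 10th one fails so that error paths are exercised too.
async function fakeBackgroundTask(index: number): Promise<number> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  if (index % 10 === 0) throw new Error(`task ${index} failed`);
  return index;
}
```

Using Promise.allSettled() rather than Promise.all() keeps one failing task from hiding the outcome of the other 99.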

Test Results (Post-Fixes)

All tests passing:

  • ✅ background-agent: 89 pass, 0 fail
  • ✅ usage-tracker: 38 pass, 0 fail
  • ✅ lsp: 19 pass, 0 fail
  • ✅ copilot-usage: 15 pass, 0 fail
  • ✅ skill-mcp-manager: 38 pass, 0 fail

Key Insights

Root Causes Identified:

  1. Concurrency bugs: Fire-and-forget promises, slot leaks, acquire cancellation
  2. Timeout bugs: LSP startup, MCP connections, execSync, findTmuxPath
  3. Memory leaks: UsageTracker records, BackgroundManager timers, task retention
  4. Process leaks: Missing proc.kill(), zombie processes, orphaned tmux panes
  5. Silent errors: 42 .catch(() => {}) hiding production failures
  6. Missing handlers: No global unhandledRejection/uncaughtException

Impact:

  • Eliminated the 7 most critical crash vectors
  • 64% of silent error swallowing eliminated
  • All tests still passing (no regressions)

Next Steps

  1. ✅ Document guidelines in CLAUDE.md (commit c43b28e)
  2. 🔄 Fix remaining 15 silent catches
  3. ⏳ Add MCP connection timeout protection
  4. ⏳ Audit session state for race conditions
  5. ⏳ Audit for event listener leaks
  6. ⏳ Create stress tests

Related Issues

  • None yet (this is the first comprehensive crash tracking issue)

Commands for Reference

# Count remaining silent catches
grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts" | wc -l

# List all silent catches with context
grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts"

# Verify typecheck
bun run typecheck

# Run tests
bun test src/features/background-agent/
bun test src/features/usage-tracker/
bun test src/tools/lsp/

# View recent commits
git log --oneline -25
