Summary
Systematic elimination of crash sources and silent error swallowing across the codebase.
Status: IN PROGRESS
Started: 2026-02-03
Commits: 21 (as of edb3624)
Silent catches fixed: 27 of 42 (64% complete)
Completed Fixes
Phase 1: Critical Crash Vectors (Commits 1-17)
Upstream Cherry-Picks
- ✅ 6863960 - Zombie process prevention
  - LSP client: 5s timeout + SIGKILL escalation
  - interactive-bash: Await proc.exited after kill
  - tmux-utils: Process cleanup improvements
- ✅ 91bdef8 - Tmux orphan process fix
  - Send Ctrl+C before kill-pane to prevent orphans
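The "timeout + SIGKILL escalation" and "await proc.exited after kill" ideas above can be sketched roughly as follows. This is a minimal illustration, not the upstream patch: the `KillableProcess` interface and the `killWithEscalation` name are assumptions, loosely modeled on a subprocess handle that exposes `kill()` and an `exited` promise.

```typescript
// Hypothetical minimal process shape (an assumption, not the project's type):
// anything with kill() and an `exited` promise fits.
interface KillableProcess {
  kill(signal?: string): void;
  exited: Promise<unknown>;
}

/**
 * Send SIGTERM, wait up to graceMs for the process to exit, then escalate
 * to SIGKILL. Always awaits `exited` so the child is reaped.
 * Resolves to true if SIGKILL was needed.
 */
async function killWithEscalation(
  proc: KillableProcess,
  graceMs = 5_000,
): Promise<boolean> {
  proc.kill("SIGTERM");

  let cancelTimer = () => {};
  const graceExpired = new Promise<"timeout">((resolve) => {
    const id = setTimeout(() => resolve("timeout"), graceMs);
    cancelTimer = () => clearTimeout(id);
  });

  let outcome: "exited" | "timeout";
  try {
    outcome = await Promise.race([
      proc.exited.then(() => "exited" as const),
      graceExpired,
    ]);
  } finally {
    cancelTimer(); // never leave the grace timer pending
  }

  if (outcome === "exited") return false;

  proc.kill("SIGKILL");
  await proc.exited; // wait for the forced exit before returning
  return true;
}
```

Awaiting `exited` after the final kill is what keeps the caller from moving on while the child is still alive.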
Our Critical Fixes
- ✅ bec332e - Unhandled Promise rejection in processKey()
  - Added .catch() to fire-and-forget processKey() call
  - Added try/catch around acquire() for cancellation handling
  - File: src/features/background-agent/manager.ts
- ✅ 7f3eaca - findTmuxPath() timeout bug
  - Added 5s timeout using Promise.race()
  - Kills process on timeout to prevent zombie
  - File: src/tools/interactive-bash/utils.ts
- ✅ c319946 - LSP client startup timeout
  - Added 30s STARTUP_TIMEOUT constant
  - Wraps client.start() and initialize() with Promise.race()
  - File: src/tools/lsp/client.ts
- ✅ ff17db2 - execSync blocks entire thread
  - Replaced execSync("gh auth token") with async spawn + 5s timeout
  - File: src/features/copilot-usage/index.ts
- ✅ 2da3fe9 - UsageTracker unbounded memory growth
  - Added lastPruneTime timestamp
  - Call pruneOldRecords() in flush() every hour
  - File: src/features/usage-tracker/tracker.ts
- ✅ c3bc23b - BackgroundManager timer accumulation leak
  - Added removalTimers Map to track setTimeout IDs
  - Clear all timers in shutdown()
  - File: src/features/background-agent/manager.ts
- ✅ 5854c30 - Global error handlers (CRITICAL)
  - Added process.on('unhandledRejection') and process.on('uncaughtException')
  - Flushes session state before crash
  - File: src/index.ts
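For the timer accumulation fix (c3bc23b) in the list above, the bookkeeping could look something like this. It is only a sketch: `TaskRegistry`, `complete()`, and the two getters are hypothetical stand-ins, and only the `removalTimers` Map and `shutdown()` names come from the notes above.

```typescript
// Hypothetical stand-in for the manager; only the timer bookkeeping is shown.
class TaskRegistry {
  private tasks = new Map<string, { finishedAt: number }>();
  // Pending removal timers keyed by task id, so shutdown() can cancel them.
  private removalTimers = new Map<string, ReturnType<typeof setTimeout>>();

  /** Mark a task finished and schedule its removal after retainMs. */
  complete(taskId: string, retainMs: number): void {
    this.tasks.set(taskId, { finishedAt: Date.now() });

    // Replace any earlier timer for this task instead of stacking a second one.
    const existing = this.removalTimers.get(taskId);
    if (existing !== undefined) clearTimeout(existing);

    const timer = setTimeout(() => {
      this.removalTimers.delete(taskId);
      this.tasks.delete(taskId);
    }, retainMs);
    this.removalTimers.set(taskId, timer);
  }

  /** Cancel every pending timer so nothing fires after shutdown. */
  shutdown(): void {
    this.removalTimers.forEach((timer) => clearTimeout(timer));
    this.removalTimers.clear();
    this.tasks.clear();
  }

  get pendingTimers(): number {
    return this.removalTimers.size;
  }

  get retainedTasks(): number {
    return this.tasks.size;
  }
}
```

Keying the Map by task id means a repeated `complete()` call replaces the old timer rather than accumulating a new one.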
Budget Orchestrator Fixes (Commits 10-12)
- ✅ Fixed stale model references (glm-4.7-flash, qwen3-coder-flash don't exist)
- ✅ Fixed OPENCODE_FREE_MODELS set
- ✅ Fixed gpt-5-nano BYOK bypass
- Files: global-override.ts, tiers.ts, presets.ts
Phase 2: Silent Error Elimination (Commits 18-21)
Pattern: Replace .catch(() => {}) with .catch((err: unknown) => { log(...) })
- ✅ ded239f - BackgroundManager & SkillMcpManager (5 catches)
- ✅ 412aa43 - auto-update-checker (8 catches)
- ✅ f9a74a8 - task-toast-manager (2 catches)
- ✅ abb684d - claude-code-hooks (2 catches)
- ✅ edb3624 - anthropic-context-window-limit-recovery (10 catches)
Total fixed: 27 of 42 silent catches (64%)
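The replacement pattern used in this phase can be sketched as below. The `fireAndForget` helper, `describeError`, and the `LogFn` type are illustrative assumptions; the project's real log() signature is not shown in this issue.

```typescript
// Hypothetical logger type; swap in the project's own logger.
type LogFn = (message: string, detail?: unknown) => void;

const defaultLog: LogFn = (message, detail) => console.error(message, detail);

/** Turn an unknown rejection reason into a printable message. */
function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

/**
 * Run a fire-and-forget promise without swallowing its failure.
 * Before: task.catch(() => {})
 * After:  the rejection is logged with a label that identifies the call site.
 */
function fireAndForget(
  label: string,
  task: Promise<unknown>,
  log: LogFn = defaultLog,
): void {
  task.catch((err: unknown) => {
    log(`[${label}] background task failed: ${describeError(err)}`, err);
  });
}
```

Typing the reason as `unknown` forces a check before reading `.message`, since a rejection reason is not guaranteed to be an `Error`.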
In Progress
Remaining Silent Catches (~15)
Next targets:
- ralph-loop (3 catches)
- session-notification-utils (6 catches)
- session-notification (8 catches)
- session-recovery (2 catches)
- Others (scattered)
Full list available via:
grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts"
Not Yet Started
1. MCP Connection Timeout Protection
Priority: High
Pattern: Similar to LSP client fix (commit c319946)
- File: src/features/skill-mcp-manager/manager.ts
- Function: getOrCreateClient()
- Issue: No timeout on MCP client connections
- Fix: Add 30s timeout wrapper using Promise.race()
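One possible shape for the timeout wrapper is sketched here. `withTimeout`, `TimeoutError`, and `CONNECT_TIMEOUT_MS` are assumed names for illustration; the real getOrCreateClient() internals are not shown in this issue.

```typescript
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

/**
 * Race `work` against a timer. On timeout, reject with TimeoutError and run
 * the optional cleanup (e.g. closing a half-open connection).
 * The timer is always cleared, so it cannot keep the process alive.
 */
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string,
  onTimeout?: () => void,
): Promise<T> {
  let cancelTimer = () => {};
  const deadline = new Promise<never>((_, reject) => {
    const id = setTimeout(() => {
      reject(new TimeoutError(label, ms));
      onTimeout?.();
    }, ms);
    cancelTimer = () => clearTimeout(id);
  });

  try {
    return await Promise.race([work, deadline]);
  } finally {
    cancelTimer();
  }
}

// Assumed constant, mirroring the 30s value proposed above.
const CONNECT_TIMEOUT_MS = 30_000;
```

A call site might look like `await withTimeout(connect(), CONNECT_TIMEOUT_MS, "MCP connect", () => transport.close())`, where `connect` and `transport` are placeholders. Note that Promise.race() does not cancel the losing promise, so the cleanup callback is what actually releases the hung connection.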
2. Session State Race Conditions
Priority: High
Files: src/features/claude-code-session-state/state.ts
- Audit Map/Set operations for atomicity
- Check concurrent read/write patterns
- Consider adding mutex for critical sections
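If the audit finds check-then-act sequences that span an `await`, a promise-chain mutex is one lightweight option. Individual Map/Set calls cannot interleave in single-threaded JavaScript, so the risk is only in sequences that yield in the middle. The `Mutex` class, `subagentsBySession` map, and `registerSubagent` function below are hypothetical and only illustrate the idea.

```typescript
/** Serializes async critical sections by chaining them on one promise. */
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  /** Run fn after every previously queued critical section has finished. */
  runExclusive<T>(fn: () => Promise<T> | T): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain alive even if fn throws, so later callers still run.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Hypothetical session state; the real module's shape is not shown in this issue.
const subagentsBySession = new Map<string, Set<string>>();
const stateLock = new Mutex();

async function registerSubagent(
  sessionId: string,
  agentId: string,
  isAllowed: (id: string) => Promise<boolean>,
): Promise<boolean> {
  return stateLock.runExclusive(async () => {
    const agents = subagentsBySession.get(sessionId) ?? new Set<string>();
    // Without the lock, a concurrent caller could run during this await,
    // create its own Set, and one of the two writes would be lost.
    if (!(await isAllowed(agentId))) return false;
    agents.add(agentId);
    subagentsBySession.set(sessionId, agents);
    return true;
  });
}
```

A single global lock is the simplest choice; a per-session lock would reduce contention at the cost of more bookkeeping.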
3. Event Listener Leaks
Priority: Medium
Search for:
- addEventListener/.on() without corresponding cleanup
- setInterval not cleared in shutdown
- Patterns:
grep -rn "addEventListener\|\.on(" src/ --include="*.ts"
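One way to make cleanup hard to forget is to record an undo callback at registration time. The `Disposables` class and the minimal `Emitter` interface below are assumptions for illustration, not existing project code; any object with matching `on`/`off` methods would fit.

```typescript
// Minimal emitter shape assumed for this sketch.
interface Emitter {
  on(event: string, handler: (...args: unknown[]) => void): unknown;
  off(event: string, handler: (...args: unknown[]) => void): unknown;
}

/** Collects cleanup callbacks so shutdown can undo every registration. */
class Disposables {
  private cleanups: Array<() => void> = [];

  /** Attach a listener and remember how to detach it. */
  listen(
    emitter: Emitter,
    event: string,
    handler: (...args: unknown[]) => void,
  ): void {
    emitter.on(event, handler);
    this.cleanups.push(() => emitter.off(event, handler));
  }

  /** Start an interval and remember how to stop it. */
  interval(fn: () => void, ms: number): void {
    const id = setInterval(fn, ms);
    this.cleanups.push(() => clearInterval(id));
  }

  /** Undo everything in reverse order; a second call is a no-op. */
  disposeAll(): void {
    while (this.cleanups.length > 0) {
      const cleanup = this.cleanups.pop();
      if (cleanup) cleanup();
    }
  }

  get size(): number {
    return this.cleanups.length;
  }
}
```

A component would create one `Disposables` instance, register through it, and call `disposeAll()` from its shutdown path.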
4. Stress Testing
Priority: Low (after fixes complete)
- Spawn 100 background tasks simultaneously
- Test hung LSP server scenarios
- Test hung MCP server scenarios
- Memory leak profiling under load
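A starting point for the "100 simultaneous tasks" scenario might look like this. The `stressSpawn` harness and `StressReport` shape are hypothetical; `spawnTask` stands in for whatever launches a real background task, and tasks that exceed the per-task deadline are counted as hung rather than awaited forever.

```typescript
interface StressReport {
  fulfilled: number;
  rejected: number;
  hung: number;
}

/**
 * Launch `count` tasks at once and tally how each one ended.
 * A task that has not settled within perTaskTimeoutMs is counted as hung.
 */
async function stressSpawn(
  count: number,
  spawnTask: (index: number) => Promise<unknown>,
  perTaskTimeoutMs: number,
): Promise<StressReport> {
  const HUNG = Symbol("hung");

  const runs = Array.from({ length: count }, (_, index) => {
    let cancelTimer = () => {};
    const deadline = new Promise<typeof HUNG>((resolve) => {
      const id = setTimeout(() => resolve(HUNG), perTaskTimeoutMs);
      cancelTimer = () => clearTimeout(id);
    });
    // The async wrapper turns a synchronous throw into a rejection.
    const task = (async () => spawnTask(index))();
    return Promise.race([task, deadline]).finally(() => cancelTimer());
  });

  const settled = await Promise.allSettled(runs);
  const report: StressReport = { fulfilled: 0, rejected: 0, hung: 0 };
  for (const result of settled) {
    if (result.status === "rejected") report.rejected++;
    else if (result.value === HUNG) report.hung++;
    else report.fulfilled++;
  }
  return report;
}
```

Against the real manager, a non-zero `hung` count after the timeout fixes would point at a code path that still lacks a deadline.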
Test Results (Post-Fixes)
All tests passing:
- ✅ background-agent: 89 pass, 0 fail
- ✅ usage-tracker: 38 pass, 0 fail
- ✅ lsp: 19 pass, 0 fail
- ✅ copilot-usage: 15 pass, 0 fail
- ✅ skill-mcp-manager: 38 pass, 0 fail
Key Insights
Root Causes Identified:
- Concurrency bugs: Fire-and-forget promises, slot leaks, acquire cancellation
- Timeout bugs: LSP startup, MCP connections, execSync, findTmuxPath
- Memory leaks: UsageTracker records, BackgroundManager timers, task retention
- Process leaks: Missing proc.kill(), zombie processes, orphaned tmux panes
- Silent errors: 42 .catch(() => {}) calls hiding production failures
- Missing handlers: No global unhandledRejection/uncaughtException
Impact:
- Eliminated the 7 most critical crash vectors
- 64% of silent error swallowing eliminated
- All tests still passing (no regressions)
Next Steps
- ✅ Document guidelines in CLAUDE.md (commit c43b28e)
- 🔄 Fix remaining 15 silent catches
- ⏳ Add MCP connection timeout protection
- ⏳ Audit session state for race conditions
- ⏳ Audit for event listener leaks
- ⏳ Create stress tests
Related Issues
- None yet (this is the first comprehensive crash tracking issue)
Commands for Reference
# Count remaining silent catches
grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts" | wc -l
# List all silent catches with context
grep -rn "\.catch(() => {})" src/ --include="*.ts" --exclude="*.test.ts"
# Verify typecheck
bun run typecheck
# Run tests
bun test src/features/background-agent/
bun test src/features/usage-tracker/
bun test src/tools/lsp/
# View recent commits
git log --oneline -25