Skip to content

Conversation

@Dumbris
Copy link
Contributor

@Dumbris Dumbris commented Nov 3, 2025

Summary

Fixed a race condition where Docker recovery retry logging occurred after test completion despite having a context check from PR #124. The issue was using the child recoveryCtx instead of parent ctx for shutdown detection.

Root Cause

PR #124 added a context check before logging, but it checked the wrong context:

// PR #124 added this:
select {
case <-recoveryCtx.Done():  // ❌ Wrong context!
    return
default:
}
m.logger.Info("Docker recovery retry", ...)

The problem:

  1. ctx = Parent context (cancelled when test/manager shuts down)
  2. recoveryCtx = Child context created with context.WithCancel(ctx)
  3. Child context cancellation may lag behind parent by nanoseconds
  4. Race window: Check passes → Context cancelled → Log executes

On CI machines with scheduling delays, this tiny race window is hit more frequently.

Changes

File: internal/upstream/manager.go

Use parent context for shutdown detection (line 1905):

// Before (PR #124):
select {
case <-recoveryCtx.Done():  // ❌ Child context
    return
default:
}

// After (this PR):
select {
case <-ctx.Done():  // ✅ Parent context
    return
default:
}

Why This Works

  • Parent ctx is cancelled immediately when shutdown starts
  • No propagation delay to child context
  • Eliminates race window between check and log execution
  • Test shutdown is instantly visible to the check

Test Results

✅ Tests pass locally with the fix:

go test -v -race -run TestUpdateListenAddressValidation ./internal/runtime/...

Fixes

Main branch failure:

Error:

panic: Log in goroutine after TestUpdateListenAddressValidation has completed: 
2025-11-03T20:02:22.176Z	INFO	Docker recovery retry	
{"attempt": 1, "elapsed": "2.002340625s", "next_retry_in": "5s"}

mcpproxy-go/internal/upstream.(*Manager).handleDockerUnavailable
	internal/upstream/manager.go:1910 +0x940

Related PRs

🤖 Generated with Claude Code

…covery

Fixed a race condition where Docker recovery retry logging occurred after
test completion despite having a context check. The issue was using the
child `recoveryCtx` instead of parent `ctx` for shutdown detection.

Root Cause:
- Child context (`recoveryCtx`) cancellation may lag behind parent (`ctx`)
- Test cleanup cancels parent context immediately
- Tiny window between context check and log execution
- On CI machines, this race is more likely due to scheduling delays

Changes:
- Check parent `ctx.Done()` instead of `recoveryCtx.Done()` (line 1905)
- Parent context reflects shutdown state more immediately
- Eliminates race window between check and log execution

This completes the fix from PR #124 by using the correct context for
shutdown detection.

Fixes:
- Unit Tests (macos-latest, 1.23.10) on main branch
- Race condition: "Log in goroutine after test has completed"

Error:
```
panic: Log in goroutine after TestUpdateListenAddressValidation has completed:
2025-11-03T20:02:22.176Z	INFO	Docker recovery retry
{"attempt": 1, "elapsed": "2.002340625s", "next_retry_in": "5s"}

mcpproxy-go/internal/upstream.(*Manager).handleDockerUnavailable
	internal/upstream/manager.go:1910
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Nov 3, 2025

📦 Build Artifacts for PR #126

Version: v0.9.24-dev-ea68fa7-pr126
Latest tag: v0.9.24

Available Artifacts

  • frontend-dist-pr (177 KB)
  • archive-linux-amd64 (9.7 MB)
  • archive-linux-arm64 (8.8 MB)
  • archive-darwin-arm64 (16 MB)
  • installer-dmg-darwin-arm64 (18 MB)
  • archive-windows-arm64 (16 MB)
  • archive-windows-amd64 (18 MB)
  • archive-darwin-amd64 (18 MB)
  • installer-dmg-darwin-amd64 (20 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 19047803960 --repo smart-mcp-proxy/mcpproxy-go

# Or download a specific artifact:
gh run download 19047803960 --name dmg-darwin-arm64

Note: Artifacts expire in 14 days.

@Dumbris Dumbris merged commit fd416c1 into main Nov 4, 2025
20 checks passed
Dumbris added a commit that referenced this pull request Nov 4, 2025
Complete the fix from PR #126 by using parent context consistently
throughout handleDockerUnavailable. This eliminates the race condition
where child context cancellation lags parent by nanoseconds.

Changes:
- Line 1841: Changed recoveryCtx.Done() -> ctx.Done() in main select
- Line 1846: Changed recoveryCtx.Done() -> ctx.Done() after timer fires

This ensures immediate shutdown detection when tests complete, preventing
"Log in goroutine after test has completed" panics.

Fixes the persistent test failure in TestUpdateListenAddressValidation
on macOS.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Dumbris added a commit that referenced this pull request Nov 4, 2025
…ecks

Replace select-default pattern with ctx.Err() checks to eliminate TOCTOU
race condition in Docker recovery monitoring goroutines.

The select-default pattern creates a race window where context can be
cancelled AFTER the check passes but BEFORE logging happens, causing
"Log in goroutine after test has completed" panics.

Using ctx.Err() provides atomic context cancellation checking without
the race condition.

Changes:
- handleDockerUnavailable: Use ctx.Err() for all context checks before logging
- startDockerRecoveryMonitor: Use ctx.Err() in initial check and periodic monitoring
- Added ctx.Err() check before "Docker recovery successful" logging

Fixes the persistent test failure in TestUpdateListenAddressValidation
that survived PR #126 and PR #127.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Dumbris added a commit that referenced this pull request Nov 4, 2025
…utdown

## Problem
After PR #129, tests were still failing with "panic: Log in goroutine after
test completed" on macOS. The `ctx.Err()` checks prevented the race between
checking context and logging, but goroutines could still be running when tests
completed because there was no synchronization mechanism.

## Root Cause
The Manager struct had `shutdownCtx` and `shutdownCancel` for signaling
shutdown, but **no `sync.WaitGroup` to track background goroutines**. When
`ShutdownAll()` or `DisconnectAll()` was called:
1. Context was cancelled
2. Function returned immediately
3. Background goroutines (OAuth monitor, Docker recovery) were still running
4. Test completed and marked `testing.T` as done
5. Goroutines tried to log → panic

## Solution
Added proper goroutine lifecycle management:

1. **Manager.shutdownWg** field - Tracks all background goroutines
2. **Goroutine launches** - Call `shutdownWg.Add(1)` before spawning:
   - `startOAuthEventMonitor` (in NewManager)
   - `startDockerRecoveryMonitor` (in NewManager and dynamic restart)
3. **Goroutine exits** - Call `defer shutdownWg.Done()` in both monitor functions
4. **Shutdown waiting** - Call `shutdownWg.Wait()` in:
   - `ShutdownAll()` - with 3-second timeout
   - `DisconnectAll()` - blocking wait

## Changes
- `Manager` struct: Added `shutdownWg sync.WaitGroup`
- `NewManager()`: Added `Add(1)` before launching monitors
- `startOAuthEventMonitor()`: Added `defer Done()` at start
- `startDockerRecoveryMonitor()`: Added `defer Done()` at start
- Dynamic recovery restart: Added `Add(1)` before `go startDockerRecoveryMonitor()`
- `ShutdownAll()`: Cancel context, wait for goroutines with timeout
- `DisconnectAll()`: Cancel context, wait for goroutines

## Testing
```bash
go test ./internal/runtime -run TestUpdateListenAddressValidation -v -race -count=5
# Result: 5/5 passed, no race conditions

go test ./internal/upstream ./internal/runtime -v -race -count=2
# Result: All tests passed
```

## Related
- Fixes persistent issue from PR #122, #126, #127, #129
- Complements the `ctx.Err()` checks added in PR #129

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants