fix(upstream): use parent context for shutdown detection in Docker recovery #126

Dumbris · 2025-11-03T20:06:41Z

Summary

Fixed a race condition where Docker recovery retry logging occurred after test completion despite having a context check from PR #124. The issue was using the child recoveryCtx instead of parent ctx for shutdown detection.

Root Cause

PR #124 added a context check before logging, but it checked the wrong context:

// PR #124 added this:
select {
case <-recoveryCtx.Done():  // ❌ Wrong context!
    return
default:
}
m.logger.Info("Docker recovery retry", ...)

The problem:

ctx = Parent context (cancelled when test/manager shuts down)
recoveryCtx = Child context created with context.WithCancel(ctx)
Child context cancellation may lag behind parent by nanoseconds
Race window: Check passes → Context cancelled → Log executes

On CI machines with scheduling delays, this tiny race window is hit more frequently.

Changes

File: internal/upstream/manager.go

Use parent context for shutdown detection (line 1905):

// Before (PR #124):
select {
case <-recoveryCtx.Done():  // ❌ Child context
    return
default:
}

// After (this PR):
select {
case <-ctx.Done():  // ✅ Parent context
    return
default:
}

Why This Works

Parent ctx is cancelled immediately when shutdown starts
No propagation delay to child context
Eliminates race window between check and log execution
Test shutdown is instantly visible to the check

Test Results

✅ Tests pass locally with the fix:

go test -v -race -run TestUpdateListenAddressValidation ./internal/runtime/...

Fixes

Main branch failure:

Unit Tests (macos-latest, 1.23.10)
https://github.com/smart-mcp-proxy/mcpproxy-go/actions/runs/19047644560

Error:

panic: Log in goroutine after TestUpdateListenAddressValidation has completed: 
2025-11-03T20:02:22.176Z	INFO	Docker recovery retry	
{"attempt": 1, "elapsed": "2.002340625s", "next_retry_in": "5s"}

mcpproxy-go/internal/upstream.(*Manager).handleDockerUnavailable
	internal/upstream/manager.go:1910 +0x940

Related PRs

PR fix(upstream): prevent logging after context cancellation in Docker recovery retry #124: Added context check (used wrong context)
PR fix(upstream): use parent context for shutdown detection in Docker recovery #126: Fixes context check to use parent context (this PR)

🤖 Generated with Claude Code

…covery Fixed a race condition where Docker recovery retry logging occurred after test completion despite having a context check. The issue was using the child `recoveryCtx` instead of parent `ctx` for shutdown detection. Root Cause: - Child context (`recoveryCtx`) cancellation may lag behind parent (`ctx`) - Test cleanup cancels parent context immediately - Tiny window between context check and log execution - On CI machines, this race is more likely due to scheduling delays Changes: - Check parent `ctx.Done()` instead of `recoveryCtx.Done()` (line 1905) - Parent context reflects shutdown state more immediately - Eliminates race window between check and log execution This completes the fix from PR #124 by using the correct context for shutdown detection. Fixes: - Unit Tests (macos-latest, 1.23.10) on main branch - Race condition: "Log in goroutine after test has completed" Error: ``` panic: Log in goroutine after TestUpdateListenAddressValidation has completed: 2025-11-03T20:02:22.176Z INFO Docker recovery retry {"attempt": 1, "elapsed": "2.002340625s", "next_retry_in": "5s"} mcpproxy-go/internal/upstream.(*Manager).handleDockerUnavailable internal/upstream/manager.go:1910 ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2025-11-03T20:11:36Z

📦 Build Artifacts for PR #126

Version: v0.9.24-dev-ea68fa7-pr126
Latest tag: v0.9.24

Available Artifacts

frontend-dist-pr (177 KB)
archive-linux-amd64 (9.7 MB)
archive-linux-arm64 (8.8 MB)
archive-darwin-arm64 (16 MB)
installer-dmg-darwin-arm64 (18 MB)
archive-windows-arm64 (16 MB)
archive-windows-amd64 (18 MB)
archive-darwin-amd64 (18 MB)
installer-dmg-darwin-amd64 (20 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 19047803960 --repo smart-mcp-proxy/mcpproxy-go

# Or download a specific artifact:
gh run download 19047803960 --name dmg-darwin-arm64

Note: Artifacts expire in 14 days.

Complete the fix from PR #126 by using parent context consistently throughout handleDockerUnavailable. This eliminates the race condition where child context cancellation lags parent by nanoseconds. Changes: - Line 1841: Changed recoveryCtx.Done() -> ctx.Done() in main select - Line 1846: Changed recoveryCtx.Done() -> ctx.Done() after timer fires This ensures immediate shutdown detection when tests complete, preventing "Log in goroutine after test has completed" panics. Fixes the persistent test failure in TestUpdateListenAddressValidation on macOS. Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…ecks Replace select-default pattern with ctx.Err() checks to eliminate TOCTOU race condition in Docker recovery monitoring goroutines. The select-default pattern creates a race window where context can be cancelled AFTER the check passes but BEFORE logging happens, causing "Log in goroutine after test has completed" panics. Using ctx.Err() provides atomic context cancellation checking without the race condition. Changes: - handleDockerUnavailable: Use ctx.Err() for all context checks before logging - startDockerRecoveryMonitor: Use ctx.Err() in initial check and periodic monitoring - Added ctx.Err() check before "Docker recovery successful" logging Fixes the persistent test failure in TestUpdateListenAddressValidation that survived PR #126 and PR #127. Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…utdown ## Problem After PR #129, tests were still failing with "panic: Log in goroutine after test completed" on macOS. The `ctx.Err()` checks prevented the race between checking context and logging, but goroutines could still be running when tests completed because there was no synchronization mechanism. ## Root Cause The Manager struct had `shutdownCtx` and `shutdownCancel` for signaling shutdown, but **no `sync.WaitGroup` to track background goroutines**. When `ShutdownAll()` or `DisconnectAll()` was called: 1. Context was cancelled 2. Function returned immediately 3. Background goroutines (OAuth monitor, Docker recovery) were still running 4. Test completed and marked `testing.T` as done 5. Goroutines tried to log → panic ## Solution Added proper goroutine lifecycle management: 1. **Manager.shutdownWg** field - Tracks all background goroutines 2. **Goroutine launches** - Call `shutdownWg.Add(1)` before spawning: - `startOAuthEventMonitor` (in NewManager) - `startDockerRecoveryMonitor` (in NewManager and dynamic restart) 3. **Goroutine exits** - Call `defer shutdownWg.Done()` in both monitor functions 4. **Shutdown waiting** - Call `shutdownWg.Wait()` in: - `ShutdownAll()` - with 3-second timeout - `DisconnectAll()` - blocking wait ## Changes - `Manager` struct: Added `shutdownWg sync.WaitGroup` - `NewManager()`: Added `Add(1)` before launching monitors - `startOAuthEventMonitor()`: Added `defer Done()` at start - `startDockerRecoveryMonitor()`: Added `defer Done()` at start - Dynamic recovery restart: Added `Add(1)` before `go startDockerRecoveryMonitor()` - `ShutdownAll()`: Cancel context, wait for goroutines with timeout - `DisconnectAll()`: Cancel context, wait for goroutines ## Testing ```bash go test ./internal/runtime -run TestUpdateListenAddressValidation -v -race -count=5 # Result: 5/5 passed, no race conditions go test ./internal/upstream ./internal/runtime -v -race -count=2 # Result: All tests passed ``` ## Related - Fixes persistent issue from PR #122, #126, #127, #129 - Complements the `ctx.Err()` checks added in PR #129 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Dumbris merged commit fd416c1 into main Nov 4, 2025
20 checks passed

Dumbris mentioned this pull request Nov 4, 2025

fix(upstream): use parent context in Docker recovery timer checks #127

Merged

Dumbris mentioned this pull request Nov 4, 2025

fix(upstream): use ctx.Err() instead of select-default for context checks #129

Merged

Dumbris mentioned this pull request Nov 4, 2025

fix(upstream): add WaitGroup to track background goroutines during shutdown #130

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(upstream): use parent context for shutdown detection in Docker recovery #126

fix(upstream): use parent context for shutdown detection in Docker recovery #126

Uh oh!

Dumbris commented Nov 3, 2025

Uh oh!

github-actions bot commented Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(upstream): use parent context for shutdown detection in Docker recovery #126

fix(upstream): use parent context for shutdown detection in Docker recovery #126

Uh oh!

Conversation

Dumbris commented Nov 3, 2025

Summary

Root Cause

Changes

Why This Works

Test Results

Fixes

Related PRs

Uh oh!

github-actions bot commented Nov 3, 2025

📦 Build Artifacts for PR #126

Available Artifacts

How to Download

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants