Exclude unhealthy backends from vMCP capability aggregation by yrobla · Pull Request #3642 · stacklok/toolhive

yrobla · 2026-02-06T09:10:57Z

Implement backend health filtering to prevent exposing tools from unavailable backends when circuit breakers are open, addressing issue #3036 acceptance criterion.

When backends become unhealthy (circuit breaker open, authentication failures, unknown status), their tools are automatically excluded from the aggregated capabilities returned to clients. When backends recover, tools are automatically restored on the next session initialization.

Changes:

Add filterHealthyBackends() in discovery middleware to exclude unhealthy/unknown/unauthenticated backends from aggregation
Include healthy and degraded backends (operational but slow)
Move circuit breaker and health check validation from commands.go to pkg/vmcp/config/validator.go for better separation of concerns
Add comprehensive unit tests for backend filtering (8 test cases)
Add validation tests for failure handling config (11 test cases)
Update E2E circuit breaker test with capability filtering notes

Related-to: #3036

Implement backend health filtering to prevent exposing tools from unavailable backends when circuit breakers are open, addressing issue #3036 acceptance criterion. When backends become unhealthy (circuit breaker open, authentication failures, unknown status), their tools are automatically excluded from the aggregated capabilities returned to clients. When backends recover, tools are automatically restored on the next session initialization. Changes: - Add filterHealthyBackends() in discovery middleware to exclude unhealthy/unknown/unauthenticated backends from aggregation - Include healthy and degraded backends (operational but slow) - Move circuit breaker and health check validation from commands.go to pkg/vmcp/config/validator.go for better separation of concerns - Add comprehensive unit tests for backend filtering (8 test cases) - Add validation tests for failure handling config (11 test cases) - Update E2E circuit breaker test with capability filtering notes Related-to: #3036

Copilot

Pull request overview

This PR implements backend health filtering for vMCP capability aggregation, ensuring that tools from unhealthy backends are automatically excluded when their circuit breakers open or when they experience authentication failures. The implementation filters backends at session initialization time, and the filtered capabilities are cached for the session lifetime.

Changes:

Add filterHealthyBackends() function to exclude unhealthy/unknown/unauthenticated backends while including healthy and degraded backends in capability aggregation
Move circuit breaker and health check validation logic from commands.go to validator.go for better separation of concerns
Add comprehensive test coverage for both filtering logic (8 test cases) and validation logic (11 test cases)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
pkg/vmcp/discovery/middleware.go	Adds filterHealthyBackends() function that filters backends by health status before capability discovery
pkg/vmcp/discovery/middleware_test.go	Adds TestFilterHealthyBackends with 8 comprehensive test cases; updates existing tests to set HealthStatus field
pkg/vmcp/config/validator.go	Moves validation logic from commands.go and enhances error messages with actual values
pkg/vmcp/config/validator_test.go	Adds TestValidator_ValidateFailureHandling with 11 test cases covering all validation scenarios
cmd/vmcp/app/commands.go	Removes duplicate validation logic now handled by validator.go
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go	Adds documentation notes explaining capability filtering behavior for circuit breaker scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2026-02-06T09:28:47Z

Codecov Report

❌ Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.05%. Comparing base (dad2bb8) to head (fae7c45).
⚠️ Report is 22 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/vmcp/health/monitor.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3642      +/-   ##
==========================================
+ Coverage   65.94%   66.05%   +0.10%     
==========================================
  Files         413      415       +2     
  Lines       40959    41186     +227     
==========================================
+ Hits        27010    27204     +194     
- Misses      11858    11886      +28     
- Partials     2091     2096       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/vmcp/discovery/middleware.go

test/e2e/thv-operator/virtualmcp/virtualmcp_yardstick_base_test.go

JAORMX · 2026-02-06T11:21:56Z

Is aggregation done periodically? What if there's a temporary downtime in one of the servers? will they be excluded?

yrobla · 2026-02-06T15:04:41Z

Is aggregation done periodically? What if there's a temporary downtime in one of the servers? will they be excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached

Once capabilities are discovered at initialization, they're cached. If a backend
goes down after initialization, its tools remain in the capability list but will fail when
called. The health filtering only prevents unhealthy backends from being included at initialization time.

I can change to periodical in a follow-up

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pkg/vmcp/config/validator.go:355

PartialFailureMode validation uses values {"fail", "bestEffort"}, but the config model/CRD docs describe the allowed value as "best_effort" (underscore). As written, a valid CRD/YAML value of "best_effort" will fail validation, and the added tests also appear to be asserting the camelCase form instead of the documented/CRD form. Align validator + tests with the documented/CRD enum values to avoid breaking Kubernetes configs.

	validModes := []string{"fail", "bestEffort"}
	if !contains(validModes, fh.PartialFailureMode) {
		return fmt.Errorf("partialFailureMode must be one of: %s", strings.Join(validModes, ", "))
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/vmcp/discovery/middleware.go

pkg/vmcp/config/validator.go

test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/vmcp/config/validator.go

pkg/vmcp/discovery/middleware.go

jerm-dro · 2026-02-06T15:42:24Z

excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached

Once capabilities are discovered at initialization, they're cached. If a backend goes down after initialization, its tools remain in the capability list but will fail when called. The health filtering only prevents unhealthy backends from being included at initialization time.

I can change to periodical in a follow-up

@yrobla This is correct. However, I don't think we should to do this immediately in a follow-up. This has always been a gap. For example, if the underlying MCPServer capabilities change, the vMCP will not update the capabilities within a session.

Fixing this will require mutating session-scoped state and notifying the MCP client. We can add this functionality today, but I think we'd be better off improving the existing session code first (i.e. this WIP RFC: stacklok/toolhive-rfcs#39). What do you think?

yrobla · 2026-02-06T15:45:17Z

excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached
Once capabilities are discovered at initialization, they're cached. If a backend goes down after initialization, its tools remain in the capability list but will fail when called. The health filtering only prevents unhealthy backends from being included at initialization time.
I can change to periodical in a follow-up

@yrobla This is correct. However, I don't think we should to do this immediately in a follow-up. This has always been a gap. For example, if the underlying MCPServer capabilities change, the vMCP will not update the capabilities within a session.

Fixing this will require mutating session-scoped state and notifying the MCP client. We can add this functionality today, but I think we'd be better off improving the existing session code first (i.e. this WIP RFC: stacklok/toolhive-rfcs#39). What do you think?

I can see some relationship if we check aggregation at session level, instead of doing that at initialize level. This will help. But also, this will not be periodical, is that what you meant relating that to sesions?

jerm-dro · 2026-02-06T15:55:46Z

I can see some relationship if we check aggregation at session level, instead of doing that at initialize level. This will help. But also, this will not be periodical, is that what you meant relating that to sesions?

I think so, but to add a bit more detail I'll revise the domain model I had suggested:

// Session represents an active MCP session with all its capabilities and resources.
// This is a pure domain object - no protocol concerns.
type Session interface {
    ID() string

    // Capabilities - returns discovered tools/resources for this session
    // PREVIOIUSLY, capabilities were static.
    // Tools() []Tool
    // Resources() []Resource
    
    // Capabilities() returns a channel containing the up to date capabilities within the session.
    // The channel is guaranteed to hold the initial set of capabilities on construction.
    // The channel will receive updated capabilities if the underlying MCP capabilities change
    // (e.g. circuit breaker, backend revised capabilities). 
    // The channel capacity is 1 to guarantee updates are consumed in order.
    Capabilities() <-chan Capabilities

    // MCP operations - routing logic is encapsulated here
    CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error)
    ReadResource(ctx context.Context, uri string) (*ResourceResult, error)
    GetPrompt(ctx context.Context, name string, arguments map[string]any) (*PromptResult, error)

    // Lifecycle
    Close() error
}

The Capabilities method would encapsulate the "how do we recompute capabilities." Periodically re-running the same function in a goroutine would be fine. The caller handles the "how do we notify the MCP client of updated capabilities."

yrobla requested review from ChrisJBurns, JAORMX, amirejaz, jerm-dro and jhrozek as code owners February 6, 2026 09:10

github-actions bot added the size/M Medium PR: 300-599 lines changed label Feb 6, 2026

yrobla requested a review from Copilot February 6, 2026 09:16

Copilot started reviewing on behalf of yrobla February 6, 2026 09:16 View session

Copilot AI reviewed Feb 6, 2026

View reviewed changes

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 6, 2026

yrobla requested a review from Copilot February 6, 2026 10:12

Copilot started reviewing on behalf of yrobla February 6, 2026 10:12 View session

Copilot AI reviewed Feb 6, 2026

View reviewed changes

pkg/vmcp/discovery/middleware.go Outdated Show resolved Hide resolved

test/e2e/thv-operator/virtualmcp/virtualmcp_yardstick_base_test.go Outdated Show resolved Hide resolved

reduce tests to avoid timeout

e225b73

yrobla force-pushed the issue-3036-v1-fix branch from 2a4a2a0 to e225b73 Compare February 6, 2026 10:18

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026

yrobla requested a review from Copilot February 6, 2026 15:05

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026

Copilot started reviewing on behalf of yrobla February 6, 2026 15:05 View session

Copilot AI reviewed Feb 6, 2026

View reviewed changes

yrobla force-pushed the issue-3036-v1-fix branch from 7c92d54 to b1533af Compare February 6, 2026 15:27

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026

yrobla requested a review from Copilot February 6, 2026 15:29

Copilot started reviewing on behalf of yrobla February 6, 2026 15:30 View session

Copilot AI reviewed Feb 6, 2026

View reviewed changes

pkg/vmcp/config/validator.go Show resolved Hide resolved

pkg/vmcp/discovery/middleware.go Outdated Show resolved Hide resolved

changes from review

fae7c45

yrobla force-pushed the issue-3036-v1-fix branch from b1533af to fae7c45 Compare February 6, 2026 15:35

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026

Conversation

yrobla commented Feb 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov bot commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

JAORMX commented Feb 6, 2026

Uh oh!

yrobla commented Feb 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

jerm-dro commented Feb 6, 2026

Uh oh!

yrobla commented Feb 6, 2026

Uh oh!

jerm-dro commented Feb 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Feb 6, 2026 •

edited

Loading