Exclude unhealthy backends from vMCP capability aggregation#3642
Exclude unhealthy backends from vMCP capability aggregation#3642
Conversation
Implement backend health filtering to prevent exposing tools from unavailable backends when circuit breakers are open, addressing issue #3036 acceptance criterion. When backends become unhealthy (circuit breaker open, authentication failures, unknown status), their tools are automatically excluded from the aggregated capabilities returned to clients. When backends recover, tools are automatically restored on the next session initialization. Changes: - Add filterHealthyBackends() in discovery middleware to exclude unhealthy/unknown/unauthenticated backends from aggregation - Include healthy and degraded backends (operational but slow) - Move circuit breaker and health check validation from commands.go to pkg/vmcp/config/validator.go for better separation of concerns - Add comprehensive unit tests for backend filtering (8 test cases) - Add validation tests for failure handling config (11 test cases) - Update E2E circuit breaker test with capability filtering notes Related-to: #3036
There was a problem hiding this comment.
Pull request overview
This PR implements backend health filtering for vMCP capability aggregation, ensuring that tools from unhealthy backends are automatically excluded when their circuit breakers open or when they experience authentication failures. The implementation filters backends at session initialization time, and the filtered capabilities are cached for the session lifetime.
Changes:
- Add filterHealthyBackends() function to exclude unhealthy/unknown/unauthenticated backends while including healthy and degraded backends in capability aggregation
- Move circuit breaker and health check validation logic from commands.go to validator.go for better separation of concerns
- Add comprehensive test coverage for both filtering logic (8 test cases) and validation logic (11 test cases)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/discovery/middleware.go | Adds filterHealthyBackends() function that filters backends by health status before capability discovery |
| pkg/vmcp/discovery/middleware_test.go | Adds TestFilterHealthyBackends with 8 comprehensive test cases; updates existing tests to set HealthStatus field |
| pkg/vmcp/config/validator.go | Moves validation logic from commands.go and enhances error messages with actual values |
| pkg/vmcp/config/validator_test.go | Adds TestValidator_ValidateFailureHandling with 11 test cases covering all validation scenarios |
| cmd/vmcp/app/commands.go | Removes duplicate validation logic now handled by validator.go |
| test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go | Adds documentation notes explaining capability filtering behavior for circuit breaker scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3642 +/- ##
==========================================
+ Coverage 65.94% 66.05% +0.10%
==========================================
Files 413 415 +2
Lines 40959 41186 +227
==========================================
+ Hits 27010 27204 +194
- Misses 11858 11886 +28
- Partials 2091 2096 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
test/e2e/thv-operator/virtualmcp/virtualmcp_yardstick_base_test.go
Outdated
Show resolved
Hide resolved
2a4a2a0 to
e225b73
Compare
|
Is aggregation done periodically? What if there's a temporary downtime in one of the servers? will they be excluded? |
No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached Once capabilities are discovered at initialization, they're cached. If a backend I can change to periodical in a follow-up |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
pkg/vmcp/config/validator.go:355
- PartialFailureMode validation uses values {"fail", "bestEffort"}, but the config model/CRD docs describe the allowed value as "best_effort" (underscore). As written, a valid CRD/YAML value of "best_effort" will fail validation, and the added tests also appear to be asserting the camelCase form instead of the documented/CRD form. Align validator + tests with the documented/CRD enum values to avoid breaking Kubernetes configs.
validModes := []string{"fail", "bestEffort"}
if !contains(validModes, fh.PartialFailureMode) {
return fmt.Errorf("partialFailureMode must be one of: %s", strings.Join(validModes, ", "))
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go
Outdated
Show resolved
Hide resolved
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go
Outdated
Show resolved
Hide resolved
7c92d54 to
b1533af
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b1533af to
fae7c45
Compare
@yrobla This is correct. However, I don't think we should to do this immediately in a follow-up. This has always been a gap. For example, if the underlying MCPServer capabilities change, the vMCP will not update the capabilities within a session. Fixing this will require mutating session-scoped state and notifying the MCP client. We can add this functionality today, but I think we'd be better off improving the existing session code first (i.e. this WIP RFC: stacklok/toolhive-rfcs#39). What do you think? |
I can see some relationship if we check aggregation at session level, instead of doing that at initialize level. This will help. But also, this will not be periodical, is that what you meant relating that to sesions? |
I think so, but to add a bit more detail I'll revise the domain model I had suggested: // Session represents an active MCP session with all its capabilities and resources.
// This is a pure domain object - no protocol concerns.
type Session interface {
ID() string
// Capabilities - returns discovered tools/resources for this session
// PREVIOIUSLY, capabilities were static.
// Tools() []Tool
// Resources() []Resource
// Capabilities() returns a channel containing the up to date capabilities within the session.
// The channel is guaranteed to hold the initial set of capabilities on construction.
// The channel will receive updated capabilities if the underlying MCP capabilities change
// (e.g. circuit breaker, backend revised capabilities).
// The channel capacity is 1 to guarantee updates are consumed in order.
Capabilities() <-chan Capabilities
// MCP operations - routing logic is encapsulated here
CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error)
ReadResource(ctx context.Context, uri string) (*ResourceResult, error)
GetPrompt(ctx context.Context, name string, arguments map[string]any) (*PromptResult, error)
// Lifecycle
Close() error
}The |
Implement backend health filtering to prevent exposing tools from unavailable backends when circuit breakers are open, addressing issue #3036 acceptance criterion.
When backends become unhealthy (circuit breaker open, authentication failures, unknown status), their tools are automatically excluded from the aggregated capabilities returned to clients. When backends recover, tools are automatically restored on the next session initialization.
Changes:
Related-to: #3036