Skip to content

Exclude unhealthy backends from vMCP capability aggregation#3642

Open
yrobla wants to merge 3 commits intomainfrom
issue-3036-v1-fix
Open

Exclude unhealthy backends from vMCP capability aggregation#3642
yrobla wants to merge 3 commits intomainfrom
issue-3036-v1-fix

Conversation

@yrobla
Copy link
Contributor

@yrobla yrobla commented Feb 6, 2026

Implement backend health filtering to prevent exposing tools from unavailable backends when circuit breakers are open, addressing issue #3036 acceptance criterion.

When backends become unhealthy (circuit breaker open, authentication failures, unknown status), their tools are automatically excluded from the aggregated capabilities returned to clients. When backends recover, tools are automatically restored on the next session initialization.

Changes:

  • Add filterHealthyBackends() in discovery middleware to exclude unhealthy/unknown/unauthenticated backends from aggregation
  • Include healthy and degraded backends (operational but slow)
  • Move circuit breaker and health check validation from commands.go to pkg/vmcp/config/validator.go for better separation of concerns
  • Add comprehensive unit tests for backend filtering (8 test cases)
  • Add validation tests for failure handling config (11 test cases)
  • Update E2E circuit breaker test with capability filtering notes

Related-to: #3036

Implement backend health filtering to prevent exposing tools from
unavailable backends when circuit breakers are open, addressing
issue #3036 acceptance criterion.

When backends become unhealthy (circuit breaker open, authentication
failures, unknown status), their tools are automatically excluded from
the aggregated capabilities returned to clients. When backends recover,
tools are automatically restored on the next session initialization.

Changes:
  - Add filterHealthyBackends() in discovery middleware to exclude
    unhealthy/unknown/unauthenticated backends from aggregation
  - Include healthy and degraded backends (operational but slow)
  - Move circuit breaker and health check validation from commands.go
    to pkg/vmcp/config/validator.go for better separation of concerns
  - Add comprehensive unit tests for backend filtering (8 test cases)
  - Add validation tests for failure handling config (11 test cases)
  - Update E2E circuit breaker test with capability filtering notes

Related-to: #3036
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Feb 6, 2026
@yrobla yrobla requested a review from Copilot February 6, 2026 09:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements backend health filtering for vMCP capability aggregation, ensuring that tools from unhealthy backends are automatically excluded when their circuit breakers open or when they experience authentication failures. The implementation filters backends at session initialization time, and the filtered capabilities are cached for the session lifetime.

Changes:

  • Add filterHealthyBackends() function to exclude unhealthy/unknown/unauthenticated backends while including healthy and degraded backends in capability aggregation
  • Move circuit breaker and health check validation logic from commands.go to validator.go for better separation of concerns
  • Add comprehensive test coverage for both filtering logic (8 test cases) and validation logic (11 test cases)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/vmcp/discovery/middleware.go Adds filterHealthyBackends() function that filters backends by health status before capability discovery
pkg/vmcp/discovery/middleware_test.go Adds TestFilterHealthyBackends with 8 comprehensive test cases; updates existing tests to set HealthStatus field
pkg/vmcp/config/validator.go Moves validation logic from commands.go and enhances error messages with actual values
pkg/vmcp/config/validator_test.go Adds TestValidator_ValidateFailureHandling with 11 test cases covering all validation scenarios
cmd/vmcp/app/commands.go Removes duplicate validation logic now handled by validator.go
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go Adds documentation notes explaining capability filtering behavior for circuit breaker scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.05%. Comparing base (dad2bb8) to head (fae7c45).
⚠️ Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/health/monitor.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3642      +/-   ##
==========================================
+ Coverage   65.94%   66.05%   +0.10%     
==========================================
  Files         413      415       +2     
  Lines       40959    41186     +227     
==========================================
+ Hits        27010    27204     +194     
- Misses      11858    11886      +28     
- Partials     2091     2096       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Feb 6, 2026
@yrobla yrobla requested a review from Copilot February 6, 2026 10:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-fix branch from 2a4a2a0 to e225b73 Compare February 6, 2026 10:18
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026
@JAORMX
Copy link
Collaborator

JAORMX commented Feb 6, 2026

Is aggregation done periodically? What if there's a temporary downtime in one of the servers? will they be excluded?

@yrobla
Copy link
Contributor Author

yrobla commented Feb 6, 2026

Is aggregation done periodically? What if there's a temporary downtime in one of the servers? will they be excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached

Once capabilities are discovered at initialization, they're cached. If a backend
goes down after initialization, its tools remain in the capability list but will fail when
called. The health filtering only prevents unhealthy backends from being included at initialization time.

I can change to periodical in a follow-up

@yrobla yrobla requested a review from Copilot February 6, 2026 15:05
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

pkg/vmcp/config/validator.go:355

  • PartialFailureMode validation uses values {"fail", "bestEffort"}, but the config model/CRD docs describe the allowed value as "best_effort" (underscore). As written, a valid CRD/YAML value of "best_effort" will fail validation, and the added tests also appear to be asserting the camelCase form instead of the documented/CRD form. Align validator + tests with the documented/CRD enum values to avoid breaking Kubernetes configs.
	validModes := []string{"fail", "bestEffort"}
	if !contains(validModes, fh.PartialFailureMode) {
		return fmt.Errorf("partialFailureMode must be one of: %s", strings.Join(validModes, ", "))
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-fix branch from 7c92d54 to b1533af Compare February 6, 2026 15:27
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the issue-3036-v1-fix branch from b1533af to fae7c45 Compare February 6, 2026 15:35
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Feb 6, 2026
@jerm-dro
Copy link
Contributor

jerm-dro commented Feb 6, 2026

excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached

Once capabilities are discovered at initialization, they're cached. If a backend goes down after initialization, its tools remain in the capability list but will fail when called. The health filtering only prevents unhealthy backends from being included at initialization time.

I can change to periodical in a follow-up

@yrobla This is correct. However, I don't think we should to do this immediately in a follow-up. This has always been a gap. For example, if the underlying MCPServer capabilities change, the vMCP will not update the capabilities within a session.

Fixing this will require mutating session-scoped state and notifying the MCP client. We can add this functionality today, but I think we'd be better off improving the existing session code first (i.e. this WIP RFC: stacklok/toolhive-rfcs#39). What do you think?

@yrobla
Copy link
Contributor Author

yrobla commented Feb 6, 2026

excluded?

No, aggregation happens at MCP initialization, not periodically. Capability aggregation happens during MCP protocol initialization (initialize request), then results are cached
Once capabilities are discovered at initialization, they're cached. If a backend goes down after initialization, its tools remain in the capability list but will fail when called. The health filtering only prevents unhealthy backends from being included at initialization time.
I can change to periodical in a follow-up

@yrobla This is correct. However, I don't think we should to do this immediately in a follow-up. This has always been a gap. For example, if the underlying MCPServer capabilities change, the vMCP will not update the capabilities within a session.

Fixing this will require mutating session-scoped state and notifying the MCP client. We can add this functionality today, but I think we'd be better off improving the existing session code first (i.e. this WIP RFC: stacklok/toolhive-rfcs#39). What do you think?

I can see some relationship if we check aggregation at session level, instead of doing that at initialize level. This will help. But also, this will not be periodical, is that what you meant relating that to sesions?

@jerm-dro
Copy link
Contributor

jerm-dro commented Feb 6, 2026

I can see some relationship if we check aggregation at session level, instead of doing that at initialize level. This will help. But also, this will not be periodical, is that what you meant relating that to sesions?

I think so, but to add a bit more detail I'll revise the domain model I had suggested:

// Session represents an active MCP session with all its capabilities and resources.
// This is a pure domain object - no protocol concerns.
type Session interface {
    ID() string

    // Capabilities - returns discovered tools/resources for this session
    // PREVIOIUSLY, capabilities were static.
    // Tools() []Tool
    // Resources() []Resource
    
    // Capabilities() returns a channel containing the up to date capabilities within the session.
    // The channel is guaranteed to hold the initial set of capabilities on construction.
    // The channel will receive updated capabilities if the underlying MCP capabilities change
    // (e.g. circuit breaker, backend revised capabilities). 
    // The channel capacity is 1 to guarantee updates are consumed in order.
    Capabilities() <-chan Capabilities

    // MCP operations - routing logic is encapsulated here
    CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error)
    ReadResource(ctx context.Context, uri string) (*ResourceResult, error)
    GetPrompt(ctx context.Context, name string, arguments map[string]any) (*PromptResult, error)

    // Lifecycle
    Close() error
}

The Capabilities method would encapsulate the "how do we recompute capabilities." Periodically re-running the same function in a goroutine would be fine. The caller handles the "how do we notify the MCP client of updated capabilities."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants