fix(#3448) - vMCP operator deterministically orders servers #3450

Merged
jerm-dro merged 2 commits into main from jerm/2026-01-26-fix-deployment-loop
Jan 26, 2026

Conversation

@jerm-dro jerm-dro commented Jan 26, 2026

Summary

Fixes #3448

A subtle bug introduced in #3235 caused the vMCP reconciler to update the associated deployment repeatedly, because the ordering of config.Backends was non-deterministic. This PR makes the backend ordering deterministic, so the "should the deployment be updated?" check no longer produces false positives.

Details

When a group contains more than one MCPServer, the order of the backends is no longer deterministic, so the config hash changes between reconciles even though the set of backends is the same:

Reconcile 7906957f: ["oci-registry","osv","context7","fetch","mcp-optimizer"] → checksum b6e5785ea232978c...
Reconcile 8db5f52f: ["fetch","mcp-optimizer","oci-registry","osv","context7"] → checksum 10974f84647d70fc...
Reconcile ed995171: ["mcp-optimizer","oci-registry","osv","context7","fetch"] → checksum 178c41a468858553...
Reconcile 03212735: ["oci-registry","osv","context7","fetch","mcp-optimizer"] → checksum b6e5785ea232978c...
Reconcile 8b93d4d6: ["osv","context7","fetch","mcp-optimizer","oci-registry"] → checksum 5d16a52b6ce144a1...

This PR teaches the Discoverer to return backends in a consistent order by explicitly sorting the results. The config hash is now identical for the same set of backends, so the redeploy loop is no longer triggered.
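To illustrate why sorting stabilizes the hash, here is a minimal sketch. It is not the operator's code: the `Backend` struct, the sort key (name), and the SHA-256-over-JSON checksum are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Backend is a hypothetical stand-in for an entry in config.Backends.
type Backend struct {
	Name string `json:"name"`
}

// sortBackends returns a copy of the input ordered by name, leaving the
// caller's slice untouched. Sorting by name is an assumption.
func sortBackends(in []Backend) []Backend {
	out := make([]Backend, len(in))
	copy(out, in)
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// checksum hashes the JSON encoding of the backends. JSON arrays preserve
// element order, so the result depends on the order of the slice.
func checksum(backends []Backend) string {
	data, err := json.Marshal(backends)
	if err != nil {
		panic(err) // not expected for this simple struct
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same set of backends, discovered in two different orders.
	a := []Backend{{Name: "osv"}, {Name: "fetch"}, {Name: "context7"}}
	b := []Backend{{Name: "fetch"}, {Name: "context7"}, {Name: "osv"}}

	fmt.Println("unsorted checksums equal:", checksum(a) == checksum(b))
	fmt.Println("sorted checksums equal:  ", checksum(sortBackends(a)) == checksum(sortBackends(b)))
}
```

In this sketch the unsorted inputs hash differently, while the sorted copies hash identically, which is the property the reconciler needs.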

Before #3235, this was not a problem, because config.Backends was not populated; the backends were instead gathered at runtime.

Testing

A unit test was added to verify the new Discoverer behavior.
Manual testing validates the demo environment deploys successfully again without looping:
[Image: Screenshot 2026-01-26 at 11.56.08 AM]

Additional Context

I explored implementing this in a few places, but ultimately landed on the Discoverer. This isn't my ideal location, but it is easy to unit test and is where the non-determinism originates.

Ideally, our checksum calculation would be more resilient to differences in ordering, so a similar issue doesn't happen again. Upon investigation, the checksum already uses some caution around maps:
https://github.com/stacklok/toolhive/blob/jerm/2026-01-26-fix-deployment-loop/cmd/thv-operator/controllers/virtualmcpserver_vmcpconfig.go#L89-L92

Moving forward, I think we should use a map when a collection is truly unordered. We cannot make the same assumption about arrays, because reordering an array may change its meaning.
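The map-versus-array distinction can be seen with Go's `encoding/json`, which sorts map keys when marshaling but preserves slice order. This is only a sketch of the general behavior; whether the operator's checksum is computed over JSON is an assumption, and the `encode` helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode returns the JSON encoding of v as a string.
func encode(v any) string {
	data, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	// Maps: insertion order is irrelevant because json.Marshal sorts the keys.
	m1 := map[string]string{"osv": "a", "fetch": "b"}
	m2 := map[string]string{"fetch": "b", "osv": "a"}
	fmt.Println("maps encode equally:  ", encode(m1) == encode(m2))

	// Slices: order is part of the value and is preserved in the encoding.
	s1 := []string{"osv", "fetch"}
	s2 := []string{"fetch", "osv"}
	fmt.Println("slices encode equally:", encode(s1) == encode(s2))
}
```

A checksum over the map encoding is therefore stable by construction, while a checksum over a slice is stable only if the producer fixes the order.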

Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Jan 26, 2026
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Jan 26, 2026
@jerm-dro jerm-dro marked this pull request as ready for review January 26, 2026 21:20
codecov bot commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.90%. Comparing base (826c49b) to head (f23b26b).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3450      +/-   ##
==========================================
+ Coverage   64.83%   64.90%   +0.07%     
==========================================
  Files         391      391              
  Lines       38292    38296       +4     
==========================================
+ Hits        24826    24857      +31     
+ Misses      11524    11496      -28     
- Partials     1942     1943       +1     

@jerm-dro jerm-dro merged commit 4cf9535 into main Jan 26, 2026
35 checks passed
@jerm-dro jerm-dro deleted the jerm/2026-01-26-fix-deployment-loop branch January 26, 2026 21:57
therealnb pushed a commit that referenced this pull request Jan 27, 2026
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
therealnb pushed a commit that referenced this pull request Jan 27, 2026
Moving the singleflight deduplication logic to a separate PR
as it addresses a different race condition from the one fixed in #3450.

The fix prevents duplicate capability aggregation when multiple
concurrent requests arrive simultaneously at startup.
therealnb pushed a commit that referenced this pull request Jan 28, 2026
* Infrastructure improvements and bugfixes for vMCP

- Add OpenTelemetry tracing to capability aggregation
- Add singleflight deduplication for discovery requests
- Add health checker self-check prevention
- Add HTTP client timeout fixes
- Improve E2E test reliability
- Various build and infrastructure improvements

* fix: Update CallTool and GetPrompt signatures to match BackendClient interface

- Add conversion import for meta field handling
- Update CallTool to accept meta parameter and return *vmcp.ToolCallResult
- Update GetPrompt to return *vmcp.PromptGetResult
- Add convertContent helper function

* fix: Update ReadResource signature to match BackendClient interface

- Update ReadResource to return *vmcp.ResourceReadResult instead of []byte
- Extract and include meta field from backend response
- Include MIME type in result

* fix: Pass selfURL parameter to health.NewMonitor

- Construct selfURL from Host, Port, and EndpointPath
- Prevents health checker from checking itself

* Fix NewHealthChecker calls in checker_test.go to include selfURL parameter

* Fix NewMonitor calls in monitor_test.go to include selfURL parameter

All 10 calls to NewMonitor in monitor_test.go were missing the new selfURL parameter that was added to the function signature. This was causing compilation failures in CI.

* Fix Go import formatting issues (gci linter)

Fixed import ordering in:
- pkg/vmcp/client/client.go
- pkg/vmcp/health/checker_test.go
- pkg/vmcp/health/monitor_test.go

* Fix Chart.yaml version - restore to 0.0.103

The version was incorrectly downgraded to 0.0.102. Restore it to 0.0.103 to match main branch.

* Bump Chart.yaml version to 0.0.104

The chart-testing tool requires version bumps to be higher than the base branch version (0.0.103).

* Update README.md version badge to 0.0.104

Match the Chart.yaml version update to satisfy helm-docs pre-commit hook.

* Refactor vMCP tracing and remove health checker self-check

Move telemetry provider initialization earlier in vmcp serve command to
enable distributed tracing in the aggregator. The aggregator now accepts
an explicit tracer provider parameter instead of using the global otel
tracer, following dependency injection best practices.

Improve tracing error handling by using named return values and deferred
error recording in aggregator methods, ensuring errors are properly
captured in traces.

Remove health checker self-check functionality that prevented the server
from checking its own health endpoint. This simplifies the implementation
and removes unnecessary URL normalization logic.

Changes:
- Add tracerProvider parameter to aggregator.NewDefaultAggregator
- Use noop tracer when provider is nil
- Improve span error handling with deferred recording
- Remove selfURL parameter from health.NewHealthChecker
- Delete pkg/vmcp/health/checker_selfcheck_test.go
- Update all tests to match new function signatures
- Add debug logging for auth strategy application in client

* Add explanatory comment for MCP SDK Meta limitations

Restores comment explaining why Meta field preservation is important
for ReadResource, in anticipation of future SDK improvements.

This addresses PR feedback to maintain context about the SDK's
current limitations regarding Meta field handling.

* Update test helper comments to clarify pod readiness contract

- Clarify that checkPodsReady waits for at least one pod (not all pods)
- Add context that helpers are used for single replica deployments
- Update comments on WaitForPodsReady and WaitForVirtualMCPServerReady

Addresses code review feedback from PR review.

* Complete error capture pattern in MergeCapabilities defer

- Add named return value (retErr error) to MergeCapabilities
- Add error capture in defer statement with span.RecordError and span.SetStatus
- Ensures consistent error handling pattern across all aggregator methods

This completes the implementation of the error capture pattern suggested
in code review for all methods with tracing spans.
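The named-return-plus-defer pattern described above can be sketched as follows. The `fakeSpan` type and the `mergeCapabilities` signature are hypothetical stand-ins so the example runs with only the standard library; the real code uses OpenTelemetry spans, whose `SetStatus` takes different arguments.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeSpan is a hypothetical stand-in for a tracing span. It only remembers
// what was reported to it.
type fakeSpan struct {
	recorded error
	status   string
}

func (s *fakeSpan) RecordError(err error)   { s.recorded = err }
func (s *fakeSpan) SetStatus(status string) { s.status = status }
func (s *fakeSpan) End()                    {}

// mergeCapabilities is illustrative only. The named return value retErr lets
// the deferred closure observe whichever error the function returns,
// regardless of which return statement produced it.
func mergeCapabilities(span *fakeSpan, fail bool) (retErr error) {
	defer func() {
		if retErr != nil {
			span.RecordError(retErr)
			span.SetStatus("error")
		}
		span.End()
	}()

	if fail {
		return errors.New("conflicting tool names")
	}
	return nil
}

func main() {
	span := &fakeSpan{}
	err := mergeCapabilities(span, true)
	fmt.Println("returned:", err)
	fmt.Println("span status:", span.status)
}
```

The benefit of this shape is that new early returns added later are captured in the trace without any extra bookkeeping.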

* Remove singleflight race condition fix

Moving the singleflight deduplication logic to a separate PR
as it addresses a different race condition from the one fixed in #3450.

The fix prevents duplicate capability aggregation when multiple
concurrent requests arrive simultaneously at startup.

* Add SPDX license headers to manager.go
dmjb pushed a commit that referenced this pull request Jan 28, 2026
* Infrastructure improvements and bugfixes for vMCP

dmjb added a commit that referenced this pull request Jan 28, 2026
* Infrastructure improvements and bugfixes for vMCP (#3439)

* Update E2E tests to reflect new registry error behavior

This change updates E2E tests to match the new HTTP status codes and
error messages introduced in the registry API improvements.

Changes:
- Update expected status codes:
  - 502 Bad Gateway: For validation errors (invalid JSON, missing servers)
  - 504 Gateway Timeout: For connectivity errors (unreachable hosts)
- Update expected error messages:
  - "Will use built-in registry" instead of "reset to default"
- Update test for api_url validation:
  - api_url now validates reachability (returns 504 for unreachable hosts)
  - Previously it only validated URL format

Updated tests:
1. "should reset to default with empty request"
   - Expected message: "Will use built-in registry"
2. "should return 502 for invalid JSON file"
   - Expected status: 502 (was 400)
3. "should return 502 for file without servers"
   - Expected status: 502 (was 400)
4. "should return 504 for URL pointing to unreachable host"
   - Expected status: 504 (was 400)
5. "should return 504 for api_url pointing to unreachable host"
   - Expected status: 504 (was 200)
   - Updated test name and comment to reflect new behavior

These changes validate that the registry API now properly distinguishes
between validation errors (502) and connectivity errors (504), providing
better semantics and user experience.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Nigel Brown <nigel@stacklok.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
dmjb added a commit that referenced this pull request Jan 29, 2026
This change consolidates registry configuration logic into a service layer
and improves error handling in both the API and CLI.

Changes:

Service Layer:
- Add RegistryConfigService interface (pkg/config/registry_service.go)
- Consolidates registry configuration operations (SetRegistryFromInput, UnsetRegistry)
- Auto-detects registry type (URL/API/File) and provides user-friendly messages
- Add comprehensive service tests (pkg/config/registry_service_test.go)
- Generate mocks for testing (pkg/config/mocks/mock_registry_service.go)

API Layer (pkg/api/v1/registry.go):
- Map structured errors to proper HTTP status codes:
  - 502 Bad Gateway: Validation errors (invalid JSON, missing servers)
  - 504 Gateway Timeout: Connectivity/timeout errors (unreachable hosts)
- Add isConnectivityError() and isValidationError() helpers
- Refactor updateRegistry() to use RegistryConfigService
- Add timeout integration tests (registry_timeout_test.go)
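A minimal sketch of the error-to-status mapping listed above, under stated assumptions: the sentinel errors, the bodies of `isConnectivityError` and `isValidationError`, the `statusFor` helper, and the 500 fallback are all hypothetical, since the real implementation is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors standing in for the structured errors
// mentioned in the commit message.
var (
	errValidation   = errors.New("registry validation failed")
	errConnectivity = errors.New("registry unreachable")
)

// errors.Is also matches errors that wrap the sentinel with %w.
func isValidationError(err error) bool   { return errors.Is(err, errValidation) }
func isConnectivityError(err error) bool { return errors.Is(err, errConnectivity) }

// statusFor maps an error to an HTTP status code. The 500 fallback for
// unclassified errors is an assumption of this sketch.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case isConnectivityError(err):
		return http.StatusGatewayTimeout // 504
	case isValidationError(err):
		return http.StatusBadGateway // 502
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("fetching registry: %w", errConnectivity)
	fmt.Println(statusFor(wrapped))
	fmt.Println(statusFor(errValidation))
}
```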

CLI Layer (cmd/thv/app/config.go):
- Use RegistryConfigService for cleaner code
- Add enhanceRegistryError() for user-friendly error messages
- Provide actionable hints for common failure scenarios
- Map error types to match API status codes (504, 502)

Benefits:
- Single source of truth for registry configuration logic
- Consistent error handling across API and CLI
- Better user experience with actionable error messages
- Easier testing with service abstraction

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* rename type

* Update E2E tests to reflect new registry error behaviour (#3481)

* Infrastructure improvements and bugfixes for vMCP (#3439)


Labels

size/S Small PR: 100-299 lines changed

Development

Successfully merging this pull request may close these issues.

vMCP pods cycle multiple times during creation since operator v0.5.26
