Skip to content

fix: enable cross-endpoint remote calls when required#196

Open
deanq wants to merge 77 commits intomainfrom
deanq/ae-2079-fix-bad-local-calls
Open

fix: enable cross-endpoint remote calls when required#196
deanq wants to merge 77 commits intomainfrom
deanq/ae-2079-fix-bad-local-calls

Conversation

@deanq
Copy link
Member

@deanq deanq commented Feb 12, 2026

Summary

Fixes "No module named 'torch'" error when @remote functions are called in deployments. The decorator now correctly makes HTTP calls to worker containers instead of executing locally.

This PR includes the complete fix plus supporting infrastructure for API key management and endpoint discovery optimization.

Changes

Main Fix (AE-2079)

  • Root Cause: ServiceRegistry couldn't discover endpoint URLs, causing @remote decorator to fall back to local execution
  • Solution: Pass resources_endpoints via FLASH_RESOURCES_ENDPOINTS env var and load endpoint URLs from environment
  • Result: Remote calls in deployed mode now use HTTP to worker containers, eliminating torch import errors in mothership container

Supporting Features

  1. API Key Management

    • ServiceRegistry checks makes_remote_calls to skip unnecessary State Manager queries for local-only endpoints
    • ServerlessResource injects RUNPOD_API_KEY for QB endpoints at deploy time
    • Comprehensive API key management documentation added
  2. Resource Name Normalization

    • Strip -fb suffix and live- prefix from resource names for manifest lookups
    • Ensures resources with these naming patterns are properly matched
  3. Integration Tests

    • 6 tests for API key injection behavior (QB vs LB, with/without remote calls)
    • 6 tests for State Manager optimization (local-only vs remote-capable endpoints)
    • All 12 tests passing

Files Changed

  • Core preview mode fixes: preview.py, service_registry.py, models.py
  • API key injection: serverless.py, load_balancer_sls_resource.py
  • Client updates: client.py
  • Documentation: docs/API_Key_Management.md, CLAUDE.md
  • Tests: test_api_key_injection.py, test_local_only_endpoint.py

Test Plan

  • ✅ All 12 new integration tests passing
  • ✅ Existing test suite passing (474 tests)
  • ✅ Quality checks passing (format, lint, type)
  • ✅ Verified remote calls work in preview mode without torch errors

Related

@deanq deanq changed the title fix(preview): enable cross-endpoint remote calls in preview mode fix: enable cross-endpoint remote calls in preview mode Feb 12, 2026
@deanq deanq changed the title fix: enable cross-endpoint remote calls in preview mode fix: enable cross-endpoint remote calls when required Feb 12, 2026
deanq added 18 commits February 12, 2026 17:11
Optimizes cross-endpoint communication by skipping State Manager queries
for local-only endpoints and injecting API keys only when needed.

- ServiceRegistry checks makes_remote_calls to skip unnecessary queries
- ServerlessResource injects RUNPOD_API_KEY for QB endpoints at deploy time
- Added comprehensive API key management documentation
Strip -fb suffix and live- prefix from resource names when looking up
configuration in manifest to ensure resources with these naming patterns
are properly matched.
Fixes 'No module named 'torch'' error when @Remote functions are called
in preview deployments. The decorator now correctly makes HTTP calls to
worker containers instead of executing locally.

Root cause: ServiceRegistry couldn't discover endpoint URLs in preview mode,
causing @Remote decorator to fall back to local execution.

Solution:
- Pass resources_endpoints via FLASH_RESOURCES_ENDPOINTS env var
- Load endpoint URLs from environment in ServiceRegistry
- Add custom_endpoint_url field to LoadBalancerSlsResource
- Map queue-based to load-balancer images in preview mode
- Skip template validation when connecting to existing endpoints
- Integrate ServiceRegistry into @Remote decorator wrapper
- Add resources_endpoints field to Manifest model

Verification:
- Remote calls in preview mode now use HTTP to worker containers
- No more torch import errors in mothership container
- Proper endpoint URL construction for Docker DNS

Related: AE-2079
…ptimization

Add comprehensive integration tests for:
- API key injection during QB endpoint deployment (makes_remote_calls flag)
- State Manager query optimization for local-only endpoints
- Safe defaults and edge case handling for both features

Test coverage:
- 6 tests for API key injection behavior (QB vs LB, with/without remote calls)
- 6 tests for State Manager optimization (local-only vs remote-capable endpoints)

All tests passing (12/12).

Related: feat(runtime) commit 32dd688, fix(serverless) commit 556106d
Adds Flash environment ID as environment variable when creating resources.
This enables runtime queries to State Manager using the correct ID.

Part of ServiceRegistry empty endpoints fix.
Injects FLASH_ENVIRONMENT_ID for both QB and LB endpoints to enable
State Manager queries at runtime.

API key injection remains QB-only (env var). LB endpoints use
Authorization headers from requests, not environment variables.

Part of ServiceRegistry empty endpoints fix.
Changes runtime State Manager queries to use FLASH_ENVIRONMENT_ID
(with fallback to RUNPOD_ENDPOINT_ID) instead of just RUNPOD_ENDPOINT_ID.

Root cause: Manifest stored under Flash environment ID, not individual
endpoint ID. Using wrong ID resulted in empty resources_endpoints.

Adds diagnostic logging to help troubleshoot manifest retrieval issues.

Fixes: Load balancer endpoints unable to discover child endpoints.
Logs endpoint URLs being uploaded to State Manager during deployment.
Helps troubleshoot manifest synchronization issues between CLI and runtime.

Part of ServiceRegistry empty endpoints fix.
Logs State Manager responses including presence of resources_endpoints.
Helps troubleshoot runtime manifest loading issues.

Part of ServiceRegistry empty endpoints fix.
Adds makes_remote_calls: True to load-balancer and mothership manifest
entries. This ensures ServiceRegistry correctly identifies these endpoint
types as needing State Manager queries for cross-endpoint routing.

Part of ServiceRegistry empty endpoints fix.
Adds safe-default handling when makes_remote_calls is null/None in manifest.
Returns True (allow remote calls) to ensure functionality works even with
incomplete manifest data.

Also adds unit tests for _check_makes_remote_calls():
- Returns True when makes_remote_calls is True
- Returns False when makes_remote_calls is False
- Returns True (safe default) when makes_remote_calls is null
- Returns True (safe default) when resource not found in manifest

Part of ServiceRegistry empty endpoints fix.
Removes runtime reconciliation pattern where mothership endpoints would
auto-provision child endpoints on startup. This is replaced by CLI-time
provisioning during `flash deploy`.

Changes:
- Remove reconcile_children(), is_mothership(), get_mothership_url()
- Remove runtime startup reconciliation hook from LB handler
- Remove is_mothership flag and FLASH_IS_MOTHERSHIP env var
- Remove is_explicit flag from manifest (no longer needed)
- Delete integration tests for reconciliation behavior
- Update unit tests to not expect removed flags
- Update deployment.py, preview.py, deploy.py to check resource name

All resources are now provisioned upfront by the CLI, enabling a simpler
peer-to-peer architecture where endpoints discover each other via State
Manager using FLASH_ENVIRONMENT_ID.

Part of ServiceRegistry empty endpoints fix.
…ning

Updates comments and documentation in test-mothership command to clarify
that it uses CLI-time provisioning (not runtime reconciliation).

Changes:
- Replace 'auto-provisioning' with 'CLI-time provisioning'
- Update objective text to remove reconciliation references
- Clarify that all provisioning happens before server starts
- Remove outdated --auto-provision flag references

The test-mothership command behavior is unchanged - it still provisions
resources upfront before starting the server. This commit just fixes
misleading terminology that suggested runtime reconciliation.
The test-mothership command was designed to test runtime reconciliation,
which was removed in commits 880c905 and 881a72f. The command is now
obsolete as it only tests CLI provisioning in Docker, which is already
covered by 'flash deploy --preview'.

Changes:
- Delete test_mothership.py command file (457 lines)
- Remove FLASH_IS_TEST_MOTHERSHIP flag handling from mothership_provisioner.py
- Simplify resource naming logic (removed tmp- prefix logic)
- Update comments in manifest.py and scanner.py
- Remove documentation references from Flash_Deploy_Guide.md and VERIFICATION.md

Users should use 'flash deploy --preview' for local testing instead.

Part of ServiceRegistry empty endpoints fix.
…eploy

Removing is_mothership from the manifest (880c905) caused config mismatch
between local and state manifests. This triggered unnecessary re-provisioning
of the mothership on every deploy. When re-provisioning an unchanged endpoint,
RunPod returns endpoint_url=None, so the URL is never written to
resources_endpoints, breaking the post-deploy guidance display.

- Restore is_mothership: True in _create_mothership_resource()
- Restore is_mothership: True in _create_mothership_from_explicit()
- Restore is_mothership flag check in _display_post_deployment_guidance()
  alongside the existing name-based check for forwards compatibility
Post-deploy guidance wasn't showing routes because it only iterated over
resources_endpoints dict, which is null/empty until after deployment
provisions resources. Routes exist in the manifest's routes section but
were never displayed.

Add fallback logic to search resources dict for mothership and extract
routes from manifest, even when resources_endpoints is empty. This allows
routes to be displayed pre-deployment and handles custom mothership names
via the is_mothership flag.

- Add fallback loop after primary resources_endpoints check
- Handles both standard ("mothership") and custom names via is_mothership
- Preserves existing behavior when resources_endpoints is populated
Mothership routes were not being added to manifest's routes dict when
using custom naming patterns (e.g., '03_05_load_balancer-mothership').
The route extraction logic only checked for exact names 'mothership'
or 'mothership-entrypoint', missing projects with {project_name}-mothership
naming.

This caused post-deploy guidance to show 'No routes found in manifest'
even though routes were defined in the mothership's functions array.

Change: Use is_mothership flag instead of exact name matching to extract
routes from all mothership resources regardless of naming pattern.

- Replace hardcoded name check with is_mothership flag
- Ensures routes extracted for custom-named mothership resources
- Aligns with how deploy.py already identifies mothership
- Single line change with broad compatibility impact
@deanq deanq force-pushed the deanq/ae-2079-fix-bad-local-calls branch from d7ac644 to ae12502 Compare February 13, 2026 01:18
deanq and others added 3 commits February 12, 2026 21:34
FastAPI route handlers were never analyzed for calls to @Remote functions,
causing makes_remote_calls flag to always be False even when mothership
called remote endpoints.

Changes:
- scanner.py: Add remote_function_names parameter to detect_main_app()
- scanner.py: Add _analyze_fastapi_routes_for_remote_calls() function
- scanner.py: Add _analyze_function_calls_for_route() function
- manifest.py: Pass remote_function_names to detect_main_app()
- manifest.py: Calculate makes_remote_calls from analyzed routes

Impact:
- Mothership with remote calls: makes_remote_calls=True (gets API key)
- Mothership without remote calls: makes_remote_calls=False (skips State Manager)
- Enables proper API key injection for QB workers
- Enables service registry optimization

Fixes: AE-2079
…rn type

Cleanup unused parameter and fix return value bug in deployment flow.

Changes:
- Remove mothership_url parameter from create_resource_from_manifest()
- Fix deploy_to_environment() to return resources_endpoints dict instead of deploy API response
- Update return type annotation from Dict[str, Any] to Dict[str, str]
- Add comprehensive docstring with Args/Returns/Raises
- Update tests to match new return type expectations

Impact:
- Cleaner API (no dead parameters)
- Correct return values (endpoint URLs instead of API response)
- Better type safety and documentation
@deanq deanq force-pushed the deanq/ae-2079-fix-bad-local-calls branch from 8b326ca to b906427 Compare February 13, 2026 07:23
Add API key injection in preview mode for resources configured with
makes_remote_calls=True. This ensures workers can authenticate remote
calls to other endpoints.

Changes:
- Parse makes_remote_calls flag from manifest
- Inject RUNPOD_API_KEY env var during container startup
- Warn if API key needed but not set
- Add integration tests for all three PRD scenarios:
  1. Mothership → GPU worker
  2. Mothership → CPU worker → GPU worker
  3. Mothership local-only

Tests validate resource parsing, endpoint routing, API key context,
and environment variable fallback behavior.
@deanq deanq force-pushed the deanq/ae-2079-fix-bad-local-calls branch from b906427 to efac58d Compare February 13, 2026 07:26
…ment

Fix API key propagation for load-balanced endpoints by checking request
context before falling back to environment variables in three locations:

1. RunpodGraphQLClient.__init__() - State Manager and GraphQL clients
2. RunpodRestClient.__init__() - REST API clients
3. ServerlessResource.endpoint property - runpod-python SDK endpoint client

Changes establish unified priority: explicit param > request context > env var

This fixes remote calls and State Manager queries for load-balanced endpoints
that receive API keys via Authorization header rather than environment variables.

All three components now automatically pick up per-request API keys from context,
enabling multi-endpoint scenarios to work correctly.

Fixes scenarios:
- Mothership (LB) + GPU Worker (QB)
- Mothership (LB) + CPU Worker (QB) + GPU Worker (QB)
- Mothership (LB) only (local execution)
- Add FLASH_IS_MOTHERSHIP flag to distinguish mothership from worker endpoints
- Extract main_file and app_variable from manifest for runtime provisioning
- Store only filename (not absolute path) for container portability
- Set FLASH_MAIN_FILE and FLASH_APP_VARIABLE in provisioner environment

Fixes deployed mothership endpoints booting in child endpoint mode.
Mothership now correctly loads user code from main.py at runtime.
Enhances API key propagation across three layers:

1. Middleware (lb_handler.py): Store keys in both contextvars and request.state
   for better propagation to async functions and explicit access.

2. Service Registry (service_registry.py): Retrieve API key from request
   context and pass explicitly to State Manager client.

3. State Manager Client (state_manager_client.py): Accept explicit api_key
   parameter at instance-level and per-request, with fallback to context/env.

4. Serverless Resource (serverless.py): Use request context API key for
   LB endpoint deployments during request processing.

This ensures endpoints making remote calls have authenticated access to:
- Other endpoints in the Flash application
- State Manager GraphQL API for service discovery

Tests updated/added to validate API key propagation through all layers.

Fixes: AE-2079
deanq added 30 commits February 13, 2026 11:15
Adds makes_remote_calls: True to load-balancer and mothership manifest
entries. This ensures ServiceRegistry correctly identifies these endpoint
types as needing State Manager queries for cross-endpoint routing.

Part of ServiceRegistry empty endpoints fix.
Adds safe-default handling when makes_remote_calls is null/None in manifest.
Returns True (allow remote calls) to ensure functionality works even with
incomplete manifest data.

Also adds unit tests for _check_makes_remote_calls():
- Returns True when makes_remote_calls is True
- Returns False when makes_remote_calls is False
- Returns True (safe default) when makes_remote_calls is null
- Returns True (safe default) when resource not found in manifest

Part of ServiceRegistry empty endpoints fix.
Removes runtime reconciliation pattern where mothership endpoints would
auto-provision child endpoints on startup. This is replaced by CLI-time
provisioning during `flash deploy`.

Changes:
- Remove reconcile_children(), is_mothership(), get_mothership_url()
- Remove runtime startup reconciliation hook from LB handler
- Remove is_mothership flag and FLASH_IS_MOTHERSHIP env var
- Remove is_explicit flag from manifest (no longer needed)
- Delete integration tests for reconciliation behavior
- Update unit tests to not expect removed flags
- Update deployment.py, preview.py, deploy.py to check resource name

All resources are now provisioned upfront by the CLI, enabling a simpler
peer-to-peer architecture where endpoints discover each other via State
Manager using FLASH_ENVIRONMENT_ID.

Part of ServiceRegistry empty endpoints fix.
…ning

Updates comments and documentation in test-mothership command to clarify
that it uses CLI-time provisioning (not runtime reconciliation).

Changes:
- Replace 'auto-provisioning' with 'CLI-time provisioning'
- Update objective text to remove reconciliation references
- Clarify that all provisioning happens before server starts
- Remove outdated --auto-provision flag references

The test-mothership command behavior is unchanged - it still provisions
resources upfront before starting the server. This commit just fixes
misleading terminology that suggested runtime reconciliation.
The test-mothership command was designed to test runtime reconciliation,
which was removed in commits 880c905 and 881a72f. The command is now
obsolete as it only tests CLI provisioning in Docker, which is already
covered by 'flash deploy --preview'.

Changes:
- Delete test_mothership.py command file (457 lines)
- Remove FLASH_IS_TEST_MOTHERSHIP flag handling from mothership_provisioner.py
- Simplify resource naming logic (removed tmp- prefix logic)
- Update comments in manifest.py and scanner.py
- Remove documentation references from Flash_Deploy_Guide.md and VERIFICATION.md

Users should use 'flash deploy --preview' for local testing instead.

Part of ServiceRegistry empty endpoints fix.
…eploy

Removing is_mothership from the manifest (880c905) caused config mismatch
between local and state manifests. This triggered unnecessary re-provisioning
of the mothership on every deploy. When re-provisioning an unchanged endpoint,
RunPod returns endpoint_url=None, so the URL is never written to
resources_endpoints, breaking the post-deploy guidance display.

- Restore is_mothership: True in _create_mothership_resource()
- Restore is_mothership: True in _create_mothership_from_explicit()
- Restore is_mothership flag check in _display_post_deployment_guidance()
  alongside the existing name-based check for forwards compatibility
Post-deploy guidance wasn't showing routes because it only iterated over
resources_endpoints dict, which is null/empty until after deployment
provisions resources. Routes exist in the manifest's routes section but
were never displayed.

Add fallback logic to search resources dict for mothership and extract
routes from manifest, even when resources_endpoints is empty. This allows
routes to be displayed pre-deployment and handles custom mothership names
via the is_mothership flag.

- Add fallback loop after primary resources_endpoints check
- Handles both standard ("mothership") and custom names via is_mothership
- Preserves existing behavior when resources_endpoints is populated
Mothership routes were not being added to manifest's routes dict when
using custom naming patterns (e.g., '03_05_load_balancer-mothership').
The route extraction logic only checked for exact names 'mothership'
or 'mothership-entrypoint', missing projects with {project_name}-mothership
naming.

This caused post-deploy guidance to show 'No routes found in manifest'
even though routes were defined in the mothership's functions array.

Change: Use is_mothership flag instead of exact name matching to extract
routes from all mothership resources regardless of naming pattern.

- Replace hardcoded name check with is_mothership flag
- Ensures routes extracted for custom-named mothership resources
- Aligns with how deploy.py already identifies mothership
- Single line change with broad compatibility impact
FastAPI route handlers were never analyzed for calls to @Remote functions,
causing makes_remote_calls flag to always be False even when mothership
called remote endpoints.

Changes:
- scanner.py: Add remote_function_names parameter to detect_main_app()
- scanner.py: Add _analyze_fastapi_routes_for_remote_calls() function
- scanner.py: Add _analyze_function_calls_for_route() function
- manifest.py: Pass remote_function_names to detect_main_app()
- manifest.py: Calculate makes_remote_calls from analyzed routes

Impact:
- Mothership with remote calls: makes_remote_calls=True (gets API key)
- Mothership without remote calls: makes_remote_calls=False (skips State Manager)
- Enables proper API key injection for QB workers
- Enables service registry optimization

Fixes: AE-2079
…rn type

Cleanup unused parameter and fix return value bug in deployment flow.

Changes:
- Remove mothership_url parameter from create_resource_from_manifest()
- Fix deploy_to_environment() to return resources_endpoints dict instead of deploy API response
- Update return type annotation from Dict[str, Any] to Dict[str, str]
- Add comprehensive docstring with Args/Returns/Raises
- Update tests to match new return type expectations

Impact:
- Cleaner API (no dead parameters)
- Correct return values (endpoint URLs instead of API response)
- Better type safety and documentation
Add API key injection in preview mode for resources configured with
makes_remote_calls=True. This ensures workers can authenticate remote
calls to other endpoints.

Changes:
- Parse makes_remote_calls flag from manifest
- Inject RUNPOD_API_KEY env var during container startup
- Warn if API key needed but not set
- Add integration tests for all three PRD scenarios:
  1. Mothership → GPU worker
  2. Mothership → CPU worker → GPU worker
  3. Mothership local-only

Tests validate resource parsing, endpoint routing, API key context,
and environment variable fallback behavior.
…ment

Fix API key propagation for load-balanced endpoints by checking request
context before falling back to environment variables in three locations:

1. RunpodGraphQLClient.__init__() - State Manager and GraphQL clients
2. RunpodRestClient.__init__() - REST API clients
3. ServerlessResource.endpoint property - runpod-python SDK endpoint client

Changes establish unified priority: explicit param > request context > env var

This fixes remote calls and State Manager queries for load-balanced endpoints
that receive API keys via Authorization header rather than environment variables.

All three components now automatically pick up per-request API keys from context,
enabling multi-endpoint scenarios to work correctly.

Fixes scenarios:
- Mothership (LB) + GPU Worker (QB)
- Mothership (LB) + CPU Worker (QB) + GPU Worker (QB)
- Mothership (LB) only (local execution)
- Add FLASH_IS_MOTHERSHIP flag to distinguish mothership from worker endpoints
- Extract main_file and app_variable from manifest for runtime provisioning
- Store only filename (not absolute path) for container portability
- Set FLASH_MAIN_FILE and FLASH_APP_VARIABLE in provisioner environment

Fixes deployed mothership endpoints booting in child endpoint mode.
Mothership now correctly loads user code from main.py at runtime.
Enhances API key propagation across three layers:

1. Middleware (lb_handler.py): Store keys in both contextvars and request.state
   for better propagation to async functions and explicit access.

2. Service Registry (service_registry.py): Retrieve API key from request
   context and pass explicitly to State Manager client.

3. State Manager Client (state_manager_client.py): Accept explicit api_key
   parameter at instance-level and per-request, with fallback to context/env.

4. Serverless Resource (serverless.py): Use request context API key for
   LB endpoint deployments during request processing.

This ensures endpoints making remote calls have authenticated access to:
- Other endpoints in the Flash application
- State Manager GraphQL API for service discovery

Tests updated/added to validate API key propagation through all layers.

Fixes: AE-2079
…ainers

Prevent @Remote decorator and remote classes from attempting to deploy resources
at runtime in deployed environments. Resources must be deployed via 'flash deploy'
CLI command before runtime execution.

Changes:
- Guard @Remote decorator wrapper with deployment context check
- Guard remote class initialization with deployment context check
- Add optional allow_runtime_deployment safeguard to ResourceManager (defense-in-depth)
- Raise clear error messages when endpoints missing from manifest in deployed env
- Preserve local development behavior (on-demand deployment still works)

Fixes #2079: Resources were being deployed at runtime when mothership received
requests to call remote functions, causing "RUNPOD_API_KEY not available" errors.

Test coverage:
- 6 new tests verify guard behavior for functions and classes
- 966 existing tests pass with no regressions
Update test to match new API key parameter signature.

Related to: AE-2079
…turns empty endpoints

Remote function calls were being blocked with "endpoint not found" error when State Manager
returned empty resources_endpoints dict, even though local manifest had valid endpoints.

Root Cause:
- _ensure_manifest_loaded() checked if State Manager returned empty endpoints
- Logged error but didn't fallback to local manifest
- Always set endpoint_registry = {} instead of using local manifest

Solution:
- Add fallback logic when State Manager returns empty endpoints (lines 262-279)
- Check local manifest.resources_endpoints and use if available
- Log warning and fallback endpoint count for debugging
- Preserve existing behavior when State Manager returns valid endpoints
- Keep exception handler unchanged (already had correct fallback)

Test Coverage:
- 7 new unit tests verify all scenarios:
  - Empty State Manager + local manifest = fallback works ✓
  - Empty State Manager + no local fallback = graceful error ✓
  - Valid State Manager = uses State Manager (no regression) ✓
  - State Manager exception = local fallback unchanged ✓
  - Cache TTL, env var handling, client None = unchanged behavior ✓

Quality:
- All 40 service registry tests pass
- 69.01% overall coverage (exceeds 65% requirement)
- Fully formatted and linted
- Backward compatible

Fixes: Remote function calls no longer fail when State Manager returns empty endpoints
…turns empty endpoints

When State Manager query returns empty resources_endpoints:
1. Log a warning about State Manager returning empty
2. Fall back to local manifest resources_endpoints if available
3. Only set endpoint_registry to empty if no local fallback

Also handles State Manager exceptions with same fallback behavior.

This ensures service discovery works even if State Manager is temporarily
unavailable or returns incomplete data, as long as the local manifest has
endpoint mappings from a previous successful State Manager query or from
preview mode environment variables.

Fixes failing test: test_state_manager_empty_endpoints_falls_back_to_local_manifest
Add targeted logging at critical sync points to trace the flow of manifest
data between deployment and runtime:

Phase 1 (WRITE path - deployment → State Manager):
- Log resources_endpoints mapping after building
- Log manifest upload with build_id and environment_id
- Log GraphQL mutation success and endpoint count
- Log environment activation with build_id

Phase 2 (READ path - runtime → State Manager):
- Log environment_id and both FLASH_ENVIRONMENT_ID and RUNPOD_ENDPOINT_ID
- Log manifest keys and resources_endpoints availability
- Log environment → activeBuildId mapping
- Log endpoint count from State Manager

Changes allow identifying:
- ID mismatches between write and read paths
- Race conditions (environment activated before resources provisioned)
- Wrong build queried (activeBuildId points to old build)
- Manifest upload failures (mutation succeeded but data not persisted)
- Missing resources_endpoints in State Manager manifest

This helps debug the RuntimeError: Remote function endpoint not found issue.

Files modified:
- src/runpod_flash/cli/utils/deployment.py
- src/runpod_flash/core/api/runpod.py
- src/runpod_flash/runtime/service_registry.py
- src/runpod_flash/runtime/state_manager_client.py
Mock ResourceManager to prevent fallback paths and avoid test ordering issues
in parallel test execution. Use regular Mock instead of AsyncMock to prevent
interaction with object pickling tests.

This ensures test_remote_decorator_uses_service_registry_in_deployed_env
passes consistently in CI/CD parallel execution environments.
Rename 'mothership_id' to 'flash_environment_id' in StateManagerClient methods
to accurately reflect what the parameter represents:

- get_persisted_manifest(mothership_id) → get_persisted_manifest(flash_environment_id)
- update_resource_state(mothership_id) → update_resource_state(flash_environment_id)
- remove_resource_state(mothership_id) → remove_resource_state(flash_environment_id)
- _fetch_build_and_manifest(mothership_id) → _fetch_build_and_manifest(flash_environment_id)

The parameter represents the Flash environment ID (from FLASH_ENVIRONMENT_ID env var),
not the mothership endpoint ID (from RUNPOD_ENDPOINT_ID env var). This naming
clarifies the distinction and prevents confusion about which ID is being used.
Two separate issues fixed:

1. **Undefined variable references in state_manager_client.py**:
   Corrected parameter name from mothership_id to flash_environment_id in
   debug logging statements.

2. **ContextVar pickling error in parallel test execution**:
   cloudpickle was attempting to serialize the module-level ContextVar from
   api_key_context, which cannot be pickled. Added __reduce_ex__ method to
   BaseResource to use custom pickling that avoids serializing module state,
   while preserving all Pydantic model state via __getstate__/__setstate__.

Fixes:
- test_pickled_resource_preserves_id: Now passes
- test_cpu_structural_changes_false_positives: Now passes
- CI/CD parallel execution: ContextVar pickling error resolved

Tests: All 1028 tests pass, coverage: 69.34%
Use __reduce__ instead of __reduce_ex__ for broader compatibility.
Explicitly remove cached _cached_resource_id from __getstate__
to prevent pickle protocol variations from causing issues.

This resolves ContextVar pickling failures in Python 3.10-3.12
that prevented test_pickled_resource_preserves_id from passing.
Change exception handler in ServiceRegistry._ensure_manifest_loaded() from DEBUG
to ERROR level logging when State Manager query fails. This makes failures
visible during normal operation instead of requiring debug mode.

Previously, State Manager failures were silently caught and logged only at
DEBUG level, making them invisible in production. The fallback to local
manifest would succeed (hiding the real error), and remote function calls
would fail later with 'endpoint not found' errors.

Now, when State Manager is unavailable, the actual error is logged at ERROR
level with full exception details, making debugging much easier.
…fic types

Previously, the retry loop only caught specific exception types:
- asyncio.TimeoutError
- ManifestServiceUnavailableError
- GraphQLError
- ConnectionError

If any other exception type was raised, it would bypass the retry logic and
propagate directly, making it hard to diagnose what went wrong.

Now catches Exception to handle all cases, with detailed logging of the actual
exception type and message on the final failure attempt.
When StateManagerClient methods are called without an explicit API key,
retrieve the API key from the request context (set by lb_handler middleware
from the Authorization header) before falling back to the environment variable.

This prevents 'RUNPOD_API_KEY environment variable is required' errors from
runpod-python SDK when the API key is available in the request context.

The lookup chain is now:
1. api_key parameter (explicit)
2. instance api_key (from __init__)
3. request context (from Authorization header via middleware)
4. environment variable (RUNPOD_API_KEY)

Fixes mothership State Manager queries in load-balanced deployments.

Note: One test (test_remote_decorator_uses_service_registry_in_deployed_env)
expects setup that provides API key in context or env.
FLASH_ENVIRONMENT_ID and RUNPOD_ENDPOINT_ID are distinct concepts:
- FLASH_ENVIRONMENT_ID: deployment environment set by mothership provisioner
- RUNPOD_ENDPOINT_ID: individual container endpoint ID

They should not be interchangeable fallbacks. Require explicit
FLASH_ENVIRONMENT_ID for State Manager queries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants