fix: enable cross-endpoint remote calls when required#196
Open
fix: enable cross-endpoint remote calls when required#196
Conversation
Optimizes cross-endpoint communication by skipping State Manager queries for local-only endpoints and injecting API keys only when needed. - ServiceRegistry checks makes_remote_calls to skip unnecessary queries - ServerlessResource injects RUNPOD_API_KEY for QB endpoints at deploy time - Added comprehensive API key management documentation
Strip -fb suffix and live- prefix from resource names when looking up configuration in manifest to ensure resources with these naming patterns are properly matched.
Fixes 'No module named 'torch'' error when @Remote functions are called in preview deployments. The decorator now correctly makes HTTP calls to worker containers instead of executing locally. Root cause: ServiceRegistry couldn't discover endpoint URLs in preview mode, causing @Remote decorator to fall back to local execution. Solution: - Pass resources_endpoints via FLASH_RESOURCES_ENDPOINTS env var - Load endpoint URLs from environment in ServiceRegistry - Add custom_endpoint_url field to LoadBalancerSlsResource - Map queue-based to load-balancer images in preview mode - Skip template validation when connecting to existing endpoints - Integrate ServiceRegistry into @Remote decorator wrapper - Add resources_endpoints field to Manifest model Verification: - Remote calls in preview mode now use HTTP to worker containers - No more torch import errors in mothership container - Proper endpoint URL construction for Docker DNS Related: AE-2079
…ptimization Add comprehensive integration tests for: - API key injection during QB endpoint deployment (makes_remote_calls flag) - State Manager query optimization for local-only endpoints - Safe defaults and edge case handling for both features Test coverage: - 6 tests for API key injection behavior (QB vs LB, with/without remote calls) - 6 tests for State Manager optimization (local-only vs remote-capable endpoints) All tests passing (12/12). Related: feat(runtime) commit 32dd688, fix(serverless) commit 556106d
Adds Flash environment ID as environment variable when creating resources. This enables runtime queries to State Manager using the correct ID. Part of ServiceRegistry empty endpoints fix.
Injects FLASH_ENVIRONMENT_ID for both QB and LB endpoints to enable State Manager queries at runtime. API key injection remains QB-only (env var). LB endpoints use Authorization headers from requests, not environment variables. Part of ServiceRegistry empty endpoints fix.
Changes runtime State Manager queries to use FLASH_ENVIRONMENT_ID (with fallback to RUNPOD_ENDPOINT_ID) instead of just RUNPOD_ENDPOINT_ID. Root cause: Manifest stored under Flash environment ID, not individual endpoint ID. Using wrong ID resulted in empty resources_endpoints. Adds diagnostic logging to help troubleshoot manifest retrieval issues. Fixes: Load balancer endpoints unable to discover child endpoints.
Logs endpoint URLs being uploaded to State Manager during deployment. Helps troubleshoot manifest synchronization issues between CLI and runtime. Part of ServiceRegistry empty endpoints fix.
Logs State Manager responses including presence of resources_endpoints. Helps troubleshoot runtime manifest loading issues. Part of ServiceRegistry empty endpoints fix.
Adds makes_remote_calls: True to load-balancer and mothership manifest entries. This ensures ServiceRegistry correctly identifies these endpoint types as needing State Manager queries for cross-endpoint routing. Part of ServiceRegistry empty endpoints fix.
Adds safe-default handling when makes_remote_calls is null/None in manifest. Returns True (allow remote calls) to ensure functionality works even with incomplete manifest data. Also adds unit tests for _check_makes_remote_calls(): - Returns True when makes_remote_calls is True - Returns False when makes_remote_calls is False - Returns True (safe default) when makes_remote_calls is null - Returns True (safe default) when resource not found in manifest Part of ServiceRegistry empty endpoints fix.
Removes runtime reconciliation pattern where mothership endpoints would auto-provision child endpoints on startup. This is replaced by CLI-time provisioning during `flash deploy`. Changes: - Remove reconcile_children(), is_mothership(), get_mothership_url() - Remove runtime startup reconciliation hook from LB handler - Remove is_mothership flag and FLASH_IS_MOTHERSHIP env var - Remove is_explicit flag from manifest (no longer needed) - Delete integration tests for reconciliation behavior - Update unit tests to not expect removed flags - Update deployment.py, preview.py, deploy.py to check resource name All resources are now provisioned upfront by the CLI, enabling a simpler peer-to-peer architecture where endpoints discover each other via State Manager using FLASH_ENVIRONMENT_ID. Part of ServiceRegistry empty endpoints fix.
…ning Updates comments and documentation in test-mothership command to clarify that it uses CLI-time provisioning (not runtime reconciliation). Changes: - Replace 'auto-provisioning' with 'CLI-time provisioning' - Update objective text to remove reconciliation references - Clarify that all provisioning happens before server starts - Remove outdated --auto-provision flag references The test-mothership command behavior is unchanged - it still provisions resources upfront before starting the server. This commit just fixes misleading terminology that suggested runtime reconciliation.
The test-mothership command was designed to test runtime reconciliation, which was removed in commits 880c905 and 881a72f. The command is now obsolete as it only tests CLI provisioning in Docker, which is already covered by 'flash deploy --preview'. Changes: - Delete test_mothership.py command file (457 lines) - Remove FLASH_IS_TEST_MOTHERSHIP flag handling from mothership_provisioner.py - Simplify resource naming logic (removed tmp- prefix logic) - Update comments in manifest.py and scanner.py - Remove documentation references from Flash_Deploy_Guide.md and VERIFICATION.md Users should use 'flash deploy --preview' for local testing instead. Part of ServiceRegistry empty endpoints fix.
…eploy Removing is_mothership from the manifest (880c905) caused config mismatch between local and state manifests. This triggered unnecessary re-provisioning of the mothership on every deploy. When re-provisioning an unchanged endpoint, RunPod returns endpoint_url=None, so the URL is never written to resources_endpoints, breaking the post-deploy guidance display. - Restore is_mothership: True in _create_mothership_resource() - Restore is_mothership: True in _create_mothership_from_explicit() - Restore is_mothership flag check in _display_post_deployment_guidance() alongside the existing name-based check for forwards compatibility
Post-deploy guidance wasn't showing routes because it only iterated over
resources_endpoints dict, which is null/empty until after deployment
provisions resources. Routes exist in the manifest's routes section but
were never displayed.
Add fallback logic to search resources dict for mothership and extract
routes from manifest, even when resources_endpoints is empty. This allows
routes to be displayed pre-deployment and handles custom mothership names
via the is_mothership flag.
- Add fallback loop after primary resources_endpoints check
- Handles both standard ("mothership") and custom names via is_mothership
- Preserves existing behavior when resources_endpoints is populated
Mothership routes were not being added to manifest's routes dict when
using custom naming patterns (e.g., '03_05_load_balancer-mothership').
The route extraction logic only checked for exact names 'mothership'
or 'mothership-entrypoint', missing projects with {project_name}-mothership
naming.
This caused post-deploy guidance to show 'No routes found in manifest'
even though routes were defined in the mothership's functions array.
Change: Use is_mothership flag instead of exact name matching to extract
routes from all mothership resources regardless of naming pattern.
- Replace hardcoded name check with is_mothership flag
- Ensures routes extracted for custom-named mothership resources
- Aligns with how deploy.py already identifies mothership
- Single line change with broad compatibility impact
d7ac644 to
ae12502
Compare
FastAPI route handlers were never analyzed for calls to @Remote functions, causing makes_remote_calls flag to always be False even when mothership called remote endpoints. Changes: - scanner.py: Add remote_function_names parameter to detect_main_app() - scanner.py: Add _analyze_fastapi_routes_for_remote_calls() function - scanner.py: Add _analyze_function_calls_for_route() function - manifest.py: Pass remote_function_names to detect_main_app() - manifest.py: Calculate makes_remote_calls from analyzed routes Impact: - Mothership with remote calls: makes_remote_calls=True (gets API key) - Mothership without remote calls: makes_remote_calls=False (skips State Manager) - Enables proper API key injection for QB workers - Enables service registry optimization Fixes: AE-2079
…rn type Cleanup unused parameter and fix return value bug in deployment flow. Changes: - Remove mothership_url parameter from create_resource_from_manifest() - Fix deploy_to_environment() to return resources_endpoints dict instead of deploy API response - Update return type annotation from Dict[str, Any] to Dict[str, str] - Add comprehensive docstring with Args/Returns/Raises - Update tests to match new return type expectations Impact: - Cleaner API (no dead parameters) - Correct return values (endpoint URLs instead of API response) - Better type safety and documentation
8b326ca to
b906427
Compare
Add API key injection in preview mode for resources configured with makes_remote_calls=True. This ensures workers can authenticate remote calls to other endpoints. Changes: - Parse makes_remote_calls flag from manifest - Inject RUNPOD_API_KEY env var during container startup - Warn if API key needed but not set - Add integration tests for all three PRD scenarios: 1. Mothership → GPU worker 2. Mothership → CPU worker → GPU worker 3. Mothership local-only Tests validate resource parsing, endpoint routing, API key context, and environment variable fallback behavior.
b906427 to
efac58d
Compare
…ment Fix API key propagation for load-balanced endpoints by checking request context before falling back to environment variables in three locations: 1. RunpodGraphQLClient.__init__() - State Manager and GraphQL clients 2. RunpodRestClient.__init__() - REST API clients 3. ServerlessResource.endpoint property - runpod-python SDK endpoint client Changes establish unified priority: explicit param > request context > env var This fixes remote calls and State Manager queries for load-balanced endpoints that receive API keys via Authorization header rather than environment variables. All three components now automatically pick up per-request API keys from context, enabling multi-endpoint scenarios to work correctly. Fixes scenarios: - Mothership (LB) + GPU Worker (QB) - Mothership (LB) + CPU Worker (QB) + GPU Worker (QB) - Mothership (LB) only (local execution)
- Add FLASH_IS_MOTHERSHIP flag to distinguish mothership from worker endpoints - Extract main_file and app_variable from manifest for runtime provisioning - Store only filename (not absolute path) for container portability - Set FLASH_MAIN_FILE and FLASH_APP_VARIABLE in provisioner environment Fixes deployed mothership endpoints booting in child endpoint mode. Mothership now correctly loads user code from main.py at runtime.
Enhances API key propagation across three layers: 1. Middleware (lb_handler.py): Store keys in both contextvars and request.state for better propagation to async functions and explicit access. 2. Service Registry (service_registry.py): Retrieve API key from request context and pass explicitly to State Manager client. 3. State Manager Client (state_manager_client.py): Accept explicit api_key parameter at instance-level and per-request, with fallback to context/env. 4. Serverless Resource (serverless.py): Use request context API key for LB endpoint deployments during request processing. This ensures endpoints making remote calls have authenticated access to: - Other endpoints in the Flash application - State Manager GraphQL API for service discovery Tests updated/added to validate API key propagation through all layers. Fixes: AE-2079
Adds makes_remote_calls: True to load-balancer and mothership manifest entries. This ensures ServiceRegistry correctly identifies these endpoint types as needing State Manager queries for cross-endpoint routing. Part of ServiceRegistry empty endpoints fix.
Adds safe-default handling when makes_remote_calls is null/None in manifest. Returns True (allow remote calls) to ensure functionality works even with incomplete manifest data. Also adds unit tests for _check_makes_remote_calls(): - Returns True when makes_remote_calls is True - Returns False when makes_remote_calls is False - Returns True (safe default) when makes_remote_calls is null - Returns True (safe default) when resource not found in manifest Part of ServiceRegistry empty endpoints fix.
Removes runtime reconciliation pattern where mothership endpoints would auto-provision child endpoints on startup. This is replaced by CLI-time provisioning during `flash deploy`. Changes: - Remove reconcile_children(), is_mothership(), get_mothership_url() - Remove runtime startup reconciliation hook from LB handler - Remove is_mothership flag and FLASH_IS_MOTHERSHIP env var - Remove is_explicit flag from manifest (no longer needed) - Delete integration tests for reconciliation behavior - Update unit tests to not expect removed flags - Update deployment.py, preview.py, deploy.py to check resource name All resources are now provisioned upfront by the CLI, enabling a simpler peer-to-peer architecture where endpoints discover each other via State Manager using FLASH_ENVIRONMENT_ID. Part of ServiceRegistry empty endpoints fix.
…ning Updates comments and documentation in test-mothership command to clarify that it uses CLI-time provisioning (not runtime reconciliation). Changes: - Replace 'auto-provisioning' with 'CLI-time provisioning' - Update objective text to remove reconciliation references - Clarify that all provisioning happens before server starts - Remove outdated --auto-provision flag references The test-mothership command behavior is unchanged - it still provisions resources upfront before starting the server. This commit just fixes misleading terminology that suggested runtime reconciliation.
The test-mothership command was designed to test runtime reconciliation, which was removed in commits 880c905 and 881a72f. The command is now obsolete as it only tests CLI provisioning in Docker, which is already covered by 'flash deploy --preview'. Changes: - Delete test_mothership.py command file (457 lines) - Remove FLASH_IS_TEST_MOTHERSHIP flag handling from mothership_provisioner.py - Simplify resource naming logic (removed tmp- prefix logic) - Update comments in manifest.py and scanner.py - Remove documentation references from Flash_Deploy_Guide.md and VERIFICATION.md Users should use 'flash deploy --preview' for local testing instead. Part of ServiceRegistry empty endpoints fix.
…eploy Removing is_mothership from the manifest (880c905) caused config mismatch between local and state manifests. This triggered unnecessary re-provisioning of the mothership on every deploy. When re-provisioning an unchanged endpoint, RunPod returns endpoint_url=None, so the URL is never written to resources_endpoints, breaking the post-deploy guidance display. - Restore is_mothership: True in _create_mothership_resource() - Restore is_mothership: True in _create_mothership_from_explicit() - Restore is_mothership flag check in _display_post_deployment_guidance() alongside the existing name-based check for forwards compatibility
Post-deploy guidance wasn't showing routes because it only iterated over
resources_endpoints dict, which is null/empty until after deployment
provisions resources. Routes exist in the manifest's routes section but
were never displayed.
Add fallback logic to search resources dict for mothership and extract
routes from manifest, even when resources_endpoints is empty. This allows
routes to be displayed pre-deployment and handles custom mothership names
via the is_mothership flag.
- Add fallback loop after primary resources_endpoints check
- Handles both standard ("mothership") and custom names via is_mothership
- Preserves existing behavior when resources_endpoints is populated
Mothership routes were not being added to manifest's routes dict when
using custom naming patterns (e.g., '03_05_load_balancer-mothership').
The route extraction logic only checked for exact names 'mothership'
or 'mothership-entrypoint', missing projects with {project_name}-mothership
naming.
This caused post-deploy guidance to show 'No routes found in manifest'
even though routes were defined in the mothership's functions array.
Change: Use is_mothership flag instead of exact name matching to extract
routes from all mothership resources regardless of naming pattern.
- Replace hardcoded name check with is_mothership flag
- Ensures routes extracted for custom-named mothership resources
- Aligns with how deploy.py already identifies mothership
- Single line change with broad compatibility impact
FastAPI route handlers were never analyzed for calls to @Remote functions, causing makes_remote_calls flag to always be False even when mothership called remote endpoints. Changes: - scanner.py: Add remote_function_names parameter to detect_main_app() - scanner.py: Add _analyze_fastapi_routes_for_remote_calls() function - scanner.py: Add _analyze_function_calls_for_route() function - manifest.py: Pass remote_function_names to detect_main_app() - manifest.py: Calculate makes_remote_calls from analyzed routes Impact: - Mothership with remote calls: makes_remote_calls=True (gets API key) - Mothership without remote calls: makes_remote_calls=False (skips State Manager) - Enables proper API key injection for QB workers - Enables service registry optimization Fixes: AE-2079
…rn type Cleanup unused parameter and fix return value bug in deployment flow. Changes: - Remove mothership_url parameter from create_resource_from_manifest() - Fix deploy_to_environment() to return resources_endpoints dict instead of deploy API response - Update return type annotation from Dict[str, Any] to Dict[str, str] - Add comprehensive docstring with Args/Returns/Raises - Update tests to match new return type expectations Impact: - Cleaner API (no dead parameters) - Correct return values (endpoint URLs instead of API response) - Better type safety and documentation
Add API key injection in preview mode for resources configured with makes_remote_calls=True. This ensures workers can authenticate remote calls to other endpoints. Changes: - Parse makes_remote_calls flag from manifest - Inject RUNPOD_API_KEY env var during container startup - Warn if API key needed but not set - Add integration tests for all three PRD scenarios: 1. Mothership → GPU worker 2. Mothership → CPU worker → GPU worker 3. Mothership local-only Tests validate resource parsing, endpoint routing, API key context, and environment variable fallback behavior.
…ment Fix API key propagation for load-balanced endpoints by checking request context before falling back to environment variables in three locations: 1. RunpodGraphQLClient.__init__() - State Manager and GraphQL clients 2. RunpodRestClient.__init__() - REST API clients 3. ServerlessResource.endpoint property - runpod-python SDK endpoint client Changes establish unified priority: explicit param > request context > env var This fixes remote calls and State Manager queries for load-balanced endpoints that receive API keys via Authorization header rather than environment variables. All three components now automatically pick up per-request API keys from context, enabling multi-endpoint scenarios to work correctly. Fixes scenarios: - Mothership (LB) + GPU Worker (QB) - Mothership (LB) + CPU Worker (QB) + GPU Worker (QB) - Mothership (LB) only (local execution)
- Add FLASH_IS_MOTHERSHIP flag to distinguish mothership from worker endpoints - Extract main_file and app_variable from manifest for runtime provisioning - Store only filename (not absolute path) for container portability - Set FLASH_MAIN_FILE and FLASH_APP_VARIABLE in provisioner environment Fixes deployed mothership endpoints booting in child endpoint mode. Mothership now correctly loads user code from main.py at runtime.
Enhances API key propagation across three layers: 1. Middleware (lb_handler.py): Store keys in both contextvars and request.state for better propagation to async functions and explicit access. 2. Service Registry (service_registry.py): Retrieve API key from request context and pass explicitly to State Manager client. 3. State Manager Client (state_manager_client.py): Accept explicit api_key parameter at instance-level and per-request, with fallback to context/env. 4. Serverless Resource (serverless.py): Use request context API key for LB endpoint deployments during request processing. This ensures endpoints making remote calls have authenticated access to: - Other endpoints in the Flash application - State Manager GraphQL API for service discovery Tests updated/added to validate API key propagation through all layers. Fixes: AE-2079
…ainers Prevent @Remote decorator and remote classes from attempting to deploy resources at runtime in deployed environments. Resources must be deployed via 'flash deploy' CLI command before runtime execution. Changes: - Guard @Remote decorator wrapper with deployment context check - Guard remote class initialization with deployment context check - Add optional allow_runtime_deployment safeguard to ResourceManager (defense-in-depth) - Raise clear error messages when endpoints missing from manifest in deployed env - Preserve local development behavior (on-demand deployment still works) Fixes #2079: Resources were being deployed at runtime when mothership received requests to call remote functions, causing "RUNPOD_API_KEY not available" errors. Test coverage: - 6 new tests verify guard behavior for functions and classes - 966 existing tests pass with no regressions
Update test to match new API key parameter signature. Related to: AE-2079
…turns empty endpoints
Remote function calls were being blocked with "endpoint not found" error when State Manager
returned empty resources_endpoints dict, even though local manifest had valid endpoints.
Root Cause:
- _ensure_manifest_loaded() checked if State Manager returned empty endpoints
- Logged error but didn't fallback to local manifest
- Always set endpoint_registry = {} instead of using local manifest
Solution:
- Add fallback logic when State Manager returns empty endpoints (lines 262-279)
- Check local manifest.resources_endpoints and use if available
- Log warning and fallback endpoint count for debugging
- Preserve existing behavior when State Manager returns valid endpoints
- Keep exception handler unchanged (already had correct fallback)
Test Coverage:
- 7 new unit tests verify all scenarios:
- Empty State Manager + local manifest = fallback works ✓
- Empty State Manager + no local fallback = graceful error ✓
- Valid State Manager = uses State Manager (no regression) ✓
- State Manager exception = local fallback unchanged ✓
- Cache TTL, env var handling, client None = unchanged behavior ✓
Quality:
- All 40 service registry tests pass
- 69.01% overall coverage (exceeds 65% requirement)
- Fully formatted and linted
- Backward compatible
Fixes: Remote function calls no longer fail when State Manager returns empty endpoints
…turns empty endpoints When State Manager query returns empty resources_endpoints: 1. Log a warning about State Manager returning empty 2. Fall back to local manifest resources_endpoints if available 3. Only set endpoint_registry to empty if no local fallback Also handles State Manager exceptions with same fallback behavior. This ensures service discovery works even if State Manager is temporarily unavailable or returns incomplete data, as long as the local manifest has endpoint mappings from a previous successful State Manager query or from preview mode environment variables. Fixes failing test: test_state_manager_empty_endpoints_falls_back_to_local_manifest
Add targeted logging at critical sync points to trace the flow of manifest data between deployment and runtime: Phase 1 (WRITE path - deployment → State Manager): - Log resources_endpoints mapping after building - Log manifest upload with build_id and environment_id - Log GraphQL mutation success and endpoint count - Log environment activation with build_id Phase 2 (READ path - runtime → State Manager): - Log environment_id and both FLASH_ENVIRONMENT_ID and RUNPOD_ENDPOINT_ID - Log manifest keys and resources_endpoints availability - Log environment → activeBuildId mapping - Log endpoint count from State Manager Changes allow identifying: - ID mismatches between write and read paths - Race conditions (environment activated before resources provisioned) - Wrong build queried (activeBuildId points to old build) - Manifest upload failures (mutation succeeded but data not persisted) - Missing resources_endpoints in State Manager manifest This helps debug the RuntimeError: Remote function endpoint not found issue. Files modified: - src/runpod_flash/cli/utils/deployment.py - src/runpod_flash/core/api/runpod.py - src/runpod_flash/runtime/service_registry.py - src/runpod_flash/runtime/state_manager_client.py
Mock ResourceManager to prevent fallback paths and avoid test ordering issues in parallel test execution. Use regular Mock instead of AsyncMock to prevent interaction with object pickling tests. This ensures test_remote_decorator_uses_service_registry_in_deployed_env passes consistently in CI/CD parallel execution environments.
Rename 'mothership_id' to 'flash_environment_id' in StateManagerClient methods to accurately reflect what the parameter represents: - get_persisted_manifest(mothership_id) → get_persisted_manifest(flash_environment_id) - update_resource_state(mothership_id) → update_resource_state(flash_environment_id) - remove_resource_state(mothership_id) → remove_resource_state(flash_environment_id) - _fetch_build_and_manifest(mothership_id) → _fetch_build_and_manifest(flash_environment_id) The parameter represents the Flash environment ID (from FLASH_ENVIRONMENT_ID env var), not the mothership endpoint ID (from RUNPOD_ENDPOINT_ID env var). This naming clarifies the distinction and prevents confusion about which ID is being used.
Two separate issues fixed: 1. **Undefined variable references in state_manager_client.py**: Corrected parameter name from mothership_id to flash_environment_id in debug logging statements. 2. **ContextVar pickling error in parallel test execution**: cloudpickle was attempting to serialize the module-level ContextVar from api_key_context, which cannot be pickled. Added __reduce_ex__ method to BaseResource to use custom pickling that avoids serializing module state, while preserving all Pydantic model state via __getstate__/__setstate__. Fixes: - test_pickled_resource_preserves_id: Now passes - test_cpu_structural_changes_false_positives: Now passes - CI/CD parallel execution: ContextVar pickling error resolved Tests: All 1028 tests pass, coverage: 69.34%
Use __reduce__ instead of __reduce_ex__ for broader compatibility. Explicitly remove cached _cached_resource_id from __getstate__ to prevent pickle protocol variations from causing issues. This resolves ContextVar pickling failures in Python 3.10-3.12 that prevented test_pickled_resource_preserves_id from passing.
Change exception handler in ServiceRegistry._ensure_manifest_loaded() from DEBUG to ERROR level logging when State Manager query fails. This makes failures visible during normal operation instead of requiring debug mode. Previously, State Manager failures were silently caught and logged only at DEBUG level, making them invisible in production. The fallback to local manifest would succeed (hiding the real error), and remote function calls would fail later with 'endpoint not found' errors. Now, when State Manager is unavailable, the actual error is logged at ERROR level with full exception details, making debugging much easier.
…fic types Previously, the retry loop only caught specific exception types: - asyncio.TimeoutError - ManifestServiceUnavailableError - GraphQLError - ConnectionError If any other exception type was raised, it would bypass the retry logic and propagate directly, making it hard to diagnose what went wrong. Now catches Exception to handle all cases, with detailed logging of the actual exception type and message on the final failure attempt.
When StateManagerClient methods are called without an explicit API key, retrieve the API key from the request context (set by lb_handler middleware from the Authorization header) before falling back to the environment variable. This prevents 'RUNPOD_API_KEY environment variable is required' errors from runpod-python SDK when the API key is available in the request context. The lookup chain is now: 1. api_key parameter (explicit) 2. instance api_key (from __init__) 3. request context (from Authorization header via middleware) 4. environment variable (RUNPOD_API_KEY) Fixes mothership State Manager queries in load-balanced deployments. Note: One test (test_remote_decorator_uses_service_registry_in_deployed_env) expects setup that provides API key in context or env.
FLASH_ENVIRONMENT_ID and RUNPOD_ENDPOINT_ID are distinct concepts: - FLASH_ENVIRONMENT_ID: deployment environment set by mothership provisioner - RUNPOD_ENDPOINT_ID: individual container endpoint ID They should not be interchangeable fallbacks. Require explicit FLASH_ENVIRONMENT_ID for State Manager queries.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes "No module named 'torch'" error when
@remotefunctions are called in deployments. The decorator now correctly makes HTTP calls to worker containers instead of executing locally.This PR includes the complete fix plus supporting infrastructure for API key management and endpoint discovery optimization.
Changes
Main Fix (AE-2079)
@remotedecorator to fall back to local executionFLASH_RESOURCES_ENDPOINTSenv var and load endpoint URLs from environmentSupporting Features
API Key Management
makes_remote_callsto skip unnecessary State Manager queries for local-only endpointsRUNPOD_API_KEYfor QB endpoints at deploy timeResource Name Normalization
-fbsuffix andlive-prefix from resource names for manifest lookupsIntegration Tests
Files Changed
preview.py,service_registry.py,models.pyserverless.py,load_balancer_sls_resource.pyclient.pydocs/API_Key_Management.md,CLAUDE.mdtest_api_key_injection.py,test_local_only_endpoint.pyTest Plan
Related