-
-
Notifications
You must be signed in to change notification settings - Fork 724
Add RE2 supervisor #1769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RE2 supervisor #1769
Conversation
|
WalkthroughThis pull request updates an import in the coordinator module and introduces a comprehensive set of new files for a supervisor application. New configuration files, documentation, and package definitions are added alongside a host of TypeScript source files. These files implement workload management, resource monitoring, Kubernetes and Docker client functionality, and a dedicated HTTP/WebSocket server for managing runs and communications. In addition, the core package has been refactored by removing the old checkpoints export and introducing a new server-only module with updated checkpoint schemas and a dedicated checkpoint client. Changes
Sequence Diagram(s)sequenceDiagram
participant MS as ManagedSupervisor
participant HS as HttpServer
participant WS as WorkloadServer
participant WM as WorkloadManager
participant Client as Worker Client
MS->>HS: start()
HS->>HS: Listen for HTTP requests
MS->>WS: start()
WS->>Client: Establish WebSocket connection
Client->>WS: Send run notification (connect/disconnect)
WS->>MS: Forward run event notification
MS->>WM: Invoke create() for workload execution
WM-->>MS: Return workload creation response
sequenceDiagram
participant RC as ManagedRunController
participant CC as CheckpointClient
participant API as Server Endpoint
RC->>CC: suspendRun(run details)
CC->>API: POST suspend request
API-->>CC: Response (ok/failure)
CC-->>RC: Return suspend success/failure
RC->>CC: restoreRun(run details)
CC->>API: POST restore request
API-->>CC: Response (ok/failure)
CC-->>RC: Return restore success/failure
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (26)
apps/supervisor/src/util.ts (2)
3-8
: Consider adding more documentation to this function.This function correctly determines the Docker host domain based on the platform, but adding a comment explaining why different platforms require different values would improve maintainability.
export function getDockerHostDomain() { + // Docker Desktop on macOS and Windows uses 'host.docker.internal' to access host services + // while Linux typically uses 'localhost' or the host's network directly const isMacOs = process.platform === "darwin"; const isWindows = process.platform === "win32"; return isMacOs || isWindows ? "host.docker.internal" : "localhost"; }
10-24
: Consider adding input validation to the IdGenerator class.The class implementation is solid, but adding validation for constructor parameters would make it more robust against potential misuse.
constructor({ alphabet, length, prefix }: { alphabet: string; length: number; prefix: string }) { + if (!alphabet || alphabet.length === 0) { + throw new Error("Alphabet must not be empty"); + } + if (length <= 0) { + throw new Error("Length must be a positive number"); + } this.alphabet = alphabet; this.length = length; this.prefix = prefix; }apps/supervisor/src/workloadManager/types.ts (1)
8-10
: Consider using a more specific return type thanunknown
.Using
unknown
as the return type for thecreate
method reduces type safety. Consider defining a specific return type interface that represents the created workload or its identifier.export interface WorkloadManager { - create: (opts: WorkloadManagerCreateOptions) => Promise<unknown>; + create: (opts: WorkloadManagerCreateOptions) => Promise<WorkloadCreationResult>; } + +export interface WorkloadCreationResult { + id: string; + // Add other relevant properties +}apps/supervisor/src/env.ts (2)
14-18
: Add validation for positive numbers on port and interval values.Port numbers and intervals should be positive numbers. Consider adding a minimum value constraint to these numeric fields.
- TRIGGER_WORKLOAD_API_PORT: z.coerce.number().default(8020), - TRIGGER_WORKLOAD_API_PORT_EXTERNAL: z.coerce.number().default(8020), + TRIGGER_WORKLOAD_API_PORT: z.coerce.number().min(1).max(65535).default(8020), + TRIGGER_WORKLOAD_API_PORT_EXTERNAL: z.coerce.number().min(1).max(65535).default(8020), TRIGGER_WARM_START_URL: z.string().optional(), TRIGGER_CHECKPOINT_URL: z.string().optional(), - TRIGGER_DEQUEUE_INTERVAL_MS: z.coerce.number().int().default(1000), + TRIGGER_DEQUEUE_INTERVAL_MS: z.coerce.number().int().min(1).default(1000),
22-22
: Consider making OTEL_EXPORTER_OTLP_ENDPOINT optional or providing a default.The OTEL_EXPORTER_OTLP_ENDPOINT is required without a default value, which could make local development more difficult. Consider making it optional or providing a sensible default for development environments.
- OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url(), + OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url().optional(),Alternatively, if OpenTelemetry is always required:
- OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url(), + OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url().default("http://localhost:4318"),apps/supervisor/src/workloadManager/docker.ts (1)
47-53
: Error handling could be improvedWhile the method logs Docker command failures, it doesn't propagate the error to the caller or return a failure status. Consider enhancing the error handling to:
- Return a structured response indicating success/failure
- Provide more context about the specific error for debugging
- Consider retries for transient failures
- async create(opts: WorkloadManagerCreateOptions) { + async create(opts: WorkloadManagerCreateOptions): Promise<{ success: boolean; error?: string; runnerId?: string }> { this.logger.log("[DockerWorkloadProvider] Creating container", { opts }); const runnerId = RunnerId.generate(); // ... existing code ... try { const { stdout, stderr } = await x("docker", runArgs); this.logger.debug("[DockerWorkloadProvider] Create succeeded", { stdout, stderr }); + return { success: true, runnerId }; } catch (error) { this.logger.error("[DockerWorkloadProvider] Create failed:", { opts, error }); + return { + success: false, + error: error instanceof Error ? error.message : String(error) + }; } }packages/core/src/v3/serverOnly/checkpointClient.ts (1)
106-136
: Inconsistent error handling between methodsThe
restoreRun
method has less robust error handling compared tosuspendRun
. It doesn't:
- Log success cases (unlike
suspendRun
which logs debug on success)- Attempt to validate or parse the response body
- Handle JSON parsing errors
Consider standardizing the error handling approach between both methods.
async restoreRun({ runFriendlyId, snapshotFriendlyId, checkpoint, }: { runFriendlyId: string; snapshotFriendlyId: string; checkpoint: NonNullable<DequeuedMessage["checkpoint"]>; }): Promise<boolean> { const res = await fetch(this.restoreUrl, { method: "POST", headers: { "Content-Type": "application/json", }, body: JSON.stringify({ type: "DOCKER", containerId: checkpoint.location, } satisfies CheckpointServiceRestoreRequestBodyInput), }); if (!res.ok) { this.logger.error("[CheckpointClient] Restore request failed", { runFriendlyId, snapshotFriendlyId, checkpoint, }); return false; } + this.logger.debug("[CheckpointClient] Restore request success", { + runFriendlyId, + snapshotFriendlyId, + location: checkpoint.location, + status: res.status, + }); + return true; }apps/supervisor/src/clients/kubernetes.ts (2)
4-4
: Use a configurable logger instead ofconsole.log
While this console.log call is harmless in development, consider using a structured logger for maintainability and to keep logging output consistent across environments.function getKubeConfig() { - console.log("getKubeConfig()", { RUNTIME_ENV }); + logger.debug("getKubeConfig()", { RUNTIME_ENV });
25-34
: Provide a more descriptive fallback or default case
UsingassertExhaustive
ensures no unseen environment strings pass, but consider providing a meaningful error message or fallback to prevent abrupt runtime errors if unexpected values occur.default: - assertExhaustive(RUNTIME_ENV); + throw new Error(`Unsupported runtime environment: ${RUNTIME_ENV}`);apps/supervisor/src/workloadManager/kubernetes.ts (3)
22-22
: Consider making the namespace configurable
Hardcoding"default"
as the namespace might limit flexibility in multi-namespace setups. Consider accepting a namespace parameter or environment variable.- private namespace = "default"; + private readonly namespace: string; ... constructor(private opts: WorkloadManagerOptions) { this.namespace = opts.namespace ?? "default"; ... }
60-60
: Document ephemeral storage usage in container spec
Including ephemeral storage in resource requests and limits is great. Ensure that your deployment’s storageclass or ephemeral storage solution is fully configured, so pods do not fail due to ephemeral storage constraints.
60-117
: Validate successful pod creation and report useful errors
Currently, you log the creation attempt and catch errors, but no success logging is present. Consider adding a log statement or status check aftercreateNamespacedPod
.await this.k8s.core.createNamespacedPod({ namespace: this.namespace, ... }); +this.logger.log("[KubernetesWorkloadManager] Pod created successfully", { + runnerId, + namespace: this.namespace, + image: opts.image, +});packages/core/src/v3/serverOnly/httpServer.ts (5)
67-78
: Implement trace or debug logging for request details
You are logging at the info/error level, but you might gain insights with additional debug or trace logs for request payloads, headers, or more granular access logs.
92-97
: Return more descriptive error body for unhandled routes
Currently, only an empty 404 response is returned. Consider returning a JSON body or text message notifying the client the route does not exist.if (!route) { - return reply.empty(404); + return reply.json({ error: "Route Not Found" }, false, 404); }
104-111
: Clarify method mismatch response
For a route that exists but does not support a certain method, returning an empty 405 might be unclear to clients. A text or JSON response could improve discoverability.if (!routeDefinition) { - return reply.empty(405); + return reply.text(`Method ${httpMethod.data} not allowed for route ${route}`, 405); }
131-142
: Expose Zod validation errors to the caller if appropriate
For an invalid body, you simply respond with"Invalid body"
. Dev or test environments might benefit from seeing the entire validation error. Consider returning partial or detailed error data, or logging it in debug mode.
228-259
: Confirm URL path parsing logic for dynamic routes
The route matching compares segment lengths strictly. This may cause edge cases if you desire routes with optional parts or trailing slashes. Evaluate if that is acceptable or if you need a more flexible matching system.apps/supervisor/src/index.ts (3)
24-24
: Consider renaming the logger context to align with the class name.You're using
"managed-worker"
as the logger context inside theManagedSupervisor
class. While this is harmless, changing it to something more aligned with the class name (e.g.,"managed-supervisor"
) might improve clarity when filtering logs.
66-73
: Consider making the skipDequeue threshold configurable.At lines 72–73, runs are skipped if either CPU or memory availability is below 0.25. This value is hard-coded and might not be optimal in all environments. Making these thresholds configurable via environment variables or application settings could offer more fine-grained control over resource usage.
202-236
: Add a try/catch block around the fetch call for warm starts.If
fetch
to the warm start URL fails (for example, due to network errors), it could throw an unhandled exception. Wrapping the request in a try/catch block would prevent unhandled rejections and allow for more graceful error logging or fallback handling.apps/supervisor/src/workloadServer/index.ts (2)
107-135
: Return more descriptive error responses.When run attempts fail to start or complete, you return an empty 500 response (
reply.empty(500)
). Including a JSON body with an error message could help clients diagnose issues more quickly.- console.error("Failed to start run"); - reply.empty(500); + console.error("Failed to start run", { error: startResponse.error }); + reply.json( + { + ok: false, + error: startResponse.error?.message ?? "Failed to start run", + }, + false, + 500 + );
196-257
: Provide a clearer suspend operation lifecycle.Currently, you immediately return 202 with
{ ok: true }
before the container is actually suspended. This is a valid asynchronous pattern, but consider returning a link to check status or only returning success after confirming suspension, guiding clients on how to verify it.apps/supervisor/src/resourceMonitor.ts (4)
22-31
: Consider concurrency safety and negative resource availability checks.Currently,
cachedResources
can be decremented without a corresponding check to ensure it doesn't go negative. Concurrent calls toblockResources
might race, especially if multiple threads/processes or asynchronous tasks attempt to block resources at once. Consider adding locking or other concurrency controls, and ensure resource counters never drop below zero in multi-threaded scenarios.
103-152
: Optimize container inspection for large Docker fleets.In
DockerResourceMonitor.getNodeResources
, container inspection happens sequentially in a loop. If there are many containers, this can become a bottleneck. Consider usingPromise.all
to parallelize these inspections, or gathering necessary container metadata in a single API call if the Docker API permits.
154-209
: Assess scalability for cluster-wide queries.
KubernetesResourceMonitor.getNodeResources
retrieves all running pods on a node. For clusters with a high pod count, this could be large. Confirm that thelistPodForAllNamespaces
call won't cause performance issues and consider using more targeted queries or caching if necessary.
211-246
: Handle additional suffixes and validate parseability.The Kubernetes parser handles “Ki”, “Mi”, and “Gi” memory suffixes and “m” for CPU. However, Kubernetes can also provide larger or different suffixes (e.g., “Ti”, uppercase variants, etc.). Consider implementing logic for these, or at least ensuring unrecognized suffixes are handled gracefully. Also note that
parseInt
may discard decimal portions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml
is excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (23)
apps/coordinator/src/checkpointer.ts
(1 hunks)apps/supervisor/.env.example
(1 hunks)apps/supervisor/.nvmrc
(1 hunks)apps/supervisor/README.md
(1 hunks)apps/supervisor/package.json
(1 hunks)apps/supervisor/src/clients/kubernetes.ts
(1 hunks)apps/supervisor/src/env.ts
(1 hunks)apps/supervisor/src/index.ts
(1 hunks)apps/supervisor/src/resourceMonitor.ts
(1 hunks)apps/supervisor/src/util.ts
(1 hunks)apps/supervisor/src/workloadManager/docker.ts
(1 hunks)apps/supervisor/src/workloadManager/kubernetes.ts
(1 hunks)apps/supervisor/src/workloadManager/types.ts
(1 hunks)apps/supervisor/src/workloadServer/index.ts
(1 hunks)apps/supervisor/tsconfig.json
(1 hunks)packages/cli-v3/src/entryPoints/managed-run-controller.ts
(2 hunks)packages/core/package.json
(3 hunks)packages/core/src/v3/checkpoints/index.ts
(0 hunks)packages/core/src/v3/schemas/checkpoints.ts
(1 hunks)packages/core/src/v3/schemas/index.ts
(1 hunks)packages/core/src/v3/serverOnly/checkpointClient.ts
(1 hunks)packages/core/src/v3/serverOnly/httpServer.ts
(1 hunks)packages/core/src/v3/serverOnly/index.ts
(1 hunks)
💤 Files with no reviewable changes (1)
- packages/core/src/v3/checkpoints/index.ts
✅ Files skipped from review due to trivial changes (7)
- packages/core/src/v3/serverOnly/index.ts
- apps/supervisor/tsconfig.json
- apps/supervisor/package.json
- apps/coordinator/src/checkpointer.ts
- apps/supervisor/.nvmrc
- apps/supervisor/.env.example
- apps/supervisor/README.md
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (18)
packages/core/src/v3/schemas/index.ts (1)
15-15
: Added export for checkpoint schemas looks good.This new export properly follows the established pattern in the file and aligns with the PR's objective of restructuring the checkpoint functionality as part of replacing providers and coordinators.
apps/supervisor/src/util.ts (1)
26-30
: RunnerId configuration looks good.The configuration uses an appropriate alphabet (excluding easily confused characters like '0'), sufficient length for uniqueness, and a clear prefix.
apps/supervisor/src/workloadManager/types.ts (1)
3-6
: Well-structured interfaces with clear requirements.Both
WorkloadManagerOptions
andWorkloadManagerCreateOptions
are well-designed with appropriate required and optional properties. The comprehensive set of identifiers inWorkloadManagerCreateOptions
provides good context for workload creation.Also applies to: 12-26
packages/cli-v3/src/entryPoints/managed-run-controller.ts (2)
694-700
: Required warm start URL enhances reliabilityThe code now requires
TRIGGER_WARM_START_URL
to be explicitly defined rather than falling back to constructing a URL fromTRIGGER_WORKER_API_URL
. This approach provides better clarity and reliability by:
- Making the warm start dependency explicit
- Failing fast with a clear error message when the required environment variable isn't available
- Gracefully shutting down with exit code 0 when warm starts are disabled
This change supports the PR's goal of replacing the coordinator and providers with a more structured approach.
714-715
: Documentation for future enhancementThis TODO comment indicates a planned improvement to retrieve configuration values from the warm start service rather than using environment variables directly. This aligns with good practice of centralizing configuration management.
packages/core/package.json (3)
62-62
: New serverOnly export addresses client bundle concernsAdding the
./v3/serverOnly
export to the module mapping is the first step in implementing the PR's goal of creating a dedicated server-only interface to prevent client bundle issues in the future.
181-182
: Type definitions for new serverOnly moduleThe addition of the TypeScript type definitions for the serverOnly module ensures proper type safety when importing from this new module path.
666-675
: Complete implementation of serverOnly module exportsThe export configuration ensures the new serverOnly module is properly accessible through both ESM and CommonJS imports with the correct type definitions. This implementation effectively replaces the previous checkpoints export while maintaining consistent module structure.
apps/supervisor/src/workloadManager/docker.ts (2)
11-15
: Well-structured Docker workload manager implementationThe
DockerWorkloadManager
class correctly implements theWorkloadManager
interface and uses structured logging for improved observability. The constructor accepts configuration options, following good OOP practices.
16-43
: Comprehensive Docker container configurationThe
create
method effectively builds a Docker container configuration with:
- Network settings
- Essential environment variables for the Trigger.dev ecosystem
- Conditional warm start URL configuration
- Optional resource constraints based on machine presets
The implementation is thorough and well-structured, accounting for different deployment scenarios.
packages/core/src/v3/serverOnly/checkpointClient.ts (2)
15-32
: Well-designed client for checkpoint operationsThe
CheckpointClient
class provides a clean API for interacting with the checkpoint service. It:
- Uses a structured logger for consistent logging
- Encapsulates URL construction with getter methods
- Takes dependencies via constructor injection for better testability
This design follows good software engineering principles.
33-104
: Comprehensive suspend operation with error handlingThe
suspendRun
method:
- Builds the appropriate request with callback URLs
- Sends a POST request to the checkpoint service
- Validates the response against a schema
- Returns a boolean indicating success
- Logs errors at appropriate detail levels
The implementation handles network errors, invalid responses, and schema validation failures correctly.
apps/supervisor/src/workloadManager/kubernetes.ts (2)
12-13
: Ensure ephemeral storage is adequate for workloads
You are requesting 2Gi and limiting to 10Gi of ephemeral storage by default. Verify these values meet your workloads’ usage patterns, especially for builds or large intermediate files.
152-162
: Handle unknown environment types
While you mapPRODUCTION
,STAGING
,DEVELOPMENT
, andPREVIEW
, you don’t guard against additional or unexpected environment types at runtime.#envTypeToLabelValue(type: EnvironmentType) { switch (type) { case "PRODUCTION": return "prod"; case "STAGING": return "stg"; case "DEVELOPMENT": return "dev"; case "PREVIEW": return "preview"; + default: + logger.warn(`Unrecognized environment type: ${type}`); + return "unknown"; } }apps/supervisor/src/index.ts (1)
41-45
: Verify that passing an empty string as the Kubernetes namespace is intended.In the
KubernetesResourceMonitor(createK8sApi(), "")
call, an empty string for the namespace might cause the resource monitor to fall back to default namespaces. If this is unintended, consider parameterizing or reading the namespace from the environment.Would you like me to generate a shell script to search for references to this namespace usage?
packages/core/src/v3/schemas/checkpoints.ts (1)
1-59
: All schemas and type definitions appear well-structured and clear.These Zod schemas comprehensively validate checkpoint requests and responses. The discriminated union for Docker vs. Kubernetes smoothly handles distinct container types, and the callback URL transformation is succinct. Nice work!
apps/supervisor/src/workloadServer/index.ts (1)
402-505
: Confirm that only one run can share a socket at any time.The logic in
runConnected
removes any existing run from the socket. If two runs with the same ID or the same socket concurrency is expected, ensure no race conditions occur when connecting or disconnecting. Otherwise, you might inadvertently override or remove a run association.apps/supervisor/src/resourceMonitor.ts (1)
41-51
: Validate requested resources before blocking.
blockResources
subtracts CPU and memory fromcpuAvailable
andmemoryAvailable
without checking if there's enough capacity. This could lead to negative resource values. To provide safer assumptions for downstream logic, consider validating (e.g., returning early or throwing an error) if the capacity is insufficient.
This will replace all providers and the coordinator. Also adds a new
core/v3/serverOnly
export to prevent future issues with client bundles.Summary by CodeRabbit
New Features
Documentation