Skip to content

Add RE2 supervisor #1769

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 6, 2025
Merged

Add RE2 supervisor #1769

merged 4 commits into from
Mar 6, 2025

Conversation

nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Mar 6, 2025

This will replace all providers and the coordinator. Also adds a new core/v3/serverOnly export to prevent future issues with client bundles.

Summary by CodeRabbit

  • New Features

    • Enhanced workload management with real-time notifications, supporting both containerized and cluster deployments.
    • Improved resource monitoring and checkpoint operations for more robust workload supervision.
    • Streamlined startup and error handling for smoother system operations.
  • Documentation

    • Introduced a comprehensive setup guide and detailed configuration examples.
    • Added version management files to ensure consistent development environments.

Copy link

changeset-bot bot commented Mar 6, 2025

⚠️ No Changeset found

Latest commit: eedc52b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Contributor

coderabbitai bot commented Mar 6, 2025

Walkthrough

This pull request updates an import in the coordinator module and introduces a comprehensive set of new files for a supervisor application. New configuration files, documentation, and package definitions are added alongside a host of TypeScript source files. These files implement workload management, resource monitoring, Kubernetes and Docker client functionality, and a dedicated HTTP/WebSocket server for managing runs and communications. In addition, the core package has been refactored by removing the old checkpoints export and introducing a new server-only module with updated checkpoint schemas and a dedicated checkpoint client.

Changes

File(s) Change Summary
apps/coordinator/src/checkpointer.ts Updated import source for testDockerCheckpoint from @trigger.dev/core/v3/checkpoints to @trigger.dev/core/v3/serverOnly.
apps/supervisor/{.env.example, .nvmrc, README.md, package.json, tsconfig.json} Added new configuration, documentation, Node.js version specification, package metadata, and TypeScript config for the supervisor application.
apps/supervisor/src/{clients/kubernetes.ts, env.ts, index.ts, resourceMonitor.ts, util.ts, workloadManager/*, workloadServer/index.ts} Introduced new files for Kubernetes client integration, environment schema, supervisor orchestration (ManagedSupervisor), resource monitoring (Docker & Kubernetes), utility functions, workload managers for Docker and Kubernetes, and an HTTP/WebSocket server for workload communications.
packages/cli-v3/src/entryPoints/managed-run-controller.ts Modified warm start URL handling in the waitForNextRun method to explicitly check for env.TRIGGER_WARM_START_URL and exit if undefined.
packages/core/package.json Removed exports and type definitions for ./v3/checkpoints and added a new ./v3/serverOnly module with updated paths and types.
packages/core/src/v3/checkpoints/index.ts Removed the export of checkpoint functionalities.
packages/core/src/v3/schemas/{checkpoints.ts, index.ts} Added new Zod schemas and types for checkpoint service requests/responses and updated the export interface.
packages/core/src/v3/serverOnly/{checkpointClient.ts, httpServer.ts, index.ts} Introduced a new server-only module with a CheckpointClient for managing checkpoint operations, an HttpServer for handling HTTP requests, and consolidated exports for server-side functionality.

Sequence Diagram(s)

sequenceDiagram
    participant MS as ManagedSupervisor
    participant HS as HttpServer
    participant WS as WorkloadServer
    participant WM as WorkloadManager
    participant Client as Worker Client

    MS->>HS: start()
    HS->>HS: Listen for HTTP requests
    MS->>WS: start()
    WS->>Client: Establish WebSocket connection
    Client->>WS: Send run notification (connect/disconnect)
    WS->>MS: Forward run event notification
    MS->>WM: Invoke create() for workload execution
    WM-->>MS: Return workload creation response
Loading
sequenceDiagram
    participant RC as ManagedRunController
    participant CC as CheckpointClient
    participant API as Server Endpoint

    RC->>CC: suspendRun(run details)
    CC->>API: POST suspend request
    API-->>CC: Response (ok/failure)
    CC-->>RC: Return suspend success/failure

    RC->>CC: restoreRun(run details)
    CC->>API: POST restore request
    API-->>CC: Response (ok/failure)
    CC-->>RC: Return restore success/failure
Loading

Poem

I'm a bunny in the code, hopping with delight,
New files and features make my day so bright.
Workloads and checkpoints now dance in a row,
With HTTP and WebSocket, the connections flow.
I nibble on code like fresh, crunchy hay,
Celebrating these changes in a joyful bunny way!
🐇💻

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (26)
apps/supervisor/src/util.ts (2)

3-8: Consider adding more documentation to this function.

This function correctly determines the Docker host domain based on the platform, but adding a comment explaining why different platforms require different values would improve maintainability.

 export function getDockerHostDomain() {
+  // Docker Desktop on macOS and Windows uses 'host.docker.internal' to access host services
+  // while Linux typically uses 'localhost' or the host's network directly
   const isMacOs = process.platform === "darwin";
   const isWindows = process.platform === "win32";

   return isMacOs || isWindows ? "host.docker.internal" : "localhost";
 }

10-24: Consider adding input validation to the IdGenerator class.

The class implementation is solid, but adding validation for constructor parameters would make it more robust against potential misuse.

 constructor({ alphabet, length, prefix }: { alphabet: string; length: number; prefix: string }) {
+  if (!alphabet || alphabet.length === 0) {
+    throw new Error("Alphabet must not be empty");
+  }
+  if (length <= 0) {
+    throw new Error("Length must be a positive number");
+  }
   this.alphabet = alphabet;
   this.length = length;
   this.prefix = prefix;
 }
apps/supervisor/src/workloadManager/types.ts (1)

8-10: Consider using a more specific return type than unknown.

Using unknown as the return type for the create method reduces type safety. Consider defining a specific return type interface that represents the created workload or its identifier.

 export interface WorkloadManager {
-  create: (opts: WorkloadManagerCreateOptions) => Promise<unknown>;
+  create: (opts: WorkloadManagerCreateOptions) => Promise<WorkloadCreationResult>;
 }
+
+export interface WorkloadCreationResult {
+  id: string;
+  // Add other relevant properties
+}
apps/supervisor/src/env.ts (2)

14-18: Add validation for positive numbers on port and interval values.

Port numbers and intervals should be positive numbers. Consider adding a minimum value constraint to these numeric fields.

-  TRIGGER_WORKLOAD_API_PORT: z.coerce.number().default(8020),
-  TRIGGER_WORKLOAD_API_PORT_EXTERNAL: z.coerce.number().default(8020),
+  TRIGGER_WORKLOAD_API_PORT: z.coerce.number().min(1).max(65535).default(8020),
+  TRIGGER_WORKLOAD_API_PORT_EXTERNAL: z.coerce.number().min(1).max(65535).default(8020),
   TRIGGER_WARM_START_URL: z.string().optional(),
   TRIGGER_CHECKPOINT_URL: z.string().optional(),
-  TRIGGER_DEQUEUE_INTERVAL_MS: z.coerce.number().int().default(1000),
+  TRIGGER_DEQUEUE_INTERVAL_MS: z.coerce.number().int().min(1).default(1000),

22-22: Consider making OTEL_EXPORTER_OTLP_ENDPOINT optional or providing a default.

The OTEL_EXPORTER_OTLP_ENDPOINT is required without a default value, which could make local development more difficult. Consider making it optional or providing a sensible default for development environments.

-  OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url(),
+  OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url().optional(),

Alternatively, if OpenTelemetry is always required:

-  OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url(),
+  OTEL_EXPORTER_OTLP_ENDPOINT: z.string().url().default("http://localhost:4318"),
apps/supervisor/src/workloadManager/docker.ts (1)

47-53: Error handling could be improved

While the method logs Docker command failures, it doesn't propagate the error to the caller or return a failure status. Consider enhancing the error handling to:

  1. Return a structured response indicating success/failure
  2. Provide more context about the specific error for debugging
  3. Consider retries for transient failures
-  async create(opts: WorkloadManagerCreateOptions) {
+  async create(opts: WorkloadManagerCreateOptions): Promise<{ success: boolean; error?: string; runnerId?: string }> {
     this.logger.log("[DockerWorkloadProvider] Creating container", { opts });

     const runnerId = RunnerId.generate();
     // ... existing code ...

     try {
       const { stdout, stderr } = await x("docker", runArgs);
       this.logger.debug("[DockerWorkloadProvider] Create succeeded", { stdout, stderr });
+      return { success: true, runnerId };
     } catch (error) {
       this.logger.error("[DockerWorkloadProvider] Create failed:", { opts, error });
+      return { 
+        success: false, 
+        error: error instanceof Error ? error.message : String(error)
+      };
     }
   }
packages/core/src/v3/serverOnly/checkpointClient.ts (1)

106-136: Inconsistent error handling between methods

The restoreRun method has less robust error handling compared to suspendRun. It doesn't:

  1. Log success cases (unlike suspendRun which logs debug on success)
  2. Attempt to validate or parse the response body
  3. Handle JSON parsing errors

Consider standardizing the error handling approach between both methods.

 async restoreRun({
   runFriendlyId,
   snapshotFriendlyId,
   checkpoint,
 }: {
   runFriendlyId: string;
   snapshotFriendlyId: string;
   checkpoint: NonNullable<DequeuedMessage["checkpoint"]>;
 }): Promise<boolean> {
   const res = await fetch(this.restoreUrl, {
     method: "POST",
     headers: {
       "Content-Type": "application/json",
     },
     body: JSON.stringify({
       type: "DOCKER",
       containerId: checkpoint.location,
     } satisfies CheckpointServiceRestoreRequestBodyInput),
   });

   if (!res.ok) {
     this.logger.error("[CheckpointClient] Restore request failed", {
       runFriendlyId,
       snapshotFriendlyId,
       checkpoint,
     });
     return false;
   }

+  this.logger.debug("[CheckpointClient] Restore request success", {
+    runFriendlyId,
+    snapshotFriendlyId,
+    location: checkpoint.location,
+    status: res.status,
+  });
+
   return true;
 }
apps/supervisor/src/clients/kubernetes.ts (2)

4-4: Use a configurable logger instead of console.log
While this console.log call is harmless in development, consider using a structured logger for maintainability and to keep logging output consistent across environments.

 function getKubeConfig() {
-  console.log("getKubeConfig()", { RUNTIME_ENV });
+  logger.debug("getKubeConfig()", { RUNTIME_ENV });

25-34: Provide a more descriptive fallback or default case
Using assertExhaustive ensures no unseen environment strings pass, but consider providing a meaningful error message or fallback to prevent abrupt runtime errors if unexpected values occur.

 default:
-  assertExhaustive(RUNTIME_ENV);
+  throw new Error(`Unsupported runtime environment: ${RUNTIME_ENV}`);
apps/supervisor/src/workloadManager/kubernetes.ts (3)

22-22: Consider making the namespace configurable
Hardcoding "default" as the namespace might limit flexibility in multi-namespace setups. Consider accepting a namespace parameter or environment variable.

- private namespace = "default";
+ private readonly namespace: string;
...
constructor(private opts: WorkloadManagerOptions) {
  this.namespace = opts.namespace ?? "default";
  ...
}

60-60: Document ephemeral storage usage in container spec
Including ephemeral storage in resource requests and limits is great. Ensure that your deployment’s storageclass or ephemeral storage solution is fully configured, so pods do not fail due to ephemeral storage constraints.


60-117: Validate successful pod creation and report useful errors
Currently, you log the creation attempt and catch errors, but no success logging is present. Consider adding a log statement or status check after createNamespacedPod.

 await this.k8s.core.createNamespacedPod({
   namespace: this.namespace,
   ...
 });
+this.logger.log("[KubernetesWorkloadManager] Pod created successfully", {
+  runnerId,
+  namespace: this.namespace,
+  image: opts.image,
+});
packages/core/src/v3/serverOnly/httpServer.ts (5)

67-78: Implement trace or debug logging for request details
You are logging at the info/error level, but you might gain insights with additional debug or trace logs for request payloads, headers, or more granular access logs.


92-97: Return more descriptive error body for unhandled routes
Currently, only an empty 404 response is returned. Consider returning a JSON body or text message notifying the client the route does not exist.

 if (!route) {
-  return reply.empty(404);
+  return reply.json({ error: "Route Not Found" }, false, 404);
 }

104-111: Clarify method mismatch response
For a route that exists but does not support a certain method, returning an empty 405 might be unclear to clients. A text or JSON response could improve discoverability.

 if (!routeDefinition) {
-  return reply.empty(405);
+  return reply.text(`Method ${httpMethod.data} not allowed for route ${route}`, 405);
 }

131-142: Expose Zod validation errors to the caller if appropriate
For an invalid body, you simply respond with "Invalid body". Dev or test environments might benefit from seeing the entire validation error. Consider returning partial or detailed error data, or logging it in debug mode.


228-259: Confirm URL path parsing logic for dynamic routes
The route matching compares segment lengths strictly. This may cause edge cases if you desire routes with optional parts or trailing slashes. Evaluate if that is acceptable or if you need a more flexible matching system.

apps/supervisor/src/index.ts (3)

24-24: Consider renaming the logger context to align with the class name.

You're using "managed-worker" as the logger context inside the ManagedSupervisor class. While this is harmless, changing it to something more aligned with the class name (e.g., "managed-supervisor") might improve clarity when filtering logs.


66-73: Consider making the skipDequeue threshold configurable.

At lines 72–73, runs are skipped if either CPU or memory availability is below 0.25. This value is hard-coded and might not be optimal in all environments. Making these thresholds configurable via environment variables or application settings could offer more fine-grained control over resource usage.


202-236: Add a try/catch block around the fetch call for warm starts.

If fetch to the warm start URL fails (for example, due to network errors), it could throw an unhandled exception. Wrapping the request in a try/catch block would prevent unhandled rejections and allow for more graceful error logging or fallback handling.

apps/supervisor/src/workloadServer/index.ts (2)

107-135: Return more descriptive error responses.

When run attempts fail to start or complete, you return an empty 500 response (reply.empty(500)). Including a JSON body with an error message could help clients diagnose issues more quickly.

- console.error("Failed to start run");
- reply.empty(500);
+ console.error("Failed to start run", { error: startResponse.error });
+ reply.json(
+   {
+     ok: false,
+     error: startResponse.error?.message ?? "Failed to start run",
+   },
+   false,
+   500
+ );

196-257: Provide a clearer suspend operation lifecycle.

Currently, you immediately return 202 with { ok: true } before the container is actually suspended. This is a valid asynchronous pattern, but consider returning a link to check status or only returning success after confirming suspension, guiding clients on how to verify it.

apps/supervisor/src/resourceMonitor.ts (4)

22-31: Consider concurrency safety and negative resource availability checks.

Currently, cachedResources can be decremented without a corresponding check to ensure it doesn't go negative. Concurrent calls to blockResources might race, especially if multiple threads/processes or asynchronous tasks attempt to block resources at once. Consider adding locking or other concurrency controls, and ensure resource counters never drop below zero in multi-threaded scenarios.


103-152: Optimize container inspection for large Docker fleets.

In DockerResourceMonitor.getNodeResources, container inspection happens sequentially in a loop. If there are many containers, this can become a bottleneck. Consider using Promise.all to parallelize these inspections, or gathering necessary container metadata in a single API call if the Docker API permits.


154-209: Assess scalability for cluster-wide queries.

KubernetesResourceMonitor.getNodeResources retrieves all running pods on a node. For clusters with a high pod count, this could be large. Confirm that the listPodForAllNamespaces call won't cause performance issues and consider using more targeted queries or caching if necessary.


211-246: Handle additional suffixes and validate parseability.

The Kubernetes parser handles “Ki”, “Mi”, and “Gi” memory suffixes and “m” for CPU. However, Kubernetes can also provide larger or different suffixes (e.g., “Ti”, uppercase variants, etc.). Consider implementing logic for these, or at least ensuring unrecognized suffixes are handled gracefully. Also note that parseInt may discard decimal portions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf4c4d7 and eedc52b.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (23)
  • apps/coordinator/src/checkpointer.ts (1 hunks)
  • apps/supervisor/.env.example (1 hunks)
  • apps/supervisor/.nvmrc (1 hunks)
  • apps/supervisor/README.md (1 hunks)
  • apps/supervisor/package.json (1 hunks)
  • apps/supervisor/src/clients/kubernetes.ts (1 hunks)
  • apps/supervisor/src/env.ts (1 hunks)
  • apps/supervisor/src/index.ts (1 hunks)
  • apps/supervisor/src/resourceMonitor.ts (1 hunks)
  • apps/supervisor/src/util.ts (1 hunks)
  • apps/supervisor/src/workloadManager/docker.ts (1 hunks)
  • apps/supervisor/src/workloadManager/kubernetes.ts (1 hunks)
  • apps/supervisor/src/workloadManager/types.ts (1 hunks)
  • apps/supervisor/src/workloadServer/index.ts (1 hunks)
  • apps/supervisor/tsconfig.json (1 hunks)
  • packages/cli-v3/src/entryPoints/managed-run-controller.ts (2 hunks)
  • packages/core/package.json (3 hunks)
  • packages/core/src/v3/checkpoints/index.ts (0 hunks)
  • packages/core/src/v3/schemas/checkpoints.ts (1 hunks)
  • packages/core/src/v3/schemas/index.ts (1 hunks)
  • packages/core/src/v3/serverOnly/checkpointClient.ts (1 hunks)
  • packages/core/src/v3/serverOnly/httpServer.ts (1 hunks)
  • packages/core/src/v3/serverOnly/index.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • packages/core/src/v3/checkpoints/index.ts
✅ Files skipped from review due to trivial changes (7)
  • packages/core/src/v3/serverOnly/index.ts
  • apps/supervisor/tsconfig.json
  • apps/supervisor/package.json
  • apps/coordinator/src/checkpointer.ts
  • apps/supervisor/.nvmrc
  • apps/supervisor/.env.example
  • apps/supervisor/README.md
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (18)
packages/core/src/v3/schemas/index.ts (1)

15-15: Added export for checkpoint schemas looks good.

This new export properly follows the established pattern in the file and aligns with the PR's objective of restructuring the checkpoint functionality as part of replacing providers and coordinators.

apps/supervisor/src/util.ts (1)

26-30: RunnerId configuration looks good.

The configuration uses an appropriate alphabet (excluding easily confused characters like '0'), sufficient length for uniqueness, and a clear prefix.

apps/supervisor/src/workloadManager/types.ts (1)

3-6: Well-structured interfaces with clear requirements.

Both WorkloadManagerOptions and WorkloadManagerCreateOptions are well-designed with appropriate required and optional properties. The comprehensive set of identifiers in WorkloadManagerCreateOptions provides good context for workload creation.

Also applies to: 12-26

packages/cli-v3/src/entryPoints/managed-run-controller.ts (2)

694-700: Required warm start URL enhances reliability

The code now requires TRIGGER_WARM_START_URL to be explicitly defined rather than falling back to constructing a URL from TRIGGER_WORKER_API_URL. This approach provides better clarity and reliability by:

  1. Making the warm start dependency explicit
  2. Failing fast with a clear error message when the required environment variable isn't available
  3. Gracefully shutting down with exit code 0 when warm starts are disabled

This change supports the PR's goal of replacing the coordinator and providers with a more structured approach.


714-715: Documentation for future enhancement

This TODO comment indicates a planned improvement to retrieve configuration values from the warm start service rather than using environment variables directly. This aligns with good practice of centralizing configuration management.

packages/core/package.json (3)

62-62: New serverOnly export addresses client bundle concerns

Adding the ./v3/serverOnly export to the module mapping is the first step in implementing the PR's goal of creating a dedicated server-only interface to prevent client bundle issues in the future.


181-182: Type definitions for new serverOnly module

The addition of the TypeScript type definitions for the serverOnly module ensures proper type safety when importing from this new module path.


666-675: Complete implementation of serverOnly module exports

The export configuration ensures the new serverOnly module is properly accessible through both ESM and CommonJS imports with the correct type definitions. This implementation effectively replaces the previous checkpoints export while maintaining consistent module structure.

apps/supervisor/src/workloadManager/docker.ts (2)

11-15: Well-structured Docker workload manager implementation

The DockerWorkloadManager class correctly implements the WorkloadManager interface and uses structured logging for improved observability. The constructor accepts configuration options, following good OOP practices.


16-43: Comprehensive Docker container configuration

The create method effectively builds a Docker container configuration with:

  1. Network settings
  2. Essential environment variables for the Trigger.dev ecosystem
  3. Conditional warm start URL configuration
  4. Optional resource constraints based on machine presets

The implementation is thorough and well-structured, accounting for different deployment scenarios.

packages/core/src/v3/serverOnly/checkpointClient.ts (2)

15-32: Well-designed client for checkpoint operations

The CheckpointClient class provides a clean API for interacting with the checkpoint service. It:

  1. Uses a structured logger for consistent logging
  2. Encapsulates URL construction with getter methods
  3. Takes dependencies via constructor injection for better testability

This design follows good software engineering principles.


33-104: Comprehensive suspend operation with error handling

The suspendRun method:

  1. Builds the appropriate request with callback URLs
  2. Sends a POST request to the checkpoint service
  3. Validates the response against a schema
  4. Returns a boolean indicating success
  5. Logs errors at appropriate detail levels

The implementation handles network errors, invalid responses, and schema validation failures correctly.

apps/supervisor/src/workloadManager/kubernetes.ts (2)

12-13: Ensure ephemeral storage is adequate for workloads
You are requesting 2Gi and limiting to 10Gi of ephemeral storage by default. Verify these values meet your workloads’ usage patterns, especially for builds or large intermediate files.


152-162: Handle unknown environment types
While you map PRODUCTION, STAGING, DEVELOPMENT, and PREVIEW, you don’t guard against additional or unexpected environment types at runtime.

  #envTypeToLabelValue(type: EnvironmentType) {
    switch (type) {
      case "PRODUCTION":
        return "prod";
      case "STAGING":
        return "stg";
      case "DEVELOPMENT":
        return "dev";
      case "PREVIEW":
        return "preview";
+     default:
+       logger.warn(`Unrecognized environment type: ${type}`);
+       return "unknown";
    }
  }
apps/supervisor/src/index.ts (1)

41-45: Verify that passing an empty string as the Kubernetes namespace is intended.

In the KubernetesResourceMonitor(createK8sApi(), "") call, an empty string for the namespace might cause the resource monitor to fall back to default namespaces. If this is unintended, consider parameterizing or reading the namespace from the environment.

Would you like me to generate a shell script to search for references to this namespace usage?

packages/core/src/v3/schemas/checkpoints.ts (1)

1-59: All schemas and type definitions appear well-structured and clear.

These Zod schemas comprehensively validate checkpoint requests and responses. The discriminated union for Docker vs. Kubernetes smoothly handles distinct container types, and the callback URL transformation is succinct. Nice work!

apps/supervisor/src/workloadServer/index.ts (1)

402-505: Confirm that only one run can share a socket at any time.

The logic in runConnected removes any existing run from the socket. If two runs with the same ID or the same socket concurrency is expected, ensure no race conditions occur when connecting or disconnecting. Otherwise, you might inadvertently override or remove a run association.

apps/supervisor/src/resourceMonitor.ts (1)

41-51: Validate requested resources before blocking.

blockResources subtracts CPU and memory from cpuAvailable and memoryAvailable without checking if there's enough capacity. This could lead to negative resource values. To provide safer assumptions for downstream logic, consider validating (e.g., returning early or throwing an error) if the capacity is insufficient.

@nicktrn nicktrn merged commit 94fb532 into main Mar 6, 2025
10 of 14 checks passed
@nicktrn nicktrn deleted the feat/supervisor branch March 6, 2025 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants