🤖 ci: reduce nightly bench to twice-weekly, add Daytona sandbox cleanup #2690
ibetitsmike merged 4 commits into main
- Change cron schedule from twice daily (10:00/22:00 UTC every day) to twice weekly: Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough model load spread while saving costs.
- Add post-run cleanup step to terminal-bench workflow that deletes all Daytona sandboxes via the SDK. Runs with if: always() so leaked sandboxes are cleaned up even on failure/cancellation.
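To make the cost effect of the schedule change concrete, here is a minimal Python sketch that counts how often each schedule fires in one week. The old expressions are the ones quoted in the PR description; the new expressions (`0 10 * * 1` and `0 22 * * 4`) are an assumption derived from "Monday 10:00 UTC + Thursday 22:00 UTC" and are not copied from the workflow file. The matcher is deliberately simplified: each field may only be `*` or a single integer.

```python
from datetime import datetime, timedelta, timezone


def cron_matches(expr: str, when: datetime) -> bool:
    """Match a simplified 5-field cron expression.

    Only '*' or a single integer is supported per field; ranges, lists,
    and steps are intentionally out of scope for this sketch.
    """
    fields = expr.split()
    # Cron numbers weekdays 0 = Sunday .. 6 = Saturday; Python uses 0 = Monday.
    cron_dow = (when.weekday() + 1) % 7
    actual = (when.minute, when.hour, when.day, when.month, cron_dow)
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))


def runs_per_week(exprs: list[str]) -> int:
    """Count the hours in one week in which at least one expression fires."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday, 00:00 UTC
    hours = (start + timedelta(hours=h) for h in range(7 * 24))
    return sum(any(cron_matches(e, t) for e in exprs) for t in hours)


OLD_SCHEDULE = ["0 10 * * *", "0 22 * * *"]  # twice daily, from the PR description
NEW_SCHEDULE = ["0 10 * * 1", "0 22 * * 4"]  # assumed: Monday 10:00, Thursday 22:00
```

Under these assumptions, `runs_per_week(OLD_SCHEDULE)` counts every daily 10:00 and 22:00 slot, while `runs_per_week(NEW_SCHEDULE)` counts only the Monday and Thursday slots.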
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 673dc62ece
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Snapshot pre-existing sandbox IDs before the benchmark run, then only delete sandboxes that appeared during this job. Prevents parallel jobs (e.g. concurrent smoke tests) from deleting each other's sandboxes.
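The snapshot-then-diff idea can be sketched roughly as follows. This is not the workflow's actual script: `list_ids` is a hypothetical callable standing in for whatever Daytona SDK call returns the current sandbox IDs, and the JSON snapshot file format is likewise an assumption.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def snapshot_ids(list_ids: Callable[[], Iterable[str]], path: Path) -> None:
    """Before the benchmark: record the IDs of sandboxes that already exist."""
    path.write_text(json.dumps(sorted(set(list_ids()))))


def ids_created_since(list_ids: Callable[[], Iterable[str]], path: Path) -> set[str]:
    """After the benchmark: return only the IDs absent from the snapshot."""
    before = set(json.loads(path.read_text()))
    return set(list_ids()) - before
```

Only the IDs returned by `ids_created_since` would be candidates for deletion, so sandboxes that existed before this job started are left alone.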
@codex review Addressed feedback: cleanup now snapshots pre-existing sandbox IDs before the run and only deletes sandboxes created during this specific job, preventing parallel jobs from interfering with each other.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3b1241774a
- Skip cleanup entirely if the pre-run snapshot file is missing (prevents accidental global deletion when the snapshot step fails).
- Only delete stopped/errored sandboxes, never ones in started/creating/starting state. This protects active sandboxes from parallel jobs.
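The two guards above might look something like this in Python. It is a sketch under stated assumptions: the snapshot is taken to be a JSON list of IDs, `current` is a hypothetical mapping of sandbox ID to a state string, and the labels `"stopped"` and `"errored"` are placeholders taken from the commit message wording, which may not match the state names the Daytona SDK actually reports.

```python
import json
from pathlib import Path

# Placeholder state labels taken from the commit message; the SDK's real names may differ.
DELETABLE_STATES = frozenset({"stopped", "errored"})


def select_for_deletion(
    snapshot_path: Path,
    current: dict[str, str],
    deletable_states: frozenset[str] = DELETABLE_STATES,
) -> list[str]:
    """Return IDs that are new since the snapshot and in a deletable state."""
    if not snapshot_path.exists():
        # No baseline means we cannot tell our sandboxes from anyone else's,
        # so refuse to delete anything rather than fall back to "delete all".
        return []
    before = set(json.loads(snapshot_path.read_text()))
    return sorted(
        sandbox_id
        for sandbox_id, state in current.items()
        if sandbox_id not in before and state in deletable_states
    )
```

Using an allowlist of deletable states, rather than a blocklist of active ones, means an unrecognized state is skipped by default.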
@codex review Addressed both P1s:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c941ddbb0d
Both Daytona steps now use continue-on-error: true so transient API/pip failures never block the benchmark or mark the job as failed.
@codex review Both Daytona steps now use
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 491eb03cbc
This is actually incorrect — |
@codex review Resolved the
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 491eb03cbc
@codex review Resolved both indentation comments — YAML block scalars ( |
Codex Review: Didn't find any major issues. Swish!
Summary
Reduce nightly Terminal Bench from twice daily to twice weekly and add automatic Daytona sandbox cleanup after each benchmark run.
Background
The nightly workflow ran at 10:00 UTC and 22:00 UTC every day, which is excessive for cost and Daytona sandbox quota. Daytona sandboxes leaked when Harbor crashed, timed out, or the workflow was cancelled: auto_delete is disabled by default, so stopped sandboxes persist indefinitely and count toward the 25-sandbox limit.
Implementation
- Cron changed from 0 10 * * * / 0 22 * * * to Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough model load spread across different days.
- Post-run cleanup step in terminal-bench.yml with if: always() && inputs.env == 'daytona'. Uses the Daytona Python SDK to list and delete all sandboxes. Individual delete failures are logged but don't fail the step.
Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $1.33