🤖 ci: reduce nightly bench to twice-weekly, add Daytona sandbox cleanup by ibetitsmike · Pull Request #2690 · coder/mux

ibetitsmike · 2026-02-27T16:56:40Z

Summary

Reduce nightly Terminal Bench from twice daily to twice weekly and add automatic Daytona sandbox cleanup after each benchmark run.

Background

The nightly workflow ran at 10:00 UTC and 22:00 UTC every day, which is excessive in both cost and Daytona sandbox quota usage. Daytona sandboxes leaked when Harbor crashed, timed out, or the workflow was cancelled: auto_delete is disabled by default, so stopped sandboxes persist indefinitely and count toward the 25-sandbox limit.

Implementation

Schedule: Changed cron from 0 10 * * * / 0 22 * * * to Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough model load spread across different days.
Cleanup step: Added to terminal-bench.yml with if: always() && inputs.env == 'daytona'. Uses the Daytona Python SDK to list and delete all sandboxes. Individual delete failures are logged but don't fail the step.
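
The best-effort behaviour described for the cleanup step (log each failed delete, keep going, never fail the step) could look roughly like the Python sketch below. This is an illustration, not the script in terminal-bench.yml: `delete_fn` is a hypothetical callable standing in for whatever Daytona SDK call removes a sandbox, and the function name is made up.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("sandbox-cleanup")


def delete_best_effort(sandbox_ids: Iterable[str],
                       delete_fn: Callable[[str], None]) -> List[str]:
    """Try to delete every sandbox; return the IDs that could not be deleted.

    delete_fn is a hypothetical stand-in for the real SDK delete call.
    """
    failed: List[str] = []
    for sandbox_id in sandbox_ids:
        try:
            delete_fn(sandbox_id)
        except Exception as exc:  # broad on purpose: one failure must not abort cleanup
            log.warning("failed to delete %s: %s", sandbox_id, exc)
            failed.append(sandbox_id)
    return failed
```

Returning the failed IDs instead of raising keeps the step green while still leaving a record of what was left behind.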

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $1.33

- Change cron schedule from twice daily (10:00/22:00 UTC every day) to twice weekly: Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough model load spread while saving costs.
- Add post-run cleanup step to terminal-bench workflow that deletes all Daytona sandboxes via the SDK. Runs with if: always() so leaked sandboxes are cleaned up even on failure/cancellation.

ibetitsmike · 2026-02-27T16:56:46Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 673dc62ece

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

.github/workflows/terminal-bench.yml

Snapshot pre-existing sandbox IDs before the benchmark run, then only delete sandboxes that appeared during this job. Prevents parallel jobs (e.g. concurrent smoke tests) from deleting each other's sandboxes.

ibetitsmike · 2026-02-27T17:01:55Z

@codex review

Addressed feedback: cleanup now snapshots pre-existing sandbox IDs before the run and only deletes sandboxes created during this specific job, preventing parallel jobs from interfering with each other.
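
The snapshot-then-diff idea can be sketched as a set difference in Python. This is a minimal sketch under the assumption that sandboxes are identified by string IDs; the function name and the example IDs are hypothetical and do not come from the workflow.

```python
from typing import Iterable, Set


def sandboxes_created_during_job(before: Iterable[str],
                                 after: Iterable[str]) -> Set[str]:
    """Return IDs present after the run that were absent from the pre-run snapshot."""
    return set(after) - set(before)


# Hypothetical IDs: "sbx-1" and "sbx-2" existed before the job started
# (for example, owned by a parallel job); only "sbx-3" appeared during this job.
pre_run = ["sbx-1", "sbx-2"]
post_run = ["sbx-1", "sbx-2", "sbx-3"]
owned = sandboxes_created_during_job(pre_run, post_run)
```

Note that a sandbox created by a parallel job after the snapshot was taken would still land in the difference, so the diff alone does not fully isolate jobs from each other.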

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b1241774a


.github/workflows/terminal-bench.yml

- Skip cleanup entirely if the pre-run snapshot file is missing (prevents accidental global deletion when the snapshot step fails).
- Only delete stopped/errored sandboxes, never ones in started/creating/starting state. This protects active sandboxes from parallel jobs.

ibetitsmike · 2026-02-27T17:06:20Z

@codex review

Addressed both P1s:

Cleanup now requires the pre-run snapshot file to exist — if it's missing (snapshot step failed), cleanup is skipped entirely instead of falling back to global deletion.
Only stopped/errored sandboxes are deleted. Sandboxes in started/creating/starting state are skipped, protecting active sandboxes from parallel jobs.
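
The two guards above can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow's script: the snapshot is assumed to be a JSON list of IDs, the state strings in `DELETABLE_STATES` are placeholders for whatever values the Daytona SDK actually reports, and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Set

# Assumed state names; the real SDK may spell these differently.
DELETABLE_STATES = {"stopped", "errored"}


def load_snapshot(path: Path) -> Optional[Set[str]]:
    """Return the pre-run sandbox IDs, or None if the snapshot file is missing."""
    if not path.exists():
        return None
    return set(json.loads(path.read_text()))


def select_for_deletion(current: Dict[str, str],
                        snapshot: Optional[Set[str]]) -> List[str]:
    """Pick sandboxes that are new since the snapshot and in a deletable state.

    current maps sandbox ID -> state. With no snapshot, nothing is selected,
    so a failed snapshot step can never turn into a global delete.
    """
    if snapshot is None:
        return []
    return sorted(
        sandbox_id
        for sandbox_id, state in current.items()
        if sandbox_id not in snapshot and state in DELETABLE_STATES
    )
```

Treating a missing snapshot as "delete nothing" rather than "everything is new" is the safer default, at the cost of leaving leaked sandboxes behind when the snapshot step fails.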

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c941ddbb0d


.github/workflows/terminal-bench.yml

Both Daytona steps now use continue-on-error: true so transient API/pip failures never block the benchmark or mark the job as failed.

ibetitsmike · 2026-02-27T17:12:01Z

@codex review

Both Daytona steps now use continue-on-error: true — transient API/pip failures in snapshot or cleanup never block the benchmark or mark the job as failed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 491eb03cbc


.github/workflows/terminal-bench.yml

ibetitsmike · 2026-02-27T17:16:46Z

This is actually incorrect — if: always() in GitHub Actions does execute on cancellation (that's its primary purpose). It only skips on hard job timeout (the 240-minute kill), which is an extreme edge case and can't be fixed within a single job regardless. The cleanup handles the most common leak scenarios: Harbor crash, benchmark failure, and workflow cancellation.

ibetitsmike · 2026-02-27T17:17:04Z

@codex review

Resolved the always() comment — if: always() does execute on cancellation in GitHub Actions (that's its purpose). It only skips on hard job timeout (4h kill), which is an extreme edge case that can't be fixed within a single job.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 491eb03cbc


.github/workflows/terminal-bench.yml

ibetitsmike · 2026-02-27T17:22:41Z

@codex review

Resolved both indentation comments — YAML block scalars (|) strip common leading whitespace before passing to the shell. The Python code inside python3 -c "..." arrives at column 0 after YAML processing, so there's no IndentationError. This is the standard pattern for inline Python in GitHub Actions workflows.
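
The effect described above can be illustrated in Python. In this sketch `textwrap.dedent` stands in for the indentation stripping that a YAML block scalar performs; it is an analogy, not a YAML parser, and the snippet in `INDENTED` is made up rather than taken from the workflow.

```python
import textwrap

# Python source as it might look when nested inside an indented workflow file.
INDENTED = """
    import sys
    major = sys.version_info.major
"""


def compiles(source: str) -> bool:
    """Return True if the source compiles, False on an indentation error."""
    try:
        compile(source, "<inline>", "exec")
        return True
    except IndentationError:
        return False


# With the leading whitespace left in place, Python rejects the code.
raw_ok = compiles(INDENTED)
# Once the common leading whitespace is removed, the same code compiles.
dedented_ok = compiles(textwrap.dedent(INDENTED))
```

The sketch only covers the case where every line shares the same leading whitespace; code indented further than the first line of the block scalar would keep its extra indentation.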

chatgpt-codex-connector · 2026-02-27T17:26:35Z

Codex Review: Didn't find any major issues. Swish!


chatgpt-codex-connector bot reviewed Feb 27, 2026


scope Daytona cleanup to run-owned sandboxes

3b12417


chatgpt-codex-connector bot reviewed Feb 27, 2026


chatgpt-codex-connector bot reviewed Feb 27, 2026


make snapshot + cleanup best-effort with continue-on-error

491eb03


chatgpt-codex-connector bot reviewed Feb 27, 2026


chatgpt-codex-connector bot reviewed Feb 27, 2026


ibetitsmike added this pull request to the merge queue Feb 27, 2026

Merged via the queue into main with commit 67487a4 Feb 27, 2026
38 of 40 checks passed

ibetitsmike deleted the mike/reduce-nightly-bench-frequency branch February 27, 2026 17:47
