🤖 ci: reduce nightly bench to twice-weekly, add Daytona sandbox cleanup #2690

Merged
ibetitsmike merged 4 commits into main from mike/reduce-nightly-bench-frequency
Feb 27, 2026

@ibetitsmike
Contributor

Summary

Reduce nightly Terminal Bench from twice daily to twice weekly and add automatic Daytona sandbox cleanup after each benchmark run.

Background

The nightly workflow ran at 10:00 UTC and 22:00 UTC every day, which is more often than needed and strains both cost and the Daytona sandbox quota. Daytona sandboxes also leaked whenever Harbor crashed or timed out, or the workflow was cancelled: auto_delete is disabled by default, so stopped sandboxes persist indefinitely and count toward the 25-sandbox limit.

Implementation

  • Schedule: Changed cron from 0 10 * * * / 0 22 * * * to Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough model load spread across different days.
  • Cleanup step: Added to terminal-bench.yml with if: always() && inputs.env == 'daytona'. Uses the Daytona Python SDK to list and delete all sandboxes. Individual delete failures are logged but don't fail the step.
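
To make the schedule change concrete, here is a minimal sketch that counts runs per week under the old and new schedules. The old cron expressions are the ones quoted above; the new ones (`0 10 * * 1` and `0 22 * * 4`) are an assumption about how "Monday 10:00 UTC + Thursday 22:00 UTC" would be written, and the matcher handles only plain numbers and `*`.

```python
from datetime import datetime, timedelta, timezone

# Old expressions are quoted in the PR; the new ones are assumed, not quoted.
OLD_SCHEDULE = ["0 10 * * *", "0 22 * * *"]
NEW_SCHEDULE = ["0 10 * * 1", "0 22 * * 4"]  # assumption: Mon 10:00, Thu 22:00 UTC


def cron_matches(expr: str, when: datetime) -> bool:
    """Match a five-field cron expression that uses only plain numbers and '*'."""
    fields = expr.split()
    # Cron numbers weekdays 0-6 starting at Sunday; isoweekday() gives Mon=1..Sun=7.
    actual = [when.minute, when.hour, when.day, when.month, when.isoweekday() % 7]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))


def runs_per_week(schedule: list[str], week_start: datetime) -> int:
    """Count scheduled runs in the 7 days after week_start, checking each hour."""
    hours = [week_start + timedelta(hours=h) for h in range(7 * 24)]
    return sum(cron_matches(expr, t) for t in hours for expr in schedule)


week = datetime(2026, 2, 23, tzinfo=timezone.utc)  # a Monday, 00:00 UTC
print(runs_per_week(OLD_SCHEDULE, week), runs_per_week(NEW_SCHEDULE, week))  # 14 2
```

Under these assumptions the job drops from 14 scheduled runs a week to 2.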

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $1.33

- Change cron schedule from twice daily (10:00/22:00 UTC every day) to
  twice weekly: Monday 10:00 UTC + Thursday 22:00 UTC. Keeps peak/trough
  model load spread while saving costs.
- Add post-run cleanup step to terminal-bench workflow that deletes all
  Daytona sandboxes via the SDK. Runs with if: always() so leaked
  sandboxes are cleaned up even on failure/cancellation.
@ibetitsmike
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 673dc62ece

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Snapshot pre-existing sandbox IDs before the benchmark run, then only
delete sandboxes that appeared during this job. Prevents parallel jobs
(e.g. concurrent smoke tests) from deleting each other's sandboxes.
@ibetitsmike
Contributor Author

@codex review

Addressed feedback: cleanup now snapshots pre-existing sandbox IDs before the run and only deletes sandboxes created during this specific job, preventing parallel jobs from interfering with each other.
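
The snapshot-then-diff idea can be sketched as a set difference. This is an illustration only: the function name and sandbox IDs are hypothetical, and the real step gets its ID lists from the Daytona SDK rather than from literals.

```python
def created_during_job(before_ids, after_ids):
    """Return IDs present after the run that were not in the pre-run snapshot."""
    return sorted(set(after_ids) - set(before_ids))


# Hypothetical IDs: "smoke-1" belongs to a parallel job that started earlier.
before = ["smoke-1", "old-7"]
after = ["smoke-1", "old-7", "bench-a", "bench-b"]
print(created_during_job(before, after))  # ['bench-a', 'bench-b']
```

A sandbox that a parallel job creates while this job is running would still show up as new, so the diff alone does not fully isolate jobs from each other.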

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b1241774a

- Skip cleanup entirely if the pre-run snapshot file is missing (prevents
  accidental global deletion when the snapshot step fails).
- Only delete stopped/errored sandboxes, never ones in started/creating/
  starting state. This protects active sandboxes from parallel jobs.
@ibetitsmike
Contributor Author

@codex review

Addressed both P1s:

  1. Cleanup now requires the pre-run snapshot file to exist — if it's missing (snapshot step failed), cleanup is skipped entirely instead of falling back to global deletion.
  2. Only stopped/errored sandboxes are deleted. Sandboxes in started/creating/starting state are skipped, protecting active sandboxes from parallel jobs.
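
A rough sketch of the two guards, combined with the log-and-continue delete behavior from the PR description. The state strings, the JSON snapshot format, the file name, and the `delete_fn` callback are all assumptions made for illustration; the actual workflow step calls the Daytona SDK and may represent these differently.

```python
import json
import tempfile
from pathlib import Path

# Assumed state strings; the real SDK may name or type these differently.
DELETABLE_STATES = {"stopped", "error"}


def plan_cleanup(snapshot_path, current_states):
    """Pick sandbox IDs to delete.

    current_states maps sandbox ID -> state string. Returns an empty list when
    the snapshot file is missing, so a failed snapshot step never widens the
    deletion to every sandbox in the account.
    """
    path = Path(snapshot_path)
    if not path.exists():
        return []
    before = set(json.loads(path.read_text()))
    return sorted(
        sid for sid, state in current_states.items()
        if sid not in before and state in DELETABLE_STATES
    )


def delete_all(ids, delete_fn):
    """Call delete_fn per ID; log failures and keep going. Returns failed IDs."""
    failed = []
    for sid in ids:
        try:
            delete_fn(sid)
        except Exception as exc:
            print(f"warning: could not delete {sid}: {exc}")
            failed.append(sid)
    return failed


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "pre_run_ids.json"  # hypothetical file name
    states = {"old-7": "stopped", "bench-a": "stopped", "bench-b": "started"}
    skipped = plan_cleanup(snapshot, states)  # snapshot missing: nothing planned
    snapshot.write_text(json.dumps(["old-7"]))
    planned = plan_cleanup(snapshot, states)
print(skipped, planned)  # [] ['bench-a']
```

Here `old-7` is spared because it predates the run, and `bench-b` is spared because it is still started.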

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c941ddbb0d

Both Daytona steps now use continue-on-error: true so transient API/pip
failures never block the benchmark or mark the job as failed.
@ibetitsmike
Contributor Author

@codex review

Both Daytona steps now use continue-on-error: true — transient API/pip failures in snapshot or cleanup never block the benchmark or mark the job as failed.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 491eb03cbc

@ibetitsmike
Contributor Author

This is actually incorrect — if: always() in GitHub Actions does execute on cancellation (that's its primary purpose). It only skips on hard job timeout (the 240-minute kill), which is an extreme edge case and can't be fixed within a single job regardless. The cleanup handles the most common leak scenarios: Harbor crash, benchmark failure, and workflow cancellation.

@ibetitsmike
Contributor Author

@codex review

Resolved the always() comment — if: always() does execute on cancellation in GitHub Actions (that's its purpose). It only skips on hard job timeout (4h kill), which is an extreme edge case that can't be fixed within a single job.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 491eb03cbc

@ibetitsmike
Contributor Author

@codex review

Resolved both indentation comments — YAML block scalars (|) strip common leading whitespace before passing to the shell. The Python code inside python3 -c "..." arrives at column 0 after YAML processing, so there's no IndentationError. This is the standard pattern for inline Python in GitHub Actions workflows.
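
The indentation argument can be checked with a small stand-in. This sketch uses `textwrap.dedent` to approximate what a YAML parser does to a `|` block (YAML actually takes the indentation from the first non-empty line, which gives the same result for a uniformly indented block). The inline script is a made-up example, not the one from the workflow.

```python
import textwrap

# Hypothetical body of a `run: |` block, indented as it would sit in the workflow file.
INDENTED_BODY = (
    '          python3 -c "\n'
    "          ids = ['a', 'b']\n"
    "          for i in ids:\n"
    "              print(i)\n"
    '          "\n'
)


def inline_python(shell_text: str) -> str:
    """Return the text between the first pair of double quotes."""
    return shell_text.split('"')[1]


def compiles(source: str) -> bool:
    """Report whether Python accepts the source without a syntax error."""
    try:
        compile(source, "<inline>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(compiles(inline_python(INDENTED_BODY)))                   # False
print(compiles(inline_python(textwrap.dedent(INDENTED_BODY))))  # True
```

With the block indentation still attached, Python rejects the script; once the common indentation is stripped, the top-level statements start at column 0 and it compiles.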

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

@ibetitsmike ibetitsmike added this pull request to the merge queue Feb 27, 2026
Merged via the queue into main with commit 67487a4 Feb 27, 2026
38 of 40 checks passed
@ibetitsmike ibetitsmike deleted the mike/reduce-nightly-bench-frequency branch February 27, 2026 17:47