Commit df67d0a

LoCoBench Bot and claude committed
feat: add staging → official run promotion workflow
Runs now land in runs/staging/ by default instead of runs/official/.
After validation, use promote_run.py to move to official.

- New scripts/promote_run.py: --list, --execute, --force, --all flags
  Validates flagged_tasks.json (0 criticals gate), checks all tasks
  have result.json, moves to official, regenerates MANIFEST
- Update all 34 config scripts: CATEGORY default official → staging
- aggregate_status.py: add --staging flag to scan staging runs
- check_infra.py: add staging_dir check
- CLAUDE.md: document promotion workflow and promote_run.py

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent ce91ac2 commit df67d0a
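
The validation gate described in the commit message (no critical flags, every task has a result.json, then move to official) might look roughly like the sketch below. This is not the code in scripts/promote_run.py, which is not shown on this page. The function names, the flagged_tasks.json schema (a JSON list of objects with a "severity" field), and the idea that each immediate subdirectory of a run is one task are all assumptions made for illustration. MANIFEST regeneration and the --force and --all flags are left out.

```python
import json
import shutil
from pathlib import Path


def validate_run(run_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the run may be promoted."""
    problems = []

    # Gate 1: no critical flags. The schema is an assumption: a JSON list
    # of objects, each carrying a "severity" field.
    flagged_path = run_dir / "flagged_tasks.json"
    if not flagged_path.is_file():
        problems.append("flagged_tasks.json is missing")
    else:
        flags = json.loads(flagged_path.read_text())
        criticals = [f for f in flags if f.get("severity") == "critical"]
        if criticals:
            problems.append(f"{len(criticals)} critical flag(s)")

    # Gate 2: every task has a result.json. Treating each immediate
    # subdirectory as one task is also an assumption.
    for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        if not (task_dir / "result.json").is_file():
            problems.append(f"{task_dir.name}: result.json is missing")

    return problems


def promote(run_name: str, runs_root: Path) -> Path:
    """Move <runs_root>/staging/<run_name> to <runs_root>/official/<run_name>."""
    src = runs_root / "staging" / run_name
    dst = runs_root / "official" / run_name
    if not src.is_dir():
        raise FileNotFoundError(f"no staging run named {run_name}")
    problems = validate_run(src)
    if problems:
        # Refuse to promote and report every problem at once.
        raise ValueError("; ".join(problems))
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```

Collecting all problems before failing, rather than stopping at the first, lets one validation pass show everything that needs fixing before a rerun.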

38 files changed: 513 additions and 39 deletions

CLAUDE.md

Lines changed: 20 additions & 3 deletions
@@ -29,6 +29,7 @@ scripts/
 cost_report.py # Token/cost aggregation
 sync_task_metadata.py # task.toml vs selection registry reconciliation
 archive_run.py # Archive old runs to save disk
+promote_run.py # Validate & promote staging runs to official
 rerun_failed.py # Generate rerun commands for failed tasks
 
 docs/
@@ -64,15 +65,23 @@ docs/
 
 ## Running Tasks
 
+Runs land in `runs/staging/` by default. After validation, promote to `runs/official/`.
+
 ```bash
-# Run a single benchmark (2 configs: baseline, SG_full)
+# Run a benchmark (lands in runs/staging/)
 ./configs/pytorch_2config.sh
 
 # Run with parallel execution
 ./configs/pytorch_2config.sh --parallel
 
-# Override parallelism
-./configs/pytorch_2config.sh --parallel 4
+# Check staging runs
+python3 scripts/promote_run.py --list
+
+# Promote validated run to official
+python3 scripts/promote_run.py --execute pytorch_opus_20260217_120000
+
+# Skip staging (write directly to official)
+CATEGORY=official ./configs/pytorch_2config.sh
 ```
 
 See [AGENTS.md](AGENTS.md) for parallel execution details and multi-account setup.
@@ -111,6 +120,13 @@ python3 scripts/generate_eval_report.py
 
 # Select benchmark tasks
 python3 scripts/select_benchmark_tasks.py
+
+# Promote staging runs to official
+python3 scripts/promote_run.py --list # View staging runs
+python3 scripts/promote_run.py --execute <run_name> # Promote to official
+
+# Monitor staging runs
+python3 scripts/aggregate_status.py --staging
 ```
 
 ## Operational Skills (17)
@@ -180,6 +196,7 @@ MAINTENANCE
 |-------|--------|---------|
 | `/sync-metadata` | `scripts/sync_task_metadata.py` | Reconcile task.toml vs selected_benchmark_tasks.json, `--fix` to auto-update |
 | `/archive-run` | `scripts/archive_run.py` | Move old runs to archive/, optional compression, dry-run by default |
+| `/promote-run` | `scripts/promote_run.py` | Validate and promote staging runs to official, regenerate MANIFEST |
 | `/reextract-metrics` | `scripts/reextract_all_metrics.py` | Batch re-extract task_metrics.json after extraction bug fixes or schema changes |
 
 ### Supporting Scripts

configs/archive/repoqa_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/codereview_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/crossrepo_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/dehinted_rerun_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/dependeval_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/dibench_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/docgen_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 TASK_FILTER=""
 
 # All docgen task IDs — populated by task-creation Ralphs

configs/enterprise_2config.sh

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
 RUN_BASELINE=true
 RUN_FULL=true
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 
 # Parse arguments
 while [[ $# -gt 0 ]]; do

configs/gapfill_all.sh

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ AGENT_PATH="agents.claude_baseline_agent:BaselineClaudeCodeAgent"
 MODEL="${MODEL:-anthropic/claude-opus-4-6}"
 CONCURRENCY=2
 TIMEOUT_MULTIPLIER=10
-CATEGORY="${CATEGORY:-official}"
+CATEGORY="${CATEGORY:-staging}"
 SELECTION_FILE="$SCRIPT_DIR/selected_benchmark_tasks.json"
 
 RUN_PYTORCH=true
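
Every config script in this diff makes the same one-line change: the fallback in `CATEGORY="${CATEGORY:-staging}"`. The `:-` form of shell parameter expansion uses the fallback when the variable is unset or set to the empty string, so `CATEGORY=official ./configs/pytorch_2config.sh` still overrides it. For readers more at home in Python, the same resolution rule can be sketched as below. The helper names and the `runs/<category>/<run_name>` layout are assumptions for illustration, since the scripts' path-building code is not part of the lines shown.

```python
import os
from pathlib import Path


def resolve_category(env=None, default="staging"):
    """Mirror the shell expansion ${CATEGORY:-staging}."""
    env = os.environ if env is None else env
    # ":-" falls back when the variable is unset OR empty, so `or` is the
    # matching idiom; env.get("CATEGORY", default) would keep an empty string.
    return env.get("CATEGORY") or default


def run_output_dir(run_name, env=None, runs_root=Path("runs")):
    """Hypothetical helper: the directory a run would be written to."""
    return runs_root / resolve_category(env) / run_name
```

Under these assumptions, `run_output_dir("demo", env={})` resolves to `runs/staging/demo`, and passing `env={"CATEGORY": "official"}` resolves to `runs/official/demo`.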
