
Commit 83bb9c8

Deduplicate suite-level official results and preserve history
1 parent efdaf87 commit 83bb9c8

19 files changed: +127037 -45484 lines changed

README.md

Lines changed: 4 additions & 0 deletions
@@ -258,6 +258,10 @@ This writes:
 - `docs/official_results/traces/*/trajectory.json` -- bundled raw trajectory traces for GitHub audit
 - `docs/official_results/index.html` -- interactive local browser

+Suite summaries are deduplicated to the latest result per
+`suite + config + task_name`; full historical rows remain in
+`official_results.json` under `all_tasks`.
+
 Serve locally:

 ```bash

docs/OFFICIAL_RESULTS_BROWSER.md

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@ Use this workflow to publish valid official scores with easy-to-view parsed trac
 - `docs/official_results/traces/*/trajectory.json` - bundled raw trajectory traces
 - `docs/official_results/index.html` - local interactive browser

+Suite-level views and top-level summaries are deduplicated to one canonical row
+per `suite + config + task_name` (latest by task `started_at`). Full historical
+rows are preserved in `data/official_results.json` as `all_tasks`.
+
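Review note: the deduplication rule added in this hunk (one canonical row per `suite + config + task_name`, latest by `started_at`) can be sketched as below. This is a minimal illustration, not the code from `scripts/export_official_results.py`: the function name `dedupe_latest`, the `reward` key, and the sample task names are hypothetical, and `started_at` is assumed to be an ISO 8601 string like the `Generated:` timestamp in the bundle README.

```python
from datetime import datetime


def dedupe_latest(rows):
    """Keep one row per (suite, config, task_name): the one with the latest started_at.

    Hypothetical helper; on an exact timestamp tie the first row seen is kept.
    """
    latest = {}
    for row in rows:
        key = (row["suite"], row["config"], row["task_name"])
        started = datetime.fromisoformat(row["started_at"])
        if key not in latest or started > latest[key][0]:
            latest[key] = (started, row)
    return [row for _, row in latest.values()]


# Made-up history: task-a was rerun once, task-b ran a single time.
all_tasks = [
    {"suite": "ccb_fix", "config": "baseline-local-direct", "task_name": "task-a",
     "started_at": "2026-02-20T10:00:00+00:00", "reward": 0.0},
    {"suite": "ccb_fix", "config": "baseline-local-direct", "task_name": "task-a",
     "started_at": "2026-02-25T09:30:00+00:00", "reward": 1.0},
    {"suite": "ccb_fix", "config": "baseline-local-direct", "task_name": "task-b",
     "started_at": "2026-02-21T12:00:00+00:00", "reward": 0.5},
]

canonical = dedupe_latest(all_tasks)  # used for suite-level views
# all_tasks is left untouched, so the full history stays available.
```

Because the input list is never mutated, the same rows can be written out under `all_tasks` while only `canonical` feeds the summaries.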
 ## Usage

 ```bash

docs/official_results/README.md

Lines changed: 21 additions & 18 deletions
@@ -2,40 +2,43 @@

 This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).

-Generated: `2026-02-27T02:17:25.254680+00:00`
+Generated: `2026-02-27T02:23:03.814992+00:00`

 ## Local Browse

 ```bash
 python3 scripts/export_official_results.py --serve
 ```

+Suite-level views are deduplicated to the latest row per `suite + config + task_name`.
+Historical reruns/backfills remain available in `data/official_results.json` under `all_tasks`.
+
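Review note: the validity rule stated in this file (`passed`/`failed` with numeric reward) and the three summary columns of the table that follows could be computed roughly as sketched here. The row keys `status` and `reward`, the function name `summarize`, and the definition of pass rate as the share of valid rows with status `passed` are assumptions made for illustration; the exporter may define them differently.

```python
def is_valid(row):
    """A row counts as valid if it was scored: passed/failed with a numeric reward."""
    reward = row.get("reward")
    numeric = isinstance(reward, (int, float)) and not isinstance(reward, bool)
    return row.get("status") in ("passed", "failed") and numeric


def summarize(rows):
    """Group valid rows by (suite, config) and compute the summary columns."""
    groups = {}
    for row in rows:
        if is_valid(row):
            groups.setdefault((row["suite"], row["config"]), []).append(row)

    summary = {}
    for key, members in groups.items():
        n = len(members)
        summary[key] = {
            "valid_tasks": n,
            "mean_reward": round(sum(m["reward"] for m in members) / n, 3),
            # Assumed definition: fraction of valid rows whose status is "passed".
            "pass_rate": round(sum(m["status"] == "passed" for m in members) / n, 3),
        }
    return summary


# Made-up rows, already deduplicated; the last one has no numeric reward.
rows = [
    {"suite": "ccb_test", "config": "mcp", "status": "passed", "reward": 1.0},
    {"suite": "ccb_test", "config": "mcp", "status": "passed", "reward": 0.5},
    {"suite": "ccb_test", "config": "mcp", "status": "failed", "reward": 0.0},
    {"suite": "ccb_test", "config": "mcp", "status": "errored", "reward": None},
]

stats = summarize(rows)[("ccb_test", "mcp")]
```

With these sample rows, three of the four are valid, so `stats` holds 3 valid tasks, a mean reward of 0.5, and a pass rate of 0.667.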
 ## Suite/Config Summary

 | Suite | Config | Valid Tasks | Mean Reward | Pass Rate |
 |---|---|---:|---:|---:|
 | [ccb_build](suites/ccb_build.md) | `baseline` | 19 | 0.511 | 0.789 |
-| [ccb_build](suites/ccb_build.md) | `baseline-local-direct` | 21 | 0.541 | 0.810 |
+| [ccb_build](suites/ccb_build.md) | `baseline-local-direct` | 20 | 0.527 | 0.800 |
 | [ccb_build](suites/ccb_build.md) | `mcp` | 25 | 0.372 | 0.640 |
 | [ccb_build](suites/ccb_build.md) | `mcp-remote-direct` | 25 | 0.372 | 0.640 |
 | [ccb_debug](suites/ccb_debug.md) | `baseline` | 20 | 0.670 | 1.000 |
 | [ccb_debug](suites/ccb_debug.md) | `baseline-local-direct` | 20 | 0.670 | 1.000 |
 | [ccb_debug](suites/ccb_debug.md) | `mcp` | 20 | 0.487 | 0.600 |
 | [ccb_debug](suites/ccb_debug.md) | `mcp-remote-direct` | 20 | 0.487 | 0.600 |
 | [ccb_design](suites/ccb_design.md) | `baseline` | 13 | 0.770 | 1.000 |
-| [ccb_design](suites/ccb_design.md) | `baseline-local-direct` | 27 | 0.745 | 0.926 |
+| [ccb_design](suites/ccb_design.md) | `baseline-local-direct` | 20 | 0.753 | 0.950 |
 | [ccb_design](suites/ccb_design.md) | `mcp` | 20 | 0.718 | 1.000 |
 | [ccb_design](suites/ccb_design.md) | `mcp-remote-direct` | 20 | 0.718 | 1.000 |
 | [ccb_document](suites/ccb_document.md) | `baseline` | 14 | 0.904 | 1.000 |
-| [ccb_document](suites/ccb_document.md) | `baseline-local-direct` | 26 | 0.825 | 1.000 |
+| [ccb_document](suites/ccb_document.md) | `baseline-local-direct` | 20 | 0.847 | 1.000 |
 | [ccb_document](suites/ccb_document.md) | `mcp` | 15 | 0.953 | 1.000 |
 | [ccb_document](suites/ccb_document.md) | `mcp-remote-direct` | 25 | 0.802 | 1.000 |
 | [ccb_fix](suites/ccb_fix.md) | `baseline` | 17 | 0.535 | 0.706 |
-| [ccb_fix](suites/ccb_fix.md) | `baseline-local-direct` | 36 | 0.346 | 0.472 |
+| [ccb_fix](suites/ccb_fix.md) | `baseline-local-direct` | 28 | 0.428 | 0.571 |
 | [ccb_fix](suites/ccb_fix.md) | `mcp` | 17 | 0.538 | 0.647 |
-| [ccb_fix](suites/ccb_fix.md) | `mcp-remote-direct` | 33 | 0.440 | 0.545 |
+| [ccb_fix](suites/ccb_fix.md) | `mcp-remote-direct` | 28 | 0.467 | 0.571 |
 | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-artifact` | 1 | 0.375 | 1.000 |
-| [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-direct` | 12 | 0.450 | 0.833 |
+| [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-direct` | 6 | 0.668 | 1.000 |
 | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `mcp-remote-artifact` | 1 | 0.742 | 1.000 |
 | [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `mcp-remote-direct` | 29 | 0.420 | 0.724 |
 | [ccb_mcp_crossorg](suites/ccb_mcp_crossorg.md) | `baseline` | 2 | 0.750 | 1.000 |
@@ -46,21 +49,21 @@ python3 scripts/export_official_results.py --serve
 | [ccb_mcp_crossorg](suites/ccb_mcp_crossorg.md) | `mcp-remote-direct` | 4 | 0.718 | 1.000 |
 | [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `baseline` | 3 | 0.941 | 1.000 |
 | [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `baseline-local-artifact` | 2 | 0.000 | 0.000 |
-| [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `baseline-local-direct` | 6 | 0.601 | 0.833 |
+| [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `baseline-local-direct` | 5 | 0.721 | 1.000 |
 | [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `mcp` | 3 | 0.899 | 1.000 |
 | [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `mcp-remote-artifact` | 2 | 0.287 | 1.000 |
 | [ccb_mcp_crossrepo](suites/ccb_mcp_crossrepo.md) | `mcp-remote-direct` | 21 | 0.580 | 0.810 |
 | [ccb_mcp_domain](suites/ccb_mcp_domain.md) | `baseline-local-artifact` | 3 | 0.000 | 0.000 |
-| [ccb_mcp_domain](suites/ccb_mcp_domain.md) | `baseline-local-direct` | 12 | 0.435 | 0.667 |
+| [ccb_mcp_domain](suites/ccb_mcp_domain.md) | `baseline-local-direct` | 7 | 0.632 | 1.000 |
 | [ccb_mcp_domain](suites/ccb_mcp_domain.md) | `mcp-remote-artifact` | 3 | 0.529 | 1.000 |
 | [ccb_mcp_domain](suites/ccb_mcp_domain.md) | `mcp-remote-direct` | 30 | 0.501 | 0.867 |
 | [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `baseline` | 1 | 0.500 | 1.000 |
 | [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `baseline-local-artifact` | 3 | 0.167 | 0.333 |
-| [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `baseline-local-direct` | 10 | 0.500 | 0.700 |
+| [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `baseline-local-direct` | 7 | 0.714 | 1.000 |
 | [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `mcp` | 1 | 1.000 | 1.000 |
 | [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `mcp-remote-artifact` | 3 | 0.782 | 1.000 |
 | [ccb_mcp_incident](suites/ccb_mcp_incident.md) | `mcp-remote-direct` | 29 | 0.589 | 0.862 |
-| [ccb_mcp_migration](suites/ccb_mcp_migration.md) | `baseline-local-direct` | 19 | 0.658 | 0.842 |
+| [ccb_mcp_migration](suites/ccb_mcp_migration.md) | `baseline-local-direct` | 7 | 0.815 | 1.000 |
 | [ccb_mcp_migration](suites/ccb_mcp_migration.md) | `mcp-remote-direct` | 34 | 0.342 | 0.647 |
 | [ccb_mcp_onboarding](suites/ccb_mcp_onboarding.md) | `baseline` | 3 | 0.639 | 1.000 |
 | [ccb_mcp_onboarding](suites/ccb_mcp_onboarding.md) | `baseline-local-artifact` | 4 | 0.000 | 0.000 |
@@ -73,27 +76,27 @@ python3 scripts/export_official_results.py --serve
 | [ccb_mcp_org](suites/ccb_mcp_org.md) | `mcp-remote-artifact` | 2 | 0.705 | 1.000 |
 | [ccb_mcp_org](suites/ccb_mcp_org.md) | `mcp-remote-direct` | 12 | 0.518 | 1.000 |
 | [ccb_mcp_platform](suites/ccb_mcp_platform.md) | `baseline` | 1 | 0.928 | 1.000 |
-| [ccb_mcp_platform](suites/ccb_mcp_platform.md) | `baseline-local-direct` | 11 | 0.644 | 0.909 |
+| [ccb_mcp_platform](suites/ccb_mcp_platform.md) | `baseline-local-direct` | 4 | 0.676 | 1.000 |
 | [ccb_mcp_platform](suites/ccb_mcp_platform.md) | `mcp` | 1 | 0.928 | 1.000 |
 | [ccb_mcp_platform](suites/ccb_mcp_platform.md) | `mcp-remote-direct` | 17 | 0.439 | 0.765 |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline` | 2 | 0.500 | 1.000 |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline-local-artifact` | 4 | 0.000 | 0.000 |
-| [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline-local-direct` | 7 | 0.564 | 1.000 |
+| [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline-local-direct` | 4 | 0.603 | 1.000 |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp` | 2 | 0.821 | 1.000 |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-artifact` | 4 | 0.777 | 1.000 |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-direct` | 16 | 0.705 | 1.000 |
 | [ccb_secure](suites/ccb_secure.md) | `baseline` | 18 | 0.688 | 0.944 |
-| [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 22 | 0.654 | 0.955 |
+| [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 0.669 | 0.950 |
 | [ccb_secure](suites/ccb_secure.md) | `mcp` | 18 | 0.705 | 1.000 |
 | [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 22 | 0.645 | 0.909 |
 | [ccb_test](suites/ccb_test.md) | `baseline` | 9 | 0.472 | 0.778 |
-| [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 33 | 0.421 | 0.697 |
+| [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 0.480 | 0.750 |
 | [ccb_test](suites/ccb_test.md) | `mcp` | 8 | 0.555 | 0.625 |
-| [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 42 | 0.415 | 0.643 |
+| [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 31 | 0.403 | 0.613 |
 | [ccb_understand](suites/ccb_understand.md) | `baseline` | 13 | 0.592 | 0.692 |
-| [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 27 | 0.599 | 0.741 |
+| [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 20 | 0.660 | 0.800 |
 | [ccb_understand](suites/ccb_understand.md) | `mcp` | 13 | 0.841 | 1.000 |
-| [ccb_understand](suites/ccb_understand.md) | `mcp-remote-direct` | 27 | 0.728 | 0.889 |
+| [ccb_understand](suites/ccb_understand.md) | `mcp-remote-direct` | 20 | 0.851 | 1.000 |

 <details>
 <summary>Run/Config Summary</summary>

0 commit comments
