Commit 4ed88cc

Add paired timing breakdown and align blog cost/timing narrative
1 parent d4279d4 commit 4ed88cc

3 files changed: +52 -6 lines changed

docs/BLOG_POST.md

Lines changed: 4 additions & 4 deletions
@@ -86,14 +86,14 @@ One finding from the refreshed IR pipeline: the Spearman correlation between ret

 ## The Cost Surprise

-One finding I didn't expect: MCP runs are cheaper.
+One finding I didn't expect after recomputing the cost section on a strict paired slice: MCP is not cheaper overall.

 | Config | Mean Cost/Task | Total Cost |
 |--------|---------------|------------|
-| Baseline | $0.75 | $175.68 |
-| MCP | $0.47 | $97.01 |
+| Baseline | $0.339 | $85.12 |
+| MCP | $0.352 | $88.35 |

-MCP-augmented runs cost less on average. The mechanism is straightforward: the truncated-source environment has less local code to read, so the agent processes fewer input tokens. MCP is also consistently faster across every suite, with time reductions ranging from about 29% (test) to 94% (design) in the current paired table.
+Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).

 This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.

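Reviewer note: the headline percentages in the updated blog paragraph can be rechecked from the totals and means quoted in this hunk. The sketch below is only an arithmetic check, not the repository's cost pipeline; the helper name `pct_delta` and the constant names are hypothetical, and the inputs are copied from the added lines.

```python
# Arithmetic check of the figures quoted in the updated blog text.
# All inputs are copied from the hunk; helper and constant names are illustrative.

N_PAIRS = 251                  # paired tasks, per the blog text
BASELINE_TOTAL_USD = 85.12     # "Total Cost" column, Baseline row
MCP_TOTAL_USD = 88.35          # "Total Cost" column, MCP row


def pct_delta(baseline: float, treatment: float) -> float:
    """Relative change of treatment versus baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


baseline_mean = BASELINE_TOTAL_USD / N_PAIRS
mcp_mean = MCP_TOTAL_USD / N_PAIRS

cost_delta_usd = mcp_mean - baseline_mean
cost_delta_pct = pct_delta(baseline_mean, mcp_mean)

# Mean timings (seconds) quoted in the same paragraph.
wall_delta_pct = pct_delta(1401.9, 653.0)
agent_delta_pct = pct_delta(1058.3, 299.3)

print(f"mean cost: baseline ${baseline_mean:.3f}, MCP ${mcp_mean:.3f}")
print(f"cost delta: {cost_delta_usd:+.3f} USD/task ({cost_delta_pct:+.1f}%)")
print(f"wall clock: {wall_delta_pct:+.1f}%, agent execution: {agent_delta_pct:+.1f}%")
```

At the precision shown in the hunk, these values agree with the +$0.013/task, 3.8%, -53.4%, and -71.7% figures in the added paragraph.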

docs/WHITE_PAPER_REPORT_V2.md

Lines changed: 24 additions & 1 deletion
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2

 On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.

-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and compares both end-to-end wall clock and agent execution time from `task_metrics.json`.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+|-------|---|-------------------|--------------|------------|-------------------|
+| build | 25 | 2555.9 | 1547.4 | -39.5% | 18 / 7 |
+| debug | 20 | 1115.4 | 409.6 | -63.3% | 16 / 4 |
+| design | 20 | 3478.4 | 615.6 | -82.3% | 20 / 0 |
+| document | 20 | 503.6 | 235.9 | -53.2% | 17 / 3 |
+| fix | 25 | 2143.5 | 1249.8 | -41.7% | 19 / 6 |
+| secure | 20 | 386.2 | 337.2 | -12.7% | 14 / 6 |
+| test | 20 | 402.2 | 315.2 | -21.6% | 15 / 5 |
+| understand | 20 | 805.0 | 295.3 | -63.3% | 16 / 4 |
+| mcp_unique | 81 | 1241.6 | 614.8 | -50.5% | 40 / 41 |
+
+| Metric | n | Baseline Mean (s) | MCP Mean (s) | Delta | MCP Faster/Slower |
+|--------|---|-------------------|--------------|-------|-------------------|
+| Wall clock | 251 | 1401.9 | 653.0 | -53.4% | 175 / 76 |
+| Agent execution | 251 | 1058.3 | 299.3 | -71.7% | 195 / 56 |
+
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock is much lower under MCP, but the task-level faster/slower split is nearly even (40/41), indicating a high-variance timing distribution.
+
+### 11.10 Correlation Analysis

 | Correlation (Spearman rho) | Value | n | Interpretation |
 |---------------------------|-------|---|---------------|
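Reviewer note: the new Section 11.9 reports, per suite, a paired mean, a relative delta, and a task-level faster/slower count. The commit does not show how those columns are computed, so the following is a minimal sketch under assumptions: the `PairedTiming` record, its field names, and the `summarize` helper are hypothetical, ties are counted as neither faster nor slower, and the four demo rows are synthetic values chosen only to illustrate the shape of the calculation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedTiming:
    """One task run under both configs (hypothetical record shape)."""
    task_id: str
    suite: str
    baseline_wall_s: float
    mcp_wall_s: float


def summarize(pairs):
    """Per-suite n, paired means, percent delta, and faster/slower counts."""
    by_suite = defaultdict(list)
    for pair in pairs:
        by_suite[pair.suite].append(pair)

    rows = {}
    for suite, items in by_suite.items():
        base = mean(p.baseline_wall_s for p in items)
        mcp = mean(p.mcp_wall_s for p in items)
        rows[suite] = {
            "n": len(items),
            "baseline_mean": base,
            "mcp_mean": mcp,
            "delta_pct": (mcp - base) / base * 100.0,
            # Exact ties fall into neither bucket (an assumption of this sketch).
            "faster": sum(p.mcp_wall_s < p.baseline_wall_s for p in items),
            "slower": sum(p.mcp_wall_s > p.baseline_wall_s for p in items),
        }
    return rows


# Synthetic demo rows, NOT benchmark data: one large saving and three small
# differences, so the mean drops sharply while the split stays even.
demo = [
    PairedTiming("t1", "demo", 4000.0, 400.0),
    PairedTiming("t2", "demo", 300.0, 320.0),
    PairedTiming("t3", "demo", 500.0, 450.0),
    PairedTiming("t4", "demo", 200.0, 230.0),
]
row = summarize(demo)["demo"]
print(row)
```

In the synthetic rows a single long-running task dominates the mean, which is one way a suite can show a large mean reduction alongside a near-even faster/slower split, as the `mcp_unique` row does. Whether that is what happens in `mcp_unique` cannot be determined from the table alone; per-task medians would help distinguish the cases.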

docs/technical_reports/TECHNICAL_REPORT_V1.md

Lines changed: 24 additions & 1 deletion
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2

 On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.

-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and compares both end-to-end wall clock and agent execution time from `task_metrics.json`.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+|-------|---|-------------------|--------------|------------|-------------------|
+| build | 25 | 2555.9 | 1547.4 | -39.5% | 18 / 7 |
+| debug | 20 | 1115.4 | 409.6 | -63.3% | 16 / 4 |
+| design | 20 | 3478.4 | 615.6 | -82.3% | 20 / 0 |
+| document | 20 | 503.6 | 235.9 | -53.2% | 17 / 3 |
+| fix | 25 | 2143.5 | 1249.8 | -41.7% | 19 / 6 |
+| secure | 20 | 386.2 | 337.2 | -12.7% | 14 / 6 |
+| test | 20 | 402.2 | 315.2 | -21.6% | 15 / 5 |
+| understand | 20 | 805.0 | 295.3 | -63.3% | 16 / 4 |
+| mcp_unique | 81 | 1241.6 | 614.8 | -50.5% | 40 / 41 |
+
+| Metric | n | Baseline Mean (s) | MCP Mean (s) | Delta | MCP Faster/Slower |
+|--------|---|-------------------|--------------|-------|-------------------|
+| Wall clock | 251 | 1401.9 | 653.0 | -53.4% | 175 / 76 |
+| Agent execution | 251 | 1058.3 | 299.3 | -71.7% | 195 / 56 |
+
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock is much lower under MCP, but the task-level faster/slower split is nearly even (40/41), indicating a high-variance timing distribution.
+
+### 11.10 Correlation Analysis

 | Correlation (Spearman rho) | Value | n | Interpretation |
 |---------------------------|-------|---|---------------|
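Reviewer note: because the same timing tables are duplicated across two reports, a quick internal-consistency check may be worth keeping next to them. The sketch below re-derives the overall wall-clock row from the per-suite rows by weighting each suite mean by its n. The tuples are copied from the table in this hunk; the variable names and the 0.1 s tolerance (to absorb rounding of the published suite means) are assumptions of this sketch.

```python
# (suite, n, baseline_wall_s, mcp_wall_s, faster, slower), copied from the table.
SUITES = [
    ("build", 25, 2555.9, 1547.4, 18, 7),
    ("debug", 20, 1115.4, 409.6, 16, 4),
    ("design", 20, 3478.4, 615.6, 20, 0),
    ("document", 20, 503.6, 235.9, 17, 3),
    ("fix", 25, 2143.5, 1249.8, 19, 6),
    ("secure", 20, 386.2, 337.2, 14, 6),
    ("test", 20, 402.2, 315.2, 15, 5),
    ("understand", 20, 805.0, 295.3, 16, 4),
    ("mcp_unique", 81, 1241.6, 614.8, 40, 41),
]

total_n = sum(n for _, n, *_ in SUITES)

# Weight each suite mean by its task count to recover the overall mean.
baseline_overall = sum(n * base for _, n, base, _, _, _ in SUITES) / total_n
mcp_overall = sum(n * mcp for _, n, _, mcp, _, _ in SUITES) / total_n

faster_total = sum(faster for *_, faster, _ in SUITES)
slower_total = sum(slower for *_, slower in SUITES)

TOLERANCE_S = 0.1  # suite means are published to one decimal place

print(f"n={total_n}, baseline={baseline_overall:.1f}s, mcp={mcp_overall:.1f}s")
print(f"faster/slower = {faster_total} / {slower_total}")
print("overall row consistent:",
      abs(baseline_overall - 1401.9) < TOLERANCE_S
      and abs(mcp_overall - 653.0) < TOLERANCE_S)
```

With the values shown in the hunk, the weighted means and the summed faster/slower counts match the `Wall clock` summary row (251 tasks, 175 / 76). The agent-execution row cannot be checked this way because no per-suite agent-execution figures appear in the diff.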
