docs/BLOG_POST.md: 4 additions & 4 deletions
@@ -86,14 +86,14 @@ One finding from the refreshed IR pipeline: the Spearman correlation between ret

## The Cost Surprise

-One finding I didn't expect: MCP runs are cheaper.
+One finding I didn't expect after recomputing the cost section on a strict paired slice: MCP is not cheaper overall.

| Config | Mean Cost/Task | Total Cost |
|--------|---------------|------------|
-| Baseline | $0.75 | $175.68 |
-| MCP | $0.47 | $97.01 |
+| Baseline | $0.339 | $85.12 |
+| MCP | $0.352 | $88.35 |

-MCP-augmented runs cost less on average. The mechanism is straightforward: the truncated-source environment has less local code to read, so the agent processes fewer input tokens. MCP is also consistently faster across every suite, with time reductions ranging from about 29% (test) to 94% (design) in the current paired table.
+Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).

This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
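
The percentages quoted in the revised cost paragraph can be re-derived from the means it reports. The following is a minimal sketch, not the pipeline's own code: `pct_change` is a hypothetical helper name, and the inputs are copied by hand from the text and table above rather than read from `task_metrics.json`.

```python
def pct_change(baseline: float, treatment: float) -> float:
    """Relative change of treatment versus baseline, in percent."""
    return (treatment - baseline) / baseline * 100.0


# Means quoted in the revised text and table (n=251 paired tasks).
cost = pct_change(0.339, 0.352)      # mean cost per task, USD
wall = pct_change(1401.9, 653.0)     # mean wall-clock, seconds
agent = pct_change(1058.3, 299.3)    # mean agent execution time, seconds

print(f"cost/task:  {cost:+.1f}%")   # +3.8%
print(f"wall-clock: {wall:+.1f}%")   # -53.4%
print(f"agent exec: {agent:+.1f}%")  # -71.7%
```

The totals are consistent with the same slice: $85.12 / 251 ≈ $0.339 and $88.35 / 251 ≈ $0.352.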
docs/WHITE_PAPER_REPORT_V2.md: 24 additions & 1 deletion
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2

On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.

-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and compares both end-to-end wall clock and agent execution time from `task_metrics.json`.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock is much lower under MCP, but the task-level faster/slower split is nearly even (40/41), indicating a high-variance timing distribution.
+
+### 11.10 Correlation Analysis

| Correlation (Spearman rho) | Value | n | Interpretation |
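
The `mcp_unique` caveat in the added Section 11.9 (a much lower mean wall-clock alongside a near-even 40/41 faster/slower split) is what a few very large savings produce when most pairs differ only slightly. The sketch below illustrates the effect on made-up numbers. The timings are hypothetical and not taken from the report, `summarize_pairs` is an illustrative helper, and counting faster/slower by the sign of the per-task delta is an assumption about how the split was computed.

```python
from statistics import mean


def summarize_pairs(baseline_s, mcp_s):
    """Return (mean delta in seconds, tasks where MCP is faster, tasks where it is slower)."""
    deltas = [m - b for b, m in zip(baseline_s, mcp_s)]
    faster = sum(d < 0 for d in deltas)
    slower = sum(d > 0 for d in deltas)
    return mean(deltas), faster, slower


# Hypothetical paired wall-clock times in seconds (NOT from the report).
baseline = [100.0, 110.0, 120.0, 3000.0]
mcp = [105.0, 115.0, 110.0, 400.0]

mean_delta, faster, slower = summarize_pairs(baseline, mcp)
print(mean_delta, faster, slower)  # -650.0 2 2
```

Here one outlier pair drives the mean down by 650 seconds even though MCP is faster on only half of the tasks, which is why the mean and the task-level split are worth reading together.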
docs/technical_reports/TECHNICAL_REPORT_V1.md: 24 additions & 1 deletion
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2

On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.

-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and compares both end-to-end wall clock and agent execution time from `task_metrics.json`.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock is much lower under MCP, but the task-level faster/slower split is nearly even (40/41), indicating a high-variance timing distribution.
+
+### 11.10 Correlation Analysis

| Correlation (Spearman rho) | Value | n | Interpretation |