Add paired timing breakdown and align blog cost/timing narrative

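The headline timing and cost figures can be cross-checked from the per-suite table this commit adds in Section 11.9. Below is a minimal Python sketch, assuming only the rounded values published in the diff. It does not read `task_metrics.json`, and `SUITES`, `pct_delta`, and `weighted_mean` are illustrative names rather than anything in the repo.

```python
# Rows copied from the Section 11.9 timing table in this commit:
# (suite, n, baseline_wall_s, mcp_wall_s, mcp_faster, mcp_slower)
SUITES = [
    ("build",      25, 2555.9, 1547.4, 18,  7),
    ("debug",      20, 1115.4,  409.6, 16,  4),
    ("design",     20, 3478.4,  615.6, 20,  0),
    ("document",   20,  503.6,  235.9, 17,  3),
    ("fix",        25, 2143.5, 1249.8, 19,  6),
    ("secure",     20,  386.2,  337.2, 14,  6),
    ("test",       20,  402.2,  315.2, 15,  5),
    ("understand", 20,  805.0,  295.3, 16,  4),
    ("mcp_unique", 81, 1241.6,  614.8, 40, 41),
]


def pct_delta(baseline, mcp):
    """Relative change of MCP vs. baseline in percent (negative = MCP lower)."""
    return (mcp - baseline) / baseline * 100.0


def weighted_mean(rows, col):
    """Mean over all tasks, weighting each suite mean by its task count n."""
    total_n = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total_n


n_total = sum(r[1] for r in SUITES)
baseline_wall = weighted_mean(SUITES, 2)
mcp_wall = weighted_mean(SUITES, 3)
faster = sum(r[4] for r in SUITES)
slower = sum(r[5] for r in SUITES)

# Cost delta from the blog table's mean cost per task (Baseline vs. MCP).
cost_delta = pct_delta(0.339, 0.352)

if __name__ == "__main__":
    for suite, n, base, mcp, _, _ in SUITES:
        print(f"{suite:<11} n={n:<3} wall delta {pct_delta(base, mcp):+.1f}%")
    print(f"overall     n={n_total} wall {baseline_wall:.1f}s -> {mcp_wall:.1f}s "
          f"({pct_delta(baseline_wall, mcp_wall):+.1f}%), "
          f"faster/slower {faster}/{slower}")
    print(f"mean cost delta {cost_delta:+.1f}%")
```

Because the inputs are already rounded to 0.1 s, the recomputed overall means should agree with the summary table to the printed precision rather than exactly.
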
sjarmak · sjarmak · commit 4ed88cc5f4a6 · 2026-02-27T21:30:14.000Z
diff --git a/docs/BLOG_POST.md b/docs/BLOG_POST.md
@@ -86,14 +86,14 @@ One finding from the refreshed IR pipeline: the Spearman correlation between ret
 
 ## The Cost Surprise
 
-One finding I didn't expect: MCP runs are cheaper.
+One finding I didn't expect after recomputing the cost section on a strict paired slice: MCP is not cheaper overall.
 
 | Config | Mean Cost/Task | Total Cost |
 |--------|---------------|------------|
-| Baseline | $0.75 | $175.68 |
-| MCP | $0.47 | $97.01 |
+| Baseline | $0.339 | $85.12 |
+| MCP | $0.352 | $88.35 |
 
-MCP-augmented runs cost less on average. The mechanism is straightforward: the truncated-source environment has less local code to read, so the agent processes fewer input tokens. MCP is also consistently faster across every suite, with time reductions ranging from about 29% (test) to 94% (design) in the current paired table.
+Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).
 
 This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
 
diff --git a/docs/WHITE_PAPER_REPORT_V2.md b/docs/WHITE_PAPER_REPORT_V2.md
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2
 
 On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.
 
-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and reports both end-to-end wall-clock time and agent execution time from `task_metrics.json`, comparing baseline against MCP.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+|-------|---|-------------------|--------------|------------|-------------------|
+| build | 25 | 2555.9 | 1547.4 | -39.5% | 18 / 7 |
+| debug | 20 | 1115.4 | 409.6 | -63.3% | 16 / 4 |
+| design | 20 | 3478.4 | 615.6 | -82.3% | 20 / 0 |
+| document | 20 | 503.6 | 235.9 | -53.2% | 17 / 3 |
+| fix | 25 | 2143.5 | 1249.8 | -41.7% | 19 / 6 |
+| secure | 20 | 386.2 | 337.2 | -12.7% | 14 / 6 |
+| test | 20 | 402.2 | 315.2 | -21.6% | 15 / 5 |
+| understand | 20 | 805.0 | 295.3 | -63.3% | 16 / 4 |
+| mcp_unique | 81 | 1241.6 | 614.8 | -50.5% | 40 / 41 |
+
+| Metric | n | Baseline Mean (s) | MCP Mean (s) | Delta | MCP Faster/Slower |
+|--------|---|-------------------|--------------|-------|-------------------|
+| Wall clock | 251 | 1401.9 | 653.0 | -53.4% | 175 / 76 |
+| Agent execution | 251 | 1058.3 | 299.3 | -71.7% | 195 / 56 |
+
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock time is much lower under MCP (-50.5%), but the task-level faster/slower split is nearly even (40/41), so the mean improvement comes from a subset of tasks with large savings rather than a consistent per-task speedup.
+
+### 11.10 Correlation Analysis
 
 | Correlation (Spearman rho) | Value | n | Interpretation |
 |---------------------------|-------|---|---------------|
diff --git a/docs/technical_reports/TECHNICAL_REPORT_V1.md b/docs/technical_reports/TECHNICAL_REPORT_V1.md
@@ -1053,7 +1053,30 @@ Costs below are recomputed on the same **251 paired tasks** used in Section 11.2
 
 On this paired slice, MCP is **~3.8% higher cost** on average (+$0.013/task), not lower. Cost impact is suite-dependent: MCP is cheaper in `design`, `document`, `understand`, and `mcp_unique`, and more expensive in `build`, `debug`, `fix`, `secure`, and `test`.
 
-### 11.9 Correlation Analysis
+### 11.9 Timing Analysis
+
+Timing below uses the same **251 paired tasks** and reports both end-to-end wall-clock time and agent execution time from `task_metrics.json`, comparing baseline against MCP.
+
+| Suite | n | Baseline Wall (s) | MCP Wall (s) | Wall Delta | MCP Faster/Slower |
+|-------|---|-------------------|--------------|------------|-------------------|
+| build | 25 | 2555.9 | 1547.4 | -39.5% | 18 / 7 |
+| debug | 20 | 1115.4 | 409.6 | -63.3% | 16 / 4 |
+| design | 20 | 3478.4 | 615.6 | -82.3% | 20 / 0 |
+| document | 20 | 503.6 | 235.9 | -53.2% | 17 / 3 |
+| fix | 25 | 2143.5 | 1249.8 | -41.7% | 19 / 6 |
+| secure | 20 | 386.2 | 337.2 | -12.7% | 14 / 6 |
+| test | 20 | 402.2 | 315.2 | -21.6% | 15 / 5 |
+| understand | 20 | 805.0 | 295.3 | -63.3% | 16 / 4 |
+| mcp_unique | 81 | 1241.6 | 614.8 | -50.5% | 40 / 41 |
+
+| Metric | n | Baseline Mean (s) | MCP Mean (s) | Delta | MCP Faster/Slower |
+|--------|---|-------------------|--------------|-------|-------------------|
+| Wall clock | 251 | 1401.9 | 653.0 | -53.4% | 175 / 76 |
+| Agent execution | 251 | 1058.3 | 299.3 | -71.7% | 195 / 56 |
+
+MCP is substantially faster overall in both wall-clock and agent-execution time. The one caveat is `mcp_unique`: mean wall-clock time is much lower under MCP (-50.5%), but the task-level faster/slower split is nearly even (40/41), so the mean improvement comes from a subset of tasks with large savings rather than a consistent per-task speedup.
+
+### 11.10 Correlation Analysis
 
 | Correlation (Spearman rho) | Value | n | Interpretation |
 |---------------------------|-------|---|---------------|