stats: replace SEM z-intervals with 10K-resample bootstrap CIs
Replace normality-assuming SEM z-intervals with percentile bootstrap
CIs (10,000 resamples, seed=42) throughout the white paper and blog
post. This fixes the inconsistency where Appendix A claimed bootstrap
but Section 11 used z-intervals.
Key changes:
- Exclude 1 errored baseline task (openlibrary-solr-boolean-fix-001),
reducing valid pairs from 251 to 250
- Overall delta: +0.047 (95% bootstrap CI: [+0.007, +0.085])
- SDLC delta: -0.019 (CI includes zero, not significant)
- MCP-unique delta: +0.183 (CI excludes zero, significant)
- Fix suite: n=25→24, delta now -0.015 (was +0.012 with errored task)
- Update statistics.py default n_bootstrap from 1000 to 10000
- Add standalone compute_bootstrap_cis.py for reproducible CI generation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
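The percentile bootstrap this commit adopts can be sketched as follows. This is an illustrative sketch, not the actual code from statistics.py or compute_bootstrap_cis.py; the function and variable names are hypothetical, but the parameters mirror the commit (10,000 resamples, seed=42, 95% interval):

```python
import numpy as np

def percentile_bootstrap_ci(deltas, n_bootstrap=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of paired per-task deltas.

    Resamples the delta vector with replacement n_bootstrap times and
    takes the empirical 2.5th/97.5th percentiles of the resampled means,
    with no normality assumption (unlike an SEM-based z-interval).
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    n = len(deltas)
    # Each resample draws n tasks with replacement; record its mean delta.
    idx = rng.integers(0, n, size=(n_bootstrap, n))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return deltas.mean(), (lo, hi)

# Demo on synthetic per-task deltas (illustrative data, not benchmark results).
demo = np.random.default_rng(0).normal(0.05, 0.3, size=250)
mean, (lo, hi) = percentile_bootstrap_ci(demo)
```

A delta is reported as statistically significant when the resulting interval excludes zero, which is the criterion the updated numbers below apply.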
docs/BLOG_POST.md (10 additions, 10 deletions)
@@ -22,7 +22,7 @@ Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Docu
 
 ## The Headline: Near-Zero Overall, But the Spread Is the Story
 
-After running 251 valid task pairs across all SDLC suites plus 10 MCP-unique suites (170 SDLC + 81 MCP-unique), MCP shows a small but statistically significant positive effect: baseline mean reward 0.591, MCP mean reward 0.640, delta **+0.049** (95% CI: [+0.010, +0.088]).
+After running 250 valid task pairs across all SDLC suites plus 11 MCP-unique suites (169 SDLC + 81 MCP-unique, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
 
 But that modest average obscures the real story, because the delta swings from **-0.183** to **+0.440** depending on the task type. That spread — from MCP hurting to MCP helping dramatically — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
@@ -33,14 +33,14 @@ The strongest SDLC gain is the Understand suite. MCP-unique tasks show a substan
 
 | Suite | Tasks | Baseline Mean | MCP Mean | Delta |
-**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% CI: [+0.115, +0.252]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
+**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
 
-**Understand tasks** show the strongest SDLC gain at +0.190 (0.661 to 0.851, 95% CI: [+0.024, +0.357]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
+**Understand tasks** show the strongest SDLC gain at +0.190 (0.660 to 0.851, 95% CI: [+0.043, +0.361]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
 
-**Documentation tasks** show a modest positive at +0.048 (95% CI: [+0.011, +0.085]). The agent already does well on documentation with full local code (0.847), and MCP nudges it slightly higher.
+**Documentation tasks** show a modest positive at +0.048 (95% CI: [+0.015, +0.088]). The agent already does well on documentation with full local code (0.847), and MCP nudges it slightly higher.
 
 ## Where MCP Doesn't Help (or Hurts)
@@ -51,11 +51,11 @@ MCP hurts on **Debug** (-0.183) and **Build** (-0.121). **Design** (-0.036) and
 
 | Debug | 20 | 0.670 | 0.487 |**-0.183**|
 | Build | 25 | 0.494 | 0.372 | -0.121 |
 | Design | 20 | 0.753 | 0.718 | -0.036 |
-| Secure | 20 | 0.670 | 0.659 | -0.010 |
+| Fix | 24 | 0.499 | 0.484 | -0.015 |
+| Secure | 20 | 0.669 | 0.659 | -0.010 |
 | Test | 20 | 0.480 | 0.480 | +0.000 |
-| Fix | 25 | 0.479 | 0.491 | +0.012 |
 
-The **Debug** result is the clearest negative signal: MCP underperforms baseline by -0.183 (95% CI: [-0.304, -0.062], excludes zero). Build is also materially negative (-0.121). These are local execution-and-modification workflows, and adding a remote retrieval layer does not reliably help the agent get to the actual code change faster or better.
+The **Debug** result is the clearest negative signal: MCP underperforms baseline by -0.183 (95% CI: [-0.301, -0.067], excludes zero). Build is also materially negative (-0.121). These are local execution-and-modification workflows, and adding a remote retrieval layer does not reliably help the agent get to the actual code change faster or better.
 
 Fix tasks have the lowest MCP tool ratio of any suite (35% of tool calls use MCP tools) and the highest local tool call count (39.8 per task). Bug-fixing is editing work. The agent needs to read a stack trace, find the offending code, change it, and run the tests. The relevant context is usually local. Adding a remote search layer to that workflow doesn't help — it just adds latency and another thing to do before getting to the actual fix.
@@ -115,7 +115,7 @@ The February 6th QA audit found 28 issues (9 critical) in the benchmark infrastr
 
 ## What I Don't Know Yet
 
-I want to be clear about the limitations. These are results from a single agent (Claude Code), a single MCP provider (Sourcegraph), running Claude Haiku 4.5. The sample sizes are meaningful (251 valid pairs) and the overall effect is statistically significant (95% CI excludes zero), but individual sub-suite confidence intervals are wide enough that some suite-level conclusions could shift with more data. Multi-trial evaluation with bootstrap confidence intervals is planned but not yet complete.
+I want to be clear about the limitations. These are results from a single agent (Claude Code), a single MCP provider (Sourcegraph), running Claude Haiku 4.5. The sample sizes are meaningful (250 valid pairs) and the overall effect is statistically significant (95% bootstrap CI excludes zero), but individual sub-suite confidence intervals are wide enough that some suite-level conclusions could shift with more data. Each task has a single trial — there's no within-task variance estimate, so the CIs capture cross-task variability only. Multi-trial evaluation is planned but not yet complete.
 
 The moderate correlation between retrieval quality and task outcomes (Spearman r=0.395, p=0.041) confirms that finding the right files helps — but it's not the whole story. What else matters? Is it the structure of the tool output? The way search-first workflows shape the agent's reasoning? Some interaction between retrieval strategy and the agent's existing capabilities? I don't know, and I think the answer matters a lot — both for how we build code intelligence tools and for how we design agent workflows.
@@ -133,7 +133,7 @@ I started this project because I was drowning in noise. Every tool claims to "su
 
 Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. MCP-unique security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
 
-They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (+0.012) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 251 pairs is +0.049 (95% CI: [+0.010, +0.088]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
+They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
 
 And there's a third category — tasks where the retrieval metrics are basically the same but outcomes still differ — that I can't fully explain yet and might be the most important one to understand.