Commit 83edf34

sjarmak and claude committed
docs: clarify artifact config pairing for MCP-unique suites
SDLC suites use -direct configs (code changes), MCP-unique suites use -artifact configs (answer.json scored against oracle). Added explicit warnings, comparison table rows, and config pairing section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b2c152e commit 83edf34

4 files changed: 156 additions and 18 deletions

AGENTS.md

Lines changed: 11 additions & 6 deletions
@@ -58,12 +58,17 @@ Use these defaults unless there is a task-specific reason not to.
 
 ## Evaluation Configs
 Config names encode three dimensions: `{agent}-{source}-{verifier}`.
-Standard pairing: **baseline-local-direct** (full local code, no MCP) and
-**mcp-remote-direct** (source deleted, Sourcegraph MCP). Artifact evaluation
-uses **baseline-local-artifact** + **mcp-remote-artifact** (review.json output).
-MCP configs use `Dockerfile.sg_only` or `Dockerfile.artifact_only` so the
-agent must discover code via MCP tools. The verifier clones the mirror repo
-at verification time and overlays agent changes before scoring.
+
+**SDLC suites** (`ccb_build`, `ccb_debug`, etc.): use **baseline-local-direct**
++ **mcp-remote-direct**. Agent produces code changes; verifier checks git diffs.
+
+**MCP-unique suites** (`ccb_mcp_*`): use **baseline-local-artifact** +
+**mcp-remote-artifact**. Agent produces `answer.json`; verifier scores against
+oracle. Never use `-direct` configs for MCP-unique suites.
+
+MCP configs use `Dockerfile.sg_only` (direct) or `Dockerfile.artifact_only`
+(artifact) so the agent must discover code via MCP tools. The verifier clones
+the mirror repo at verification time and overlays agent changes before scoring.
 See `docs/CONFIGS.md` for the full config matrix.
 
 ## Standard Workflow

CLAUDE.md

Lines changed: 11 additions & 6 deletions
@@ -60,12 +60,17 @@ Use these defaults unless there is a task-specific reason not to.
 
 ## Evaluation Configs
 Config names encode three dimensions: `{agent}-{source}-{verifier}`.
-Standard pairing: **baseline-local-direct** (full local code, no MCP) and
-**mcp-remote-direct** (source deleted, Sourcegraph MCP). Artifact evaluation
-uses **baseline-local-artifact** + **mcp-remote-artifact** (review.json output).
-MCP configs use `Dockerfile.sg_only` or `Dockerfile.artifact_only` so the
-agent must discover code via MCP tools. The verifier clones the mirror repo
-at verification time and overlays agent changes before scoring.
+
+**SDLC suites** (`ccb_build`, `ccb_debug`, etc.): use **baseline-local-direct**
++ **mcp-remote-direct**. Agent produces code changes; verifier checks git diffs.
+
+**MCP-unique suites** (`ccb_mcp_*`): use **baseline-local-artifact** +
+**mcp-remote-artifact**. Agent produces `answer.json`; verifier scores against
+oracle. Never use `-direct` configs for MCP-unique suites.
+
+MCP configs use `Dockerfile.sg_only` (direct) or `Dockerfile.artifact_only`
+(artifact) so the agent must discover code via MCP tools. The verifier clones
+the mirror repo at verification time and overlays agent changes before scoring.
 See `docs/CONFIGS.md` for the full config matrix.
 
 ## Standard Workflow

docs/CONFIGS.md

Lines changed: 121 additions & 5 deletions
@@ -24,12 +24,23 @@ Config names encode three independent dimensions:
 |---|---|---|---|---|---|
 | `baseline-local-direct` | No MCP | Full source | Git changes | `none` | Original |
 | `mcp-remote-direct` | MCP | Source deleted | Git changes | `sourcegraph_full` | `Dockerfile.sg_only` |
+| `mcp-scip-remote-direct` | MCP + SCIP | Source deleted | Git changes | `sourcegraph_full` | `Dockerfile.sg_only` |
 | `baseline-local-artifact` | No MCP | Full source | `review.json` | `none` | `Dockerfile.artifact_only` |
 | `mcp-remote-artifact` | MCP | Source deleted | `review.json` | `artifact_full` | `Dockerfile.artifact_only` |
+| `mcp-scip-remote-artifact` | MCP + SCIP | Source deleted | `review.json` | `artifact_full` | `Dockerfile.artifact_only` |
 
-**Standard SDLC suites** use `baseline-local-direct` + `mcp-remote-direct`.
-**Artifact evaluation** uses `baseline-local-artifact` + `mcp-remote-artifact`
-(set via `FULL_CONFIG=mcp-remote-artifact`).
+**Standard SDLC suites** (`ccb_build`, `ccb_debug`, etc.) use
+`baseline-local-direct` + `mcp-remote-direct`. The agent produces code changes
+and the verifier checks git diffs / test results.
+
+**MCP-unique suites** (`ccb_mcp_*`) use `baseline-local-artifact` +
+`mcp-remote-artifact`. These are retrieval/analysis tasks — the agent produces
+`/workspace/answer.json` and the verifier scores it against an oracle. Do NOT
+run MCP-unique suites with `-direct` configs; the verifier expects an artifact,
+not code changes.
+
+**SCIP ablation** uses `mcp-scip-remote-direct` or `mcp-scip-remote-artifact`
+(requires branch swap pre-flight; see SCIP Ablation section below).
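
The naming scheme and pairing rule in this hunk can be sketched as a small pre-run check. This is a minimal illustration, not code from the repository: `ConfigName`, `parse_config`, `expected_verifier`, and `check_pairing` are hypothetical names, and the sketch assumes the `ccb_mcp_` prefix is the only signal that a suite is MCP-unique.

```python
from typing import NamedTuple


class ConfigName(NamedTuple):
    agent: str     # e.g. "baseline", "mcp", or "mcp-scip"
    source: str    # "local" or "remote"
    verifier: str  # "direct" or "artifact"


def parse_config(name: str) -> ConfigName:
    """Split `{agent}-{source}-{verifier}` into its three dimensions."""
    # Split from the right so a hyphenated agent such as "mcp-scip" stays whole.
    agent, source, verifier = name.rsplit("-", 2)
    return ConfigName(agent, source, verifier)


def expected_verifier(suite: str) -> str:
    """Assumption: MCP-unique suites are exactly those prefixed `ccb_mcp_`."""
    return "artifact" if suite.startswith("ccb_mcp_") else "direct"


def check_pairing(suite: str, config: str) -> None:
    """Raise ValueError when a suite is paired with the wrong verifier mode."""
    cfg = parse_config(config)
    want = expected_verifier(suite)
    if cfg.verifier != want:
        raise ValueError(
            f"{suite} expects a -{want} config, got {config!r}"
        )
```

Calling `check_pairing("ccb_mcp_example", "mcp-remote-direct")` (with a made-up suite name) would raise, which mirrors the "Do NOT run MCP-unique suites with `-direct` configs" warning.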
 
 ### Legacy Names
 
@@ -247,11 +258,116 @@ This flag is only meaningful when used with `--selection-file`.
 
 | Feature | Standard suites | MCP-unique suites |
 |---------|----------------|-------------------|
+| **Config pair** | `baseline-local-direct` + `mcp-remote-direct` | `baseline-local-artifact` + `mcp-remote-artifact` |
 | Selection file | `selected_benchmark_tasks.json` | `selected_mcp_unique_tasks.json` |
 | Suite prefix | `ccb_<phase>` | `ccb_mcp_<category>` |
+| Agent output | Code changes (git diff) | `/workspace/answer.json` |
 | Verifier script | `tests/test.sh` | `tests/eval.sh` |
 | Oracle format | task-specific | `oracle_answer.json` + `oracle_checks.py` |
-| Local repo | full workspace | 1 local_checkout repo only |
-| MCP-Full behavior | truncated source | no source clone |
+| Baseline Dockerfile | `Dockerfile` (full repo clone) | `Dockerfile` (full repo clone) |
+| MCP Dockerfile | `Dockerfile.sg_only` (truncated source) | `Dockerfile.artifact_only` (empty workspace) |
 
 See `docs/MCP_UNIQUE_TASKS.md` for full task authoring and evaluation details.
+
+## SCIP Precise Indexing Ablation
+
+The `mcp-scip-*` configs measure the impact of SCIP precise code intelligence
+on MCP-enabled benchmark runs. SCIP provides compiler-accurate go-to-definition
+and find-references (vs search-based heuristics on the control branch).
+
+### How It Works
+
+At the **agent/Harbor level**, `mcp-scip-remote-direct` is identical to
+`mcp-remote-direct` — same Dockerfile, same MCP tools, same internal
+`mcp_type=sourcegraph_full`. The difference is purely **server-side**: the
+Sourcegraph instance has SCIP auto-indexing enabled for one branch and disabled
+for another.
+
+Two Sourcegraph configuration policies control indexing:
+
+| Policy | Branch | `indexingEnabled` | ID |
+|--------|--------|-------------------|-----|
+| Benchmarks: Main (No SCIP) | `main` | `false` | `...MTA2Ng==` |
+| Benchmarks: SCIP Enabled | `scip-enabled` | `true` | `...MTA2Nw==` |
+
+Both policies target `github.com/sg-benchmarks/*` with `GIT_TREE` type.
+
+### Deep Search Limitation
+
+Deep Search only indexes the **default branch HEAD**. It cannot be pointed at a
+specific branch. To ensure Deep Search uses the SCIP-indexed code, the default
+branch must be swapped before running benchmarks.
+
+### Pre-Flight: Branch Swap
+
+Before running SCIP-enabled benchmarks, swap the default branch on all
+sg-benchmarks repos:
+
+```bash
+# Before SCIP runs (mcp-scip-remote-direct):
+./scripts/swap_default_branch.sh scip-enabled
+# Wait for Sourcegraph to re-index (~30-60 min for full org)
+
+# Before control runs (mcp-remote-direct) or to restore:
+./scripts/swap_default_branch.sh main
+```
+
+The swap script:
+- Patches all 1,592 sg-benchmarks repos via GitHub API (`--parallel 10`)
+- Skips repos already set to the target branch
+- Skips empty repos without the target branch
+- Logs results to `/tmp/scip_branch_swap/`
+- Supports `--dry-run` for previewing
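
The internals of `swap_default_branch.sh` are not shown in this diff. The skip rules listed above can be sketched roughly as follows, assuming the script changes the default branch through the GitHub REST endpoint `PATCH /repos/{owner}/{repo}`. `needs_swap` and `build_swap_request` are hypothetical helper names, and the request is only constructed here, never sent.

```python
import json
import urllib.request
from typing import Iterable


def needs_swap(current_default: str, target: str, branches: Iterable[str]) -> bool:
    """Mirror the skip rules: skip repos already on the target branch,
    and skip repos (for example empty ones) that lack the target branch."""
    return current_default != target and target in set(branches)


def build_swap_request(org: str, repo: str, target: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that sets a repo's default branch."""
    url = f"https://api.github.com/repos/{org}/{repo}"
    body = json.dumps({"default_branch": target}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # token value is a placeholder
        },
    )
```

Separating the decision (`needs_swap`) from the request construction is one way to make a `--dry-run` mode cheap: the decision can be logged without ever building or sending a request.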
+
+### Running the Ablation
+
+```bash
+# 1. Swap to SCIP-enabled
+./scripts/swap_default_branch.sh scip-enabled
+# 2. Wait for indexing to complete
+# 3. Run SCIP config
+FULL_CONFIG=mcp-scip-remote-direct configs/run_selected_tasks.sh
+
+# 4. Swap back to control
+./scripts/swap_default_branch.sh main
+# 5. Wait for re-index
+# 6. Run standard MCP config
+FULL_CONFIG=mcp-remote-direct configs/run_selected_tasks.sh
+```
+
+### Comparing Results
+
+Use `compare_configs.py` with both config names to see where SCIP helps/hurts:
+
+```bash
+python3 scripts/compare_configs.py --run <run_dir> \
+--configs mcp-remote-direct mcp-scip-remote-direct
+```
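
The output format of `compare_configs.py` is not shown in this diff. The comparison it describes can be sketched as below, assuming per-task scores for each config are available as plain dictionaries keyed by task ID. `score_deltas` and `summarize` are hypothetical names and are not part of the script.

```python
from typing import Dict, List


def score_deltas(control: Dict[str, float], treatment: Dict[str, float]) -> Dict[str, float]:
    """Per-task score change (treatment minus control) for tasks present in both runs."""
    shared = sorted(control.keys() & treatment.keys())
    return {task: treatment[task] - control[task] for task in shared}


def summarize(deltas: Dict[str, float]) -> Dict[str, List[str]]:
    """Group task IDs by whether the treatment config helped, hurt, or changed nothing."""
    summary: Dict[str, List[str]] = {"helped": [], "hurt": [], "unchanged": []}
    for task, delta in deltas.items():
        if delta > 0:
            summary["helped"].append(task)
        elif delta < 0:
            summary["hurt"].append(task)
        else:
            summary["unchanged"].append(task)
    return summary
```

In the ablation described here, `control` would hold the `mcp-remote-direct` scores and `treatment` the `mcp-scip-remote-direct` scores. Tasks missing from either run are dropped rather than counted as regressions.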
+
+### SCIP Indexing Coverage
+
+Sourcegraph auto-indexing detects languages and runs the appropriate SCIP
+indexer per repo:
+
+| Language | Indexer | Example repos |
+|----------|---------|---------------|
+| Python | `scip-python` | ansible, django, astropy |
+| Go | `scip-go` | cilium, autoscaler, argo-cd |
+| TypeScript/JS | `scip-typescript` | vscode, cal.com, copilot-arena |
+| Java | `scip-java` | camel |
+| C++ | `scip-clang` | bustub, curl, log4cxx |
+| C# | `scip-dotnet` | aspnetcore, CodeCoverageSummary |
+
+Not all repos may successfully index (complex build setups). Check indexing
+status in the Sourcegraph admin UI after swapping branches.
+
+### Branch Creation Script
+
+If new repos are added to sg-benchmarks, create `scip-enabled` branches:
+
+```bash
+./scripts/create_scip_branches.sh [--dry-run] [--parallel N]
+```
+
+This creates a `scip-enabled` branch pointing to the same commit as `main` HEAD
+for all repos in the org. Empty repos are skipped.

docs/MCP_UNIQUE_TASKS.md

Lines changed: 13 additions & 1 deletion
@@ -270,10 +270,22 @@ Each task has `tests/task_spec.json` with explicit criteria:
 
 ## Running Tasks
 
+### Config Pairing
+
+MCP-unique tasks always use **artifact** configs:
+
+- `baseline-local-artifact` — full repos cloned locally, agent writes `answer.json`
+- `mcp-remote-artifact` — empty workspace, agent uses MCP tools, writes `answer.json`
+
+Do NOT use `-direct` configs for MCP-unique suites. Direct configs expect code
+changes (git diffs); MCP-unique verifiers expect an `answer.json` artifact
+scored against `oracle_answer.json`. The runner script auto-selects artifact
+configs when launched with `--selection-file configs/selected_mcp_unique_tasks.json`.
+
 ### Full Starter Pack
 
 ```bash
-# Both configs (baseline + MCP-Full)
+# Both configs (baseline-local-artifact + mcp-remote-artifact)
 configs/run_selected_tasks.sh \
 --selection-file configs/selected_mcp_unique_tasks.json \
 --parallel 8
