
Commit 09784e3

sjarmak and claude committed
feat: add Local File Editing guidance to V4 preamble and prep MCP distraction rerun
- Add "Local File Editing" section to V4_PREAMBLE_TEMPLATE explaining that local source files may be truncated and agent should edit locally after reading remotely via MCP (reduces over-reading distraction pattern)
- Add 9 new SDLC tasks to selected_benchmark_tasks.json (157→166 tasks): build(1), document(4), test(2), understand(2)
- Rebuild ground_truth_files.json (19 file-level GT entries)
- Create rerun_mcp_distracted.sh for targeted rerun of 36 distracted tasks
- Add consolidate_staging.py for staging run housekeeping
- Add find_mcp_distracted.py for identifying MCP-distracted tasks
- Add handoff doc for MCP distraction rerun plan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7772afc commit 09784e3

7 files changed

+1530
-2252
lines changed

agents/claude_baseline_agent.py

Lines changed: 9 additions & 0 deletions
@@ -111,6 +111,15 @@
 
 {repo_scope}
 
+## Local File Editing
+
+Local source files may be truncated (empty). Use Sourcegraph to *read and understand* code, then *edit local files* based on what you learn. The verifier restores the full codebase and applies your local edits on top.
+
+- **Search/Read remotely:** Use MCP tools to find files, understand patterns, read implementations
+- **Edit locally:** Use Edit, Write, and Bash to modify files in your working directory
+- **Don't over-read:** Once you understand the pattern, start implementing. Reading 20+ remote files without writing code wastes time.
+- **Verify locally:** Run tests with Bash to check your changes
+
 ## Tool Selection Logic
 
 **Start here:**

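The "Don't over-read" guidance in the preamble can be turned into a measurable check. The following is a minimal sketch, not code from this commit. It assumes a tool-call trace is available as an ordered list of tool names, that remote MCP tools share an `mcp__` name prefix, and that local edits use the `Edit` and `Write` tools named in the preamble. The function names are hypothetical, and the threshold of 20 is taken from the preamble's "20+ remote files" wording.

```python
from typing import Iterable

LOCAL_EDIT_TOOLS = {"Edit", "Write"}  # local editing tools named in the preamble
REMOTE_PREFIX = "mcp__"               # assumption: MCP tool names share this prefix


def remote_reads_before_first_edit(tool_calls: Iterable[str]) -> int:
    """Count remote MCP calls made before the first local edit."""
    count = 0
    for name in tool_calls:
        if name in LOCAL_EDIT_TOOLS:
            break  # the agent has started implementing; stop counting
        if name.startswith(REMOTE_PREFIX):
            count += 1
    return count


def is_over_reading(tool_calls: Iterable[str], threshold: int = 20) -> bool:
    """Flag a trace that reaches `threshold` remote calls before any edit."""
    return remote_reads_before_first_edit(tool_calls) >= threshold
```

Calls such as `Bash` are ignored by the counter, so running local tests between remote reads does not change the result.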
configs/ground_truth_files.json

Lines changed: 393 additions & 2241 deletions
Large diffs are not rendered by default.

configs/rerun_mcp_distracted.sh

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+#!/bin/bash
+# Targeted rerun of 36 MCP-distracted tasks (SG_full reward < baseline - 0.10).
+#
+# Root causes:
+#   (a) 6 code review tasks — Dockerfile.sg_only bug (defect injection missing)
+#   (b) 11 doc-gen/understand/debug tasks — genuine mild distraction
+#   (c) 19 tasks with SG_full=0.0 — likely infra failures (rate limits) + navprove bugs
+#
+# The V4 preamble now includes "Local File Editing" guidance to reduce over-reading.
+# This rerun tests whether the preamble fix improves SG_full scores.
+#
+# Usage:
+#   ./configs/rerun_mcp_distracted.sh                 # all 36 tasks
+#   ./configs/rerun_mcp_distracted.sh --suite build   # only build suite
+#   ./configs/rerun_mcp_distracted.sh --full-only     # SG_full only (skip baseline)
+
+set -e
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# Parse args
+SUITE_FILTER=""
+EXTRA_ARGS=()
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --suite) SUITE_FILTER="$2"; shift 2 ;;
+        *) EXTRA_ARGS+=("$1"); shift ;;
+    esac
+done
+
+run_suite() {
+    local suite=$1
+    shift
+    local tasks=("$@")
+
+    if [ -n "$SUITE_FILTER" ] && [ "$SUITE_FILTER" != "$suite" ]; then
+        return
+    fi
+
+    echo ""
+    echo "=========================================="
+    echo "Rerunning $suite: ${#tasks[@]} distracted tasks"
+    echo "=========================================="
+
+    local task_flags=""
+    for t in "${tasks[@]}"; do
+        task_flags="$task_flags --task $t"
+    done
+
+    "$SCRIPT_DIR/${suite}_2config.sh" $task_flags "${EXTRA_ARGS[@]}"
+}
+
+# ── build (3 tasks) ──
+run_suite build \
+    flipt-dep-refactor-001 \
+    rust-subtype-relation-refac-001 \
+    flink-pricing-window-feat-001
+
+# ── debug (5 tasks) ──
+run_suite debug \
+    envoy-duplicate-headers-debug-001 \
+    istio-xds-destrul-debug-001 \
+    qutebrowser-download-regression-prove-001 \
+    qutebrowser-bookmark-regression-prove-001 \
+    qutebrowser-tab-regression-prove-001
+
+# ── design (5 tasks) ──
+run_suite design \
+    django-pre-validate-signal-design-001 \
+    k8s-dra-allocation-impact-001 \
+    camel-routing-arch-001 \
+    kafka-flink-streaming-arch-001 \
+    flipt-protobuf-metadata-design-001
+
+# ── document (5 tasks) ──
+run_suite document \
+    k8s-controller-mgr-doc-gen-001 \
+    k8s-applyconfig-doc-gen-001 \
+    envoy-migration-doc-gen-001 \
+    k8s-clientgo-doc-gen-001 \
+    k8s-fairqueuing-doc-gen-001
+
+# ── fix (1 task) ──
+run_suite fix \
+    django-modelchoice-fk-fix-001
+
+# ── secure (6 tasks) ──
+run_suite secure \
+    django-policy-enforcement-001 \
+    curl-cve-triage-001 \
+    django-sensitive-file-exclusion-001 \
+    grpcurl-transitive-vuln-001 \
+    flipt-degraded-context-fix-001 \
+    django-cross-team-boundary-001
+
+# ── test (8 tasks) ──
+run_suite test \
+    terraform-code-review-001 \
+    kafka-security-review-001 \
+    vscode-code-review-001 \
+    ghost-code-review-001 \
+    envoy-code-review-001 \
+    curl-security-review-001 \
+    pandas-groupby-perf-001 \
+    test-unitgen-py-001
+
+# ── understand (3 tasks) ──
+run_suite understand \
+    kafka-message-lifecycle-qa-001 \
+    terraform-state-backend-handoff-001 \
+    cilium-ebpf-fault-qa-001
+
+echo ""
+echo "=========================================="
+echo "MCP distraction rerun complete"
+echo "=========================================="

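The selection rule stated in the script header (SG_full reward < baseline - 0.10) can be sketched in Python. This is not the commit's `find_mcp_distracted.py`, whose contents are not shown here. The input shape (a mapping from task ID to a pair of rewards) and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

DISTRACTION_MARGIN = 0.10  # margin quoted in the script header


def find_distracted(rewards: Dict[str, Tuple[float, float]],
                    margin: float = DISTRACTION_MARGIN) -> List[str]:
    """Return task IDs whose SG_full reward is more than `margin` below baseline.

    `rewards` maps task_id -> (baseline_reward, sg_full_reward).
    This input layout is hypothetical.
    """
    return sorted(
        task_id
        for task_id, (baseline, sg_full) in rewards.items()
        if sg_full < baseline - margin
    )
```

With placeholder data such as `{"task-a": (0.8, 0.5), "task-b": (0.5, 0.45)}`, only `task-a` is selected, because `task-b` falls short of baseline by less than the margin.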
configs/selected_benchmark_tasks.json

Lines changed: 209 additions & 11 deletions
@@ -5,7 +5,7 @@
     "generated_by": "SDLC suite migration from migration_map.json",
     "generated_date": "2026-02-18",
     "total_available": 835,
-    "total_selected": 157,
+    "total_selected": 166,
     "migration_source": "migration_map.json (157 mapped tasks across 8 SDLC suites)",
     "target_total": 170,
     "target_note": "ccb_test and ccb_document target 20 each (see docs/backlog_ccb_test.json, docs/backlog_ccb_document.json)"
@@ -46,33 +46,33 @@
       "Debugging": 14,
       "Documentation": 13,
       "Implementation (bug fix)": 33,
-      "Implementation (feature)": 25,
+      "Implementation (feature)": 26,
       "Implementation (refactor)": 2,
       "Implementation (refactoring)": 2,
       "Planning (impact analysis)": 2,
       "Refactoring": 4,
-      "Requirements & Discovery": 38,
+      "Requirements & Discovery": 44,
       "Security review": 3,
-      "Testing & QA": 12
+      "Testing & QA": 14
     },
     "tasks_per_benchmark": {
-      "ccb_build": 25,
+      "ccb_build": 26,
       "ccb_debug": 20,
       "ccb_design": 20,
-      "ccb_document": 13,
+      "ccb_document": 17,
       "ccb_fix": 25,
       "ccb_secure": 20,
-      "ccb_test": 14,
-      "ccb_understand": 20
+      "ccb_test": 16,
+      "ccb_understand": 22
     },
     "tasks_per_language": {
       "c": 10,
       "cpp": 20,
       "csharp": 3,
-      "go": 56,
-      "java": 16,
+      "go": 61,
+      "java": 17,
       "javascript": 5,
-      "python": 33,
+      "python": 36,
       "python,cpp": 1,
       "rust": 4,
       "typescript": 9
@@ -3533,6 +3533,204 @@
       "context_length_source": "task_metrics_run",
       "files_count": 6,
       "files_count_source": "task_metrics_run"
+    },
+    {
+      "task_id": "cgen-deps-install-001",
+      "benchmark": "ccb_build",
+      "sdlc_phase": "Implementation (feature)",
+      "language": "python",
+      "difficulty": "medium",
+      "category": "dependency-inference",
+      "repo": "",
+      "mcp_benefit_score": 0.55,
+      "mcp_breakdown": {
+        "context_complexity": 0.5,
+        "cross_file_deps": 0.4,
+        "semantic_search_potential": 0.6,
+        "task_category_weight": 0.7
+      },
+      "selection_rationale": "New SDLC task: dependency inference from DIBench",
+      "task_dir": "ccb_build/cgen-deps-install-001",
+      "context_length": 500000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 8,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "django-composite-field-recover-001",
+      "benchmark": "ccb_understand",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "python",
+      "difficulty": "hard",
+      "category": "enterprise_knowledge_fragmentation",
+      "repo": "django/django",
+      "mcp_benefit_score": 0.85,
+      "mcp_breakdown": {
+        "context_complexity": 0.9,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.8,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "New SDLC task: knowledge fragmentation recovery across Django packages",
+      "task_dir": "ccb_understand/django-composite-field-recover-001",
+      "context_length": 850000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 15,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "django-template-inherit-recall-001",
+      "benchmark": "ccb_understand",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "python",
+      "difficulty": "hard",
+      "category": "enterprise_institutional_memory",
+      "repo": "django/django",
+      "mcp_benefit_score": 0.85,
+      "mcp_breakdown": {
+        "context_complexity": 0.9,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.8,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "New SDLC task: institutional memory recall for Django template regression",
+      "task_dir": "ccb_understand/django-template-inherit-recall-001",
+      "context_length": 850000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 12,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-changelog-001",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "go",
+      "difficulty": "medium",
+      "category": "changelog_generation",
+      "repo": "hashicorp/terraform",
+      "mcp_benefit_score": 0.82,
+      "mcp_breakdown": {
+        "context_complexity": 0.85,
+        "cross_file_deps": 0.75,
+        "semantic_search_potential": 0.85,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "New SDLC task: changelog generation requiring cross-module change discovery",
+      "task_dir": "ccb_document/docgen-changelog-001",
+      "context_length": 750000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 10,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-changelog-002",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "go",
+      "difficulty": "medium",
+      "category": "changelog_generation",
+      "repo": "flipt-io/flipt",
+      "mcp_benefit_score": 0.82,
+      "mcp_breakdown": {
+        "context_complexity": 0.85,
+        "cross_file_deps": 0.75,
+        "semantic_search_potential": 0.85,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "New SDLC task: release notes generation requiring API change discovery",
+      "task_dir": "ccb_document/docgen-changelog-002",
+      "context_length": 600000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 10,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-inline-002",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "java",
+      "difficulty": "hard",
+      "category": "inline_docstring_generation",
+      "repo": "apache/kafka",
+      "mcp_benefit_score": 0.88,
+      "mcp_breakdown": {
+        "context_complexity": 0.9,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.9,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "New SDLC task: Javadoc generation requiring thread-safety and performance analysis",
+      "task_dir": "ccb_document/docgen-inline-002",
+      "context_length": 800000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 12,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-onboard-001",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Requirements & Discovery",
+      "language": "go",
+      "difficulty": "hard",
+      "category": "onboarding_guide",
+      "repo": "istio/istio",
+      "mcp_benefit_score": 0.9,
+      "mcp_breakdown": {
+        "context_complexity": 0.95,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.9,
+        "task_category_weight": 0.9
+      },
+      "selection_rationale": "New SDLC task: onboarding guide requiring cross-package architecture discovery",
+      "task_dir": "ccb_document/docgen-onboard-001",
+      "context_length": 900000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 15,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-integration-001",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "go",
+      "difficulty": "hard",
+      "category": "integration-test-authoring",
+      "repo": "flipt-io/flipt",
+      "mcp_benefit_score": 0.78,
+      "mcp_breakdown": {
+        "context_complexity": 0.8,
+        "cross_file_deps": 0.75,
+        "semantic_search_potential": 0.75,
+        "task_category_weight": 0.8
+      },
+      "selection_rationale": "New SDLC task: integration test authoring requiring API endpoint discovery",
+      "task_dir": "ccb_test/test-integration-001",
+      "context_length": 700000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 10,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-unitgen-go-001",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "go",
+      "difficulty": "hard",
+      "category": "unit-test-generation",
+      "repo": "kubernetes/kubernetes",
+      "mcp_benefit_score": 0.8,
+      "mcp_breakdown": {
+        "context_complexity": 0.85,
+        "cross_file_deps": 0.75,
+        "semantic_search_potential": 0.8,
+        "task_category_weight": 0.8
+      },
+      "selection_rationale": "New SDLC task: unit test generation requiring function discovery and pattern analysis",
+      "task_dir": "ccb_test/test-unitgen-go-001",
+      "context_length": 800000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 12,
+      "files_count_source": "mcp_breakdown_proxy"
     }
   ]
 }

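Because this diff updates the summary counters by hand alongside the task list, a consistency check is a natural companion. In the diff, the new `tasks_per_benchmark` values sum to 166, matching `total_selected`. The sketch below is illustrative and not part of the commit. The function name is hypothetical, and it takes the counters and the task list as separate arguments because the file's top-level layout is not visible in the diff.

```python
from collections import Counter
from typing import Any, Dict, List


def check_summary_counts(total_selected: int,
                         tasks_per_benchmark: Dict[str, int],
                         tasks_per_language: Dict[str, int],
                         tasks: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable mismatches between summary counters and tasks."""
    problems = []
    if total_selected != len(tasks):
        problems.append(
            f"total_selected={total_selected} but {len(tasks)} tasks listed")
    # Recount tasks by the "benchmark" and "language" fields of each entry.
    by_benchmark = dict(Counter(t["benchmark"] for t in tasks))
    if by_benchmark != tasks_per_benchmark:
        problems.append(f"tasks_per_benchmark mismatch: counted {by_benchmark}")
    by_language = dict(Counter(t["language"] for t in tasks))
    if by_language != tasks_per_language:
        problems.append(f"tasks_per_language mismatch: counted {by_language}")
    return problems
```

An empty return value means the counters agree with the entries. Each string in a non-empty result names one counter that needs regenerating.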