Commit b7d30e2

sjarmak and claude committed
fix: repair 8 MCP-unique verifier routes and hydrate 7 empty oracles
- Replace 8 dual-mode test.sh (branching on .artifact_only_mode) with always-eval pattern that calls eval.sh directly, matching the 73 other MCP-unique tasks. The dual-mode pattern routed to placeholder direct_verifier.sh stubs that unconditionally returned reward=0.
- Hydrate task_spec.json oracle arrays for 7 tasks from ground_truth.json: compliance-115 (3 files), compliance-118 (5), dep-trace-116 (3), domain-120 (21), migration-114 (11), migration-117 (16), platform-119 (13).
- Fix migration-022 instruction-oracle mismatch: instruction asked about removed Kafka producer configs (block.on.buffer.full etc.) but oracle was redesigned to test @deprecated annotations. Updated instruction.md and instruction_mcp.md to match the oracle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
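The hydration step described in the commit message could look roughly like the sketch below. The layout of ground_truth.json is not shown in this commit, so the `files` key and the `hydrate_oracle` helper are assumptions made for illustration; only the `artifacts.oracle.required_files` location and the `repo`/`path` entry shape come from the task_spec.json diffs in this commit.

```python
import copy


def hydrate_oracle(task_spec: dict, ground_truth: dict) -> dict:
    """Return a copy of task_spec with an empty required_files list filled in.

    Assumption (hypothetical): ground_truth["files"] is a list of
    {"repo": ..., "path": ...} entries. The real schema may differ.
    """
    spec = copy.deepcopy(task_spec)
    oracle = spec["artifacts"]["oracle"]
    if not oracle.get("required_files"):
        # Keep only the two fields used by the oracle entries in this commit.
        oracle["required_files"] = [
            {"repo": f["repo"], "path": f["path"]} for f in ground_truth["files"]
        ]
    return spec


empty_spec = {"artifacts": {"oracle": {"required_files": [], "dependency_chains": []}}}
truth = {
    "files": [
        {"repo": "django/django", "path": "django/contrib/sessions/backends/base.py"},
        {"repo": "django/django", "path": "django/contrib/sessions/backends/db.py"},
        {"repo": "django/django", "path": "django/contrib/auth/__init__.py"},
    ]
}
hydrated = hydrate_oracle(empty_spec, truth)
print(len(hydrated["artifacts"]["oracle"]["required_files"]))
```

The helper copies the spec rather than mutating it, so a hydration pass can be diffed against the original file before anything is written back.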
1 parent 20b2411 commit b7d30e2

18 files changed (+746, -213 lines)

benchmarks/ccb_mcp_compliance/ccx-compliance-115/tests/task_spec.json

Lines changed: 50 additions & 15 deletions
@@ -5,35 +5,70 @@
  "category": "F",
  "mcp_suite": "ccb_mcp_compliance",
  "prd": {
- "user_story": "As a developer, I want to: Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` — the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` — the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists.",
- "constraints": ["Provide specific file paths and repository names in your answer.", "Write your findings to /workspace/answer.json."],
+ "user_story": "As a developer, I want to: Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` \u2014 the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` \u2014 the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists.",
+ "constraints": [
+ "Provide specific file paths and repository names in your answer.",
+ "Write your findings to /workspace/answer.json."
+ ],
  "success_definition": "Agent successfully identifies relevant files and symbols across all repos in the django-web-framework fixture.",
- "seed_prompt": "Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists."
+ "seed_prompt": "Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` \u2014 the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` \u2014 the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists."
  },
  "artifacts": {
  "repo_set_id": "django-web-framework",
  "oracle": {
- "required_files": [],
+ "required_files": [
+ {
+ "repo": "django/django",
+ "path": "django/contrib/sessions/backends/base.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/sessions/backends/db.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/auth/__init__.py"
+ }
+ ],
  "required_symbols": [],
  "required_references": [],
- "dependency_chains": []
+ "dependency_chains": [
+ {
+ "steps": [
+ {
+ "repo": "django/django",
+ "path": "django/contrib/sessions/backends/base.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/sessions/backends/db.py"
+ }
+ ]
+ }
+ ]
  }
  },
  "evaluation": {
- "modes": ["deterministic"],
+ "modes": [
+ "deterministic"
+ ],
  "checks": [
- {
- "type": "file_set_match",
- "params": {
- "search_pattern": "",
- "file_filter": ""
- }
- }
- ],
+ {
+ "type": "file_set_match",
+ "params": {
+ "search_pattern": "",
+ "file_filter": ""
+ }
+ }
+ ],
  "eval_script": "/tests/eval.sh",
  "pass_exit_code": 0
  },
  "logging": {
- "required_metrics": ["oracle_coverage", "time_to_first_oracle_hit_ms", "unique_repos_touched"]
+ "required_metrics": [
+ "oracle_coverage",
+ "time_to_first_oracle_hit_ms",
+ "unique_repos_touched"
+ ]
  }
  }
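To see why an empty `required_files` array leaves a file-set comparison with nothing to score, here is a minimal sketch of a coverage calculation over the hydrated oracle above. The metric name `oracle_coverage` appears in the task spec, but how the harness actually computes it is not shown in this commit; the function below, its handling of an empty oracle, and the shape of the answer list are all assumptions.

```python
def oracle_coverage(required_files, reported_files):
    """Fraction of oracle (repo, path) pairs that also appear in the answer.

    Hypothetical definition: returns None for an empty oracle, because
    there is nothing to score against.
    """
    required = {(f["repo"], f["path"]) for f in required_files}
    if not required:
        return None
    reported = {(f["repo"], f["path"]) for f in reported_files}
    return len(required & reported) / len(required)


oracle = [
    {"repo": "django/django", "path": "django/contrib/sessions/backends/base.py"},
    {"repo": "django/django", "path": "django/contrib/sessions/backends/db.py"},
    {"repo": "django/django", "path": "django/contrib/auth/__init__.py"},
]
# Hypothetical agent answer that finds two of the three oracle files.
answer = [
    {"repo": "django/django", "path": "django/contrib/sessions/backends/base.py"},
    {"repo": "django/django", "path": "django/contrib/sessions/backends/db.py"},
]
print(oracle_coverage(oracle, answer))
print(oracle_coverage([], answer))
```

With the pre-commit spec (empty oracle) the second call has no meaningful score to return, which is the situation the hydration fixes.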
Lines changed: 4 additions & 12 deletions
@@ -1,17 +1,9 @@
  #!/bin/bash
- # test.sh — Dual-mode verifier dispatcher
- # Artifact mode (.artifact_only_mode exists): run eval.sh (oracle scoring)
- # Direct mode (default / .sg_only_mode): run direct_verifier.sh
-
- set -e
+ # test.sh — Harbor compatibility wrapper
+ # Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+ # The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).

  # sg_only_env: restore full repo before verification (no-op for regular runs)
  [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh

- if [ -f /tmp/.artifact_only_mode ]; then
- echo "[test.sh] Artifact mode detected -> running eval.sh (oracle verifier)"
- exec bash "$(dirname "$0")/eval.sh" "$@"
- else
- echo "[test.sh] Direct mode -> running direct_verifier.sh"
- exec bash "$(dirname "$0")/direct_verifier.sh" "$@"
- fi
+ exec bash "$(dirname "$0")/eval.sh" "$@"
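A check along the following lines could flag any test.sh that still uses the dual-mode dispatcher removed above. It is only a sketch: the `is_dual_mode` helper is hypothetical, and the two markers are taken from the removed lines in this diff rather than from any documented convention.

```python
import re

# Markers taken from the removed dispatcher shown in this commit.
DUAL_MODE_MARKERS = (r"\.artifact_only_mode", r"direct_verifier\.sh")


def is_dual_mode(test_sh_text: str) -> bool:
    """True if a non-comment line of the script mentions a dual-mode marker."""
    for line in test_sh_text.splitlines():
        code = line.strip()
        if code.startswith("#"):
            continue  # ignore comments (and the shebang line)
        if any(re.search(marker, code) for marker in DUAL_MODE_MARKERS):
            return True
    return False


old_script = '''#!/bin/bash
set -e
if [ -f /tmp/.artifact_only_mode ]; then
exec bash "$(dirname "$0")/eval.sh" "$@"
else
exec bash "$(dirname "$0")/direct_verifier.sh" "$@"
fi
'''
new_script = '''#!/bin/bash
# test.sh wrapper
exec bash "$(dirname "$0")/eval.sh" "$@"
'''
print(is_dual_mode(old_script), is_dual_mode(new_script))
```

Comment lines are skipped so that a wrapper whose header merely mentions the old pattern is not reported.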

benchmarks/ccb_mcp_compliance/ccx-compliance-118/tests/task_spec.json

Lines changed: 62 additions & 15 deletions
@@ -5,35 +5,82 @@
  "category": "F",
  "mcp_suite": "ccb_mcp_compliance",
  "prd": {
- "user_story": "As a developer, I want to: Audit Django's admin filter rendering pipeline to identify where empty related-field filters are constructed. Find: 1. The Python source file in `django/contrib/admin/` that defines `RelatedFieldListFilter` — the class responsible for rendering ForeignKey filter dropdowns in the admin sidebar. 2. The file that defines the base `ListFilter` class and its `has_output()` method that determines whether a filter should be displayed. 3. The admin `ChangeList` class file that collects and renders filters, calling `has_output()` for each. 4. The template tag or view file that iterates over filters in the sidebar. For each, report the file path, class name, and the specific method that controls filter visibility.",
- "constraints": ["Provide specific file paths and repository names in your answer.", "Write your findings to /workspace/answer.json."],
+ "user_story": "As a developer, I want to: Audit Django's admin filter rendering pipeline to identify where empty related-field filters are constructed. Find: 1. The Python source file in `django/contrib/admin/` that defines `RelatedFieldListFilter` \u2014 the class responsible for rendering ForeignKey filter dropdowns in the admin sidebar. 2. The file that defines the base `ListFilter` class and its `has_output()` method that determines whether a filter should be displayed. 3. The admin `ChangeList` class file that collects and renders filters, calling `has_output()` for each. 4. The template tag or view file that iterates over filters in the sidebar. For each, report the file path, class name, and the specific method that controls filter visibility.",
+ "constraints": [
+ "Provide specific file paths and repository names in your answer.",
+ "Write your findings to /workspace/answer.json."
+ ],
  "success_definition": "Agent successfully identifies relevant files and symbols across all repos in the django-web-framework fixture.",
- "seed_prompt": "Audit Django's admin filter rendering pipeline to identify where empty related-field filters are constructed. Find: 1. The Python source file in `django/contrib/admin/` that defines `RelatedFieldListFilter` the class responsible for rendering ForeignKey filter dropdowns in the admin sidebar. 2. The file that defines the base `ListFilter` class and its `has_output()` method that determines whether a filter should be displayed. 3. The admin `ChangeList` class file that collects and renders filters, calling `has_output()` for each. 4. The template tag or view file that iterates over filters in the sidebar. For each, report the file path, class name, and the specific method that controls filter visibility."
+ "seed_prompt": "Audit Django's admin filter rendering pipeline to identify where empty related-field filters are constructed. Find: 1. The Python source file in `django/contrib/admin/` that defines `RelatedFieldListFilter` \u2014 the class responsible for rendering ForeignKey filter dropdowns in the admin sidebar. 2. The file that defines the base `ListFilter` class and its `has_output()` method that determines whether a filter should be displayed. 3. The admin `ChangeList` class file that collects and renders filters, calling `has_output()` for each. 4. The template tag or view file that iterates over filters in the sidebar. For each, report the file path, class name, and the specific method that controls filter visibility."
  },
  "artifacts": {
  "repo_set_id": "django-web-framework",
  "oracle": {
- "required_files": [],
+ "required_files": [
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/filters.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/views/main.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/templatetags/admin_list.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/templates/admin/change_list.html"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/templates/admin/filter.html"
+ }
+ ],
  "required_symbols": [],
  "required_references": [],
- "dependency_chains": []
+ "dependency_chains": [
+ {
+ "steps": [
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/views/main.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/filters.py"
+ },
+ {
+ "repo": "django/django",
+ "path": "django/contrib/admin/templatetags/admin_list.py"
+ }
+ ]
+ }
+ ]
  }
  },
  "evaluation": {
- "modes": ["deterministic"],
+ "modes": [
+ "deterministic"
+ ],
  "checks": [
- {
- "type": "file_set_match",
- "params": {
- "search_pattern": "",
- "file_filter": ""
- }
- }
- ],
+ {
+ "type": "file_set_match",
+ "params": {
+ "search_pattern": "",
+ "file_filter": ""
+ }
+ }
+ ],
  "eval_script": "/tests/eval.sh",
  "pass_exit_code": 0
  },
  "logging": {
- "required_metrics": ["oracle_coverage", "time_to_first_oracle_hit_ms", "unique_repos_touched"]
+ "required_metrics": [
+ "oracle_coverage",
+ "time_to_first_oracle_hit_ms",
+ "unique_repos_touched"
+ ]
  }
  }
Lines changed: 4 additions & 12 deletions
@@ -1,17 +1,9 @@
  #!/bin/bash
- # test.sh — Dual-mode verifier dispatcher
- # Artifact mode (.artifact_only_mode exists): run eval.sh (oracle scoring)
- # Direct mode (default / .sg_only_mode): run direct_verifier.sh
-
- set -e
+ # test.sh — Harbor compatibility wrapper
+ # Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+ # The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).

  # sg_only_env: restore full repo before verification (no-op for regular runs)
  [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh

- if [ -f /tmp/.artifact_only_mode ]; then
- echo "[test.sh] Artifact mode detected -> running eval.sh (oracle verifier)"
- exec bash "$(dirname "$0")/eval.sh" "$@"
- else
- echo "[test.sh] Direct mode -> running direct_verifier.sh"
- exec bash "$(dirname "$0")/direct_verifier.sh" "$@"
- fi
+ exec bash "$(dirname "$0")/eval.sh" "$@"

benchmarks/ccb_mcp_crossrepo_tracing/ccx-dep-trace-116/tests/task_spec.json

Lines changed: 54 additions & 15 deletions
@@ -5,35 +5,74 @@
  "category": "A",
  "mcp_suite": "ccb_mcp_crossrepo_tracing",
  "prd": {
- "user_story": "As a developer, I want to: Trace the `TypeMeta` struct from its usage in the `Pod` type definition to its authoritative definition, following re-exports across Kubernetes repositories. Find: 1. In `kubernetes/kubernetes` staging area: the file `staging/src/k8s.io/api/core/v1/types.go` where `Pod` embeds `metav1.TypeMeta` — report the import alias and import path. 2. In `kubernetes/api` or the api staging module: the file that re-exports `TypeMeta` via the `meta/v1` package. 3. In `kubernetes/apimachinery`: the file `pkg/apis/meta/v1/types.go` where `TypeMeta` is originally defined — report the struct fields (`Kind` and `APIVersion`). For each step, report the repo, file path, line number, and the relevant type/import declaration.",
- "constraints": ["Provide specific file paths and repository names in your answer.", "Write your findings to /workspace/answer.json."],
+ "user_story": "As a developer, I want to: Trace the `TypeMeta` struct from its usage in the `Pod` type definition to its authoritative definition, following re-exports across Kubernetes repositories. Find: 1. In `kubernetes/kubernetes` staging area: the file `staging/src/k8s.io/api/core/v1/types.go` where `Pod` embeds `metav1.TypeMeta` \u2014 report the import alias and import path. 2. In `kubernetes/api` or the api staging module: the file that re-exports `TypeMeta` via the `meta/v1` package. 3. In `kubernetes/apimachinery`: the file `pkg/apis/meta/v1/types.go` where `TypeMeta` is originally defined \u2014 report the struct fields (`Kind` and `APIVersion`). For each step, report the repo, file path, line number, and the relevant type/import declaration.",
+ "constraints": [
+ "Provide specific file paths and repository names in your answer.",
+ "Write your findings to /workspace/answer.json."
+ ],
  "success_definition": "Agent successfully identifies relevant files and symbols across all repos in the kubernetes-ecosystem fixture.",
- "seed_prompt": "Trace the `TypeMeta` struct from its usage in the `Pod` type definition to its authoritative definition, following re-exports across Kubernetes repositories. Find: 1. In `kubernetes/kubernetes` staging area: the file `staging/src/k8s.io/api/core/v1/types.go` where `Pod` embeds `metav1.TypeMeta` report the import alias and import path. 2. In `kubernetes/api` or the api staging module: the file that re-exports `TypeMeta` via the `meta/v1` package. 3. In `kubernetes/apimachinery`: the file `pkg/apis/meta/v1/types.go` where `TypeMeta` is originally defined report the struct fields (`Kind` and `APIVersion`). For each step, report the repo, file path, line number, and the relevant type/import declaration."
+ "seed_prompt": "Trace the `TypeMeta` struct from its usage in the `Pod` type definition to its authoritative definition, following re-exports across Kubernetes repositories. Find: 1. In `kubernetes/kubernetes` staging area: the file `staging/src/k8s.io/api/core/v1/types.go` where `Pod` embeds `metav1.TypeMeta` \u2014 report the import alias and import path. 2. In `kubernetes/api` or the api staging module: the file that re-exports `TypeMeta` via the `meta/v1` package. 3. In `kubernetes/apimachinery`: the file `pkg/apis/meta/v1/types.go` where `TypeMeta` is originally defined \u2014 report the struct fields (`Kind` and `APIVersion`). For each step, report the repo, file path, line number, and the relevant type/import declaration."
  },
  "artifacts": {
  "repo_set_id": "kubernetes-ecosystem",
  "oracle": {
- "required_files": [],
+ "required_files": [
+ {
+ "repo": "kubernetes/kubernetes",
+ "path": "staging/src/k8s.io/api/core/v1/types.go"
+ },
+ {
+ "repo": "kubernetes/api",
+ "path": "core/v1/types.go"
+ },
+ {
+ "repo": "kubernetes/apimachinery",
+ "path": "pkg/apis/meta/v1/types.go"
+ }
+ ],
  "required_symbols": [],
  "required_references": [],
- "dependency_chains": []
+ "dependency_chains": [
+ {
+ "steps": [
+ {
+ "repo": "kubernetes/kubernetes",
+ "path": "staging/src/k8s.io/api/core/v1/types.go"
+ },
+ {
+ "repo": "kubernetes/api",
+ "path": "core/v1/types.go"
+ },
+ {
+ "repo": "kubernetes/apimachinery",
+ "path": "pkg/apis/meta/v1/types.go"
+ }
+ ]
+ }
+ ]
  }
  },
  "evaluation": {
- "modes": ["deterministic"],
+ "modes": [
+ "deterministic"
+ ],
  "checks": [
- {
- "type": "file_set_match",
- "params": {
- "search_pattern": "",
- "file_filter": ""
- }
- }
- ],
+ {
+ "type": "file_set_match",
+ "params": {
+ "search_pattern": "",
+ "file_filter": ""
+ }
+ }
+ ],
  "eval_script": "/tests/eval.sh",
  "pass_exit_code": 0
  },
  "logging": {
- "required_metrics": ["oracle_coverage", "time_to_first_oracle_hit_ms", "unique_repos_touched"]
+ "required_metrics": [
+ "oracle_coverage",
+ "time_to_first_oracle_hit_ms",
+ "unique_repos_touched"
+ ]
  }
  }
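In the hydrated specs shown in this commit, every step of a `dependency_chains` entry also appears in `required_files`. Whether the harness requires that is not stated, so the sketch below treats it as an assumed invariant; the `chain_steps_missing_from_files` helper is hypothetical and is meant only to illustrate a consistency check one could run after hydration.

```python
def chain_steps_missing_from_files(oracle: dict) -> list:
    """List (repo, path) chain steps that are absent from required_files."""
    required = {(f["repo"], f["path"]) for f in oracle.get("required_files", [])}
    missing = []
    for chain in oracle.get("dependency_chains", []):
        for step in chain.get("steps", []):
            key = (step["repo"], step["path"])
            if key not in required:
                missing.append(key)
    return missing


# Entries copied from the dep-trace-116 oracle in this commit.
files = [
    {"repo": "kubernetes/kubernetes", "path": "staging/src/k8s.io/api/core/v1/types.go"},
    {"repo": "kubernetes/api", "path": "core/v1/types.go"},
    {"repo": "kubernetes/apimachinery", "path": "pkg/apis/meta/v1/types.go"},
]
oracle = {"required_files": files, "dependency_chains": [{"steps": list(files)}]}
print(chain_steps_missing_from_files(oracle))

# Dropping one required file makes the last chain step dangle.
broken = {"required_files": files[:2], "dependency_chains": [{"steps": list(files)}]}
print(chain_steps_missing_from_files(broken))
```

An empty result for every task would indicate that the hydrated chains and file lists agree with each other.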
Lines changed: 4 additions & 12 deletions
@@ -1,17 +1,9 @@
  #!/bin/bash
- # test.sh — Dual-mode verifier dispatcher
- # Artifact mode (.artifact_only_mode exists): run eval.sh (oracle scoring)
- # Direct mode (default / .sg_only_mode): run direct_verifier.sh
-
- set -e
+ # test.sh — Harbor compatibility wrapper
+ # Harbor requires tests/test.sh for task discovery (TaskPaths.is_valid() check).
+ # The actual evaluation logic lives in eval.sh (SWE-Factory exit-code-first pattern).

  # sg_only_env: restore full repo before verification (no-op for regular runs)
  [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh

- if [ -f /tmp/.artifact_only_mode ]; then
- echo "[test.sh] Artifact mode detected -> running eval.sh (oracle verifier)"
- exec bash "$(dirname "$0")/eval.sh" "$@"
- else
- echo "[test.sh] Direct mode -> running direct_verifier.sh"
- exec bash "$(dirname "$0")/direct_verifier.sh" "$@"
- fi
+ exec bash "$(dirname "$0")/eval.sh" "$@"
