
Commit ce91ac2

LoCoBench Bot and claude committed
fix: navprove path mismatch + pytest config — root cause of 0% scoring
Two infrastructure bugs caused all 18 task-config pairs to fail:

1. Instruction/verifier path mismatch: instructions said `/workspace/regression_test` (no extension) but verifiers expected `/workspace/regression_test.py` (with extension). Agents interpreted the path as a directory name and created subdirectories. Fixed all 9 instruction.md files to specify exact file paths with extensions.

2. Pytest config interference: qutebrowser's pytest.ini caused verifier crashes (deprecated --strict flag, unrecognized --timeout). Fixed 4 qutebrowser test.sh files to use `-c /dev/null` to isolate from project config.

Also added directory fallback in find_and_prove_verifier.sh: if agent creates a directory instead of a file, verifier searches for test files inside it rather than failing immediately.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 1c09788 commit ce91ac2

15 files changed: 45 additions, 25 deletions


benchmarks/ccb_navprove/CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ This suite tests your ability to navigate a large codebase, locate a reported bu

 ## Output Requirements

-Write your regression test to `/workspace/regression_test.{ext}` (use the appropriate extension for the project language — `.py`, `.go`, or `.test.ts`).
+Write your regression test as a **single file** at `/workspace/regression_test.{ext}` (use the appropriate extension for the project language — `.py`, `.go`, or `.test.ts`). Do NOT create a directory — write a single file directly at that path.

 Your regression test MUST:
 1. **Import or invoke** the buggy component directly
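
The single-file requirement above can be checked mechanically before a run is submitted. The following is a minimal sketch, not part of the benchmark: `check_single_file` is a hypothetical helper that classifies a path as a regular non-empty file, a directory, an empty file, or missing.

```bash
#!/usr/bin/env bash
# Hypothetical pre-submission check; not part of the benchmark suite.
# Prints one of: file, directory, empty, missing.
check_single_file() {
    local path="$1"
    if [[ -d "$path" ]]; then
        echo "directory"      # the failure mode this commit addresses
    elif [[ ! -f "$path" ]]; then
        echo "missing"
    elif [[ ! -s "$path" ]]; then
        echo "empty"          # file exists but has zero bytes
    else
        echo "file"
    fi
}

check_single_file "/workspace/regression_test.py"
```

Only the `file` result satisfies the instruction as written; `directory` is the case the verifier fallback in this commit tries to recover from.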

benchmarks/ccb_navprove/_shared/find_and_prove_verifier.sh

Lines changed: 22 additions & 2 deletions
@@ -37,9 +37,29 @@ write_score() {
 }

 # --- Edge case: agent test does not exist or is empty ---
+# Fallback: if agent created a directory instead of a file, look for test files inside
 if [[ ! -f "$AGENT_TEST_PATH" ]]; then
-    echo "ERROR: Agent test not found at $AGENT_TEST_PATH" >> "$SUMMARY_LOG"
-    write_score "0.0"
+    # Strip extension to get potential directory name
+    DIR_PATH="${AGENT_TEST_PATH%.*}"
+    # Remove .test suffix if present (e.g., regression_test.test.ts -> regression_test)
+    DIR_PATH="${DIR_PATH%.test}"
+    FOUND_TEST=""
+    if [[ -d "$DIR_PATH" ]]; then
+        echo "NOTE: $AGENT_TEST_PATH not found as file, searching directory $DIR_PATH/" >> "$SUMMARY_LOG"
+        # Look for test files matching common patterns
+        for pattern in "regression_test.*" "test_*" "*_test.*" "*.test.*"; do
+            FOUND_TEST=$(find "$DIR_PATH" -maxdepth 1 -name "$pattern" -type f | head -1)
+            [[ -n "$FOUND_TEST" ]] && break
+        done
+        if [[ -n "$FOUND_TEST" ]]; then
+            echo "NOTE: Using fallback test file: $FOUND_TEST" >> "$SUMMARY_LOG"
+            AGENT_TEST_PATH="$FOUND_TEST"
+        fi
+    fi
+    if [[ ! -f "$AGENT_TEST_PATH" ]]; then
+        echo "ERROR: Agent test not found at $AGENT_TEST_PATH" >> "$SUMMARY_LOG"
+        write_score "0.0"
+    fi
 fi

 if [[ ! -s "$AGENT_TEST_PATH" ]]; then
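
The fallback in this hunk can be exercised in isolation. Below is a minimal sketch that repackages the same logic as a function. `resolve_agent_test` is a hypothetical name; the real script mutates `AGENT_TEST_PATH` in place and logs to `SUMMARY_LOG`. The one deliberate difference, an assumption rather than the script's behavior, is the added `sort`, which makes the choice deterministic when several files match a pattern.

```bash
#!/usr/bin/env bash
# Sketch of the directory fallback as a standalone function.
# resolve_agent_test is a hypothetical name, not from the commit.
# Prints the resolved test file and returns 0, or returns 1 if none is found.
resolve_agent_test() {
    local test_path="$1"
    if [[ -f "$test_path" ]]; then
        printf '%s\n' "$test_path"
        return 0
    fi

    # regression_test.py      -> regression_test
    # regression_test.test.ts -> regression_test.test -> regression_test
    local dir_path="${test_path%.*}"
    dir_path="${dir_path%.test}"
    [[ -d "$dir_path" ]] || return 1

    local pattern found
    # Patterns are tried in priority order, as in the original loop.
    for pattern in "regression_test.*" "test_*" "*_test.*" "*.test.*"; do
        # sort is an addition here: it fixes which match wins.
        found=$(find "$dir_path" -maxdepth 1 -type f -name "$pattern" | sort | head -n 1)
        if [[ -n "$found" ]]; then
            printf '%s\n' "$found"
            return 0
        fi
    done
    return 1
}
```

A caller could write `AGENT_TEST_PATH=$(resolve_agent_test "$AGENT_TEST_PATH") || write_score "0.0"`, assuming `write_score` terminates the script as the original control flow suggests.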

benchmarks/ccb_navprove/navprove-ansible-vault-001/instruction.md

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ These issues affect `ansible-galaxy collection install` when processing collecti
 ## Your Task

 1. Investigate the codebase to find the root cause of the tar directory extraction fragility
-2. Write a regression test at `/workspace/regression_test` (Python)
-3. Your test must be self-contained and runnable with `python3 -m pytest --timeout=60`
+2. Write a regression test as a single file at `/workspace/regression_test.py`
+3. Your test must be self-contained and runnable with `python3 -m pytest --timeout=60 /workspace/regression_test.py`

 ## Constraints

benchmarks/ccb_navprove/navprove-flipt-cache-001/instruction.md

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@ Specifically:
 ## Your Task

 1. Investigate the codebase to find the root cause of these authentication limitations
-2. Write a regression test at `/workspace/regression_test` (Go)
-3. Your test must be self-contained and runnable with `go test -run TestRegression -v -timeout 60s`
+2. Write a regression test as a single file at `/workspace/regression_test.go`
+3. Your test must be self-contained and runnable with `go test -run TestRegression -v -timeout 60s /workspace/regression_test.go`

 ## Constraints

benchmarks/ccb_navprove/navprove-qb-bookmark-001/instruction.md

Lines changed: 2 additions & 2 deletions
@@ -19,8 +19,8 @@ The same bug affects HSVA notation (HSV with alpha channel). For example, `hsva(
 ## Your Task

 1. Investigate the codebase to find the root cause of the incorrect hue scaling
-2. Write a regression test at `/workspace/regression_test` (Python)
-3. Your test must be self-contained and runnable with `python3 -m pytest --timeout=60`
+2. Write a regression test as a single file at `/workspace/regression_test.py`
+3. Your test must be self-contained and runnable with `python3 -m pytest -c /dev/null --timeout=60 /workspace/regression_test.py`

 ## Constraints

benchmarks/ccb_navprove/navprove-qb-bookmark-001/tests/test.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # Sources the shared find_and_prove_verifier to run 2-phase majority-of-3 verification.

 export AGENT_TEST_PATH="/workspace/regression_test.py"
-export TEST_COMMAND="python3 -m pytest --timeout=60"
+export TEST_COMMAND="python3 -m pytest -c /dev/null --timeout=60"
 export REFERENCE_PATCH="/tests/reference_fix.patch"
 export PATCH_APPLY_DIR="/workspace"
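
How the shared verifier consumes these exports is not shown in this diff. The sketch below is one plausible reading, under the assumption that `TEST_COMMAND` is split on whitespace and the test path is appended as the final argument. `build_test_argv` and `TEST_ARGV` are hypothetical names, and the function only builds the argument list rather than running pytest.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a command string into words and append the
# test path. Fills the global array TEST_ARGV; nothing is executed.
build_test_argv() {
    local command_string="$1" test_path="$2"
    # read -a splits on IFS whitespace and does not perform glob expansion.
    read -r -a TEST_ARGV <<< "$command_string"
    TEST_ARGV+=("$test_path")
}

build_test_argv "python3 -m pytest -c /dev/null --timeout=60" \
                "/workspace/regression_test.py"

# Show one argument per line for inspection.
printf '%s\n' "${TEST_ARGV[@]}"
```

Building an array avoids re-parsing the command with `eval`. The `-c /dev/null` pair stays adjacent in the argument list, so pytest is pointed at an empty configuration file instead of discovering the project's pytest.ini.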

benchmarks/ccb_navprove/navprove-qb-download-001/instruction.md

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@ This is particularly problematic because cache corruption can happen for reasons
 ## Your Task

 1. Investigate the codebase to find the root cause of the unhandled exception during cache loading
-2. Write a regression test at `/workspace/regression_test` (Python)
-3. Your test must be self-contained and runnable with `python3 -m pytest --timeout=60`
+2. Write a regression test as a single file at `/workspace/regression_test.py`
+3. Your test must be self-contained and runnable with `python3 -m pytest -c /dev/null --timeout=60 /workspace/regression_test.py`

 ## Constraints

benchmarks/ccb_navprove/navprove-qb-download-001/tests/test.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # Sources the shared find_and_prove_verifier to run 2-phase majority-of-3 verification.

 export AGENT_TEST_PATH="/workspace/regression_test.py"
-export TEST_COMMAND="python3 -m pytest --timeout=60"
+export TEST_COMMAND="python3 -m pytest -c /dev/null --timeout=60"
 export REFERENCE_PATCH="/tests/reference_fix.patch"
 export PATCH_APPLY_DIR="/workspace"

benchmarks/ccb_navprove/navprove-qb-tab-001/instruction.md

Lines changed: 2 additions & 2 deletions
@@ -14,8 +14,8 @@ This only affects the text/foreground threshold setting. Other dark mode setting
 ## Your Task

 1. Investigate the codebase to find the root cause of the ignored threshold setting
-2. Write a regression test at `/workspace/regression_test` (Python)
-3. Your test must be self-contained and runnable with `python3 -m pytest --timeout=60`
+2. Write a regression test as a single file at `/workspace/regression_test.py`
+3. Your test must be self-contained and runnable with `python3 -m pytest -c /dev/null --timeout=60 /workspace/regression_test.py`

 ## Constraints

benchmarks/ccb_navprove/navprove-qb-tab-001/tests/test.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # Sources the shared find_and_prove_verifier to run 2-phase majority-of-3 verification.

 export AGENT_TEST_PATH="/workspace/regression_test.py"
-export TEST_COMMAND="python3 -m pytest --timeout=60"
+export TEST_COMMAND="python3 -m pytest -c /dev/null --timeout=60"
 export REFERENCE_PATCH="/tests/reference_fix.patch"
 export PATCH_APPLY_DIR="/workspace"
