Commit f21632e (parent: 11765ac)

chore: archive locobench benchmark without run outputs and sync ops docs

28 files changed: +5008 / -4 lines

AGENTS.md (1 addition & 1 deletion)

```diff
@@ -79,6 +79,6 @@ python3 scripts/generate_eval_report.py
 - `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks)
 - `configs/*_2config.sh` - per-suite run launchers
 - `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per benchmark)
-- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=900,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
+- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=1800,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
 - Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
 - `scripts/promote_run.py` - staging to official promotion flow
```

CLAUDE.md (1 addition & 1 deletion) — identical change to AGENTS.md

```diff
@@ -79,6 +79,6 @@ python3 scripts/generate_eval_report.py
 - `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks)
 - `configs/*_2config.sh` - per-suite run launchers
 - `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per benchmark)
-- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=900,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
+- Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode; use `--smoke-timeout-overrides "ccb_pytorch=1800,ccb_tac=900,ccb_crossrepo=900"` for timeout-heavy suites.
 - Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
 - `scripts/promote_run.py` - staging to official promotion flow
```
New file (16 additions & 0 deletions):

```
# Large data files - download on VM
data/
data.zip

# Generated files - regenerate on VM
locobench_dataset.jsonl

# Python
__pycache__/
*.pyc

# Test artifacts
smoke_test_jobs/

# macOS
.DS_Store
```
New file (169 additions & 0 deletions):

# LoCoBench-Agent Dataset Exploration

This document describes the structure and contents of the LoCoBench-Agent dataset for the Harbor adapter implementation.

## Directory Structure

```
data/
├── generated/                   # 1000 synthetic code projects
│   ├── <project_id>/            # e.g., c_api_gateway_easy_009
│   │   ├── <project_name>/      # e.g., EduGate_ScholarLink (actual code files)
│   │   └── project_metadata.json
│   └── ...
└── output/
    ├── scenarios/               # 8000 task scenario JSON files
    │   └── <scenario_id>.json
    ├── agent_scenarios/         # 8000 extended multi-turn agent scenarios
    │   └── <scenario_id>.json
    └── validation/
        └── test_suites/         # 8000 test suite definitions
            └── <scenario_id>_tests.json
```
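Given this layout, a scenario ID can be mapped to its three related files with a small helper. This is a sketch with a hypothetical function name; it only assumes the file-naming conventions shown in the tree above (`<scenario_id>.json` and `<scenario_id>_tests.json`).

```python
from pathlib import Path

def scenario_paths(data_root, scenario_id):
    """Map a scenario ID to its scenario, agent-scenario, and test-suite files.

    Hypothetical helper; file names follow the directory layout above.
    """
    root = Path(data_root)
    return {
        "scenario": root / "output" / "scenarios" / f"{scenario_id}.json",
        "agent_scenario": root / "output" / "agent_scenarios" / f"{scenario_id}.json",
        "test_suite": root / "output" / "validation" / "test_suites" / f"{scenario_id}_tests.json",
    }
```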

## Scenario File Format (data/output/scenarios/*.json)

Each scenario file contains a single task definition with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier (e.g., `c_api_gateway_easy_009_architectural_understanding_expert_01`) |
| `task_category` | string | One of 8 categories (see below) |
| `difficulty` | string | `easy`, `medium`, `hard`, or `expert` |
| `title` | string | Human-readable task title |
| `description` | string | Detailed description of the task context and requirements |
| `context_files` | array | List of file paths in the synthetic project (uses `//` as separator) |
| `context_length` | integer | Total token count of all context files |
| `task_prompt` | string | The actual task/question for the agent to solve |
| `expected_approach` | string | How an expert would approach the task |
| `ground_truth` | string or object | Expected answer/solution (format varies by task category) |
| `evaluation_criteria` | array | List of criteria for judging responses |
| `metadata` | object | Additional info including files_count, coverage metrics, timestamp |

### Sample Scenario JSON

```json
{
  "id": "c_api_gateway_easy_009_architectural_understanding_expert_01",
  "task_category": "architectural_understanding",
  "difficulty": "expert",
  "title": "Architectural Refactoring for Dynamic Route Configuration",
  "description": "EduGate ScholarLink is an API gateway...",
  "context_files": [
    "EduGate_ScholarLink//src//main.c",
    "EduGate_ScholarLink//src//components//router.c",
    "EduGate_ScholarLink//include//edugate.h",
    ...
  ],
  "context_length": 128233,
  "task_prompt": "Your task is to analyze the existing architecture...",
  "expected_approach": "An expert developer would approach this...",
  "ground_truth": "The core of a correct solution involves...",
  "evaluation_criteria": [
    "**Analysis Correctness:** Accurately identifies...",
    "**Architectural Viability:** Proposes a sound...",
    ...
  ],
  "metadata": {
    "context_length": 128233,
    "files_count": 11,
    "information_coverage": 0.95,
    "coverage_range": [0.8, 1.0],
    "generation_timestamp": "2025-08-05T15:07:11.561371"
  }
}
```

## Task Categories (8 total)

The dataset contains 8 distinct task categories, each representing a different type of software engineering challenge:

1. **architectural_understanding** - Analyze and propose architectural changes or refactoring
2. **bug_investigation** - Identify root causes of bugs from symptoms and propose fixes
3. **code_comprehension** - Understand and explain how existing code works
4. **cross_file_refactoring** - Refactor code that spans multiple files
5. **feature_implementation** - Add new functionality to existing codebase
6. **integration_testing** - Design or implement integration tests
7. **multi_session_development** - Tasks requiring iterative development across sessions
8. **security_analysis** - Identify vulnerabilities and propose security improvements

## Programming Languages (10 total)

Tasks span 10 programming languages, identified by the prefix in the scenario ID:

- `c` - C
- `cpp` - C++
- `csharp` - C#
- `go` - Go
- `java` - Java
- `javascript` - JavaScript
- `php` - PHP
- `python` - Python
- `rust` - Rust
- `typescript` - TypeScript

## Dataset Statistics

- **Total scenarios**: 8,000 task files
- **Synthetic projects**: 1,000 generated codebases
- **Tasks per project**: 8 (one per task category)
- **Difficulty levels**: easy, medium, hard, expert
- **Context length range**: varies from ~40K to 600K+ tokens

## ID Format Convention

Scenario IDs follow the pattern:

```
{language}_{domain}_{complexity}_{project_num}_{task_category}_{difficulty}_{variant}
```

Example: `python_api_gateway_expert_045_bug_investigation_hard_01`
- Language: `python`
- Domain: `api_gateway`
- Project complexity: `expert`
- Project number: `045`
- Task category: `bug_investigation`
- Task difficulty: `hard`
- Variant: `01`
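Because `domain` and `task_category` can themselves contain underscores, naively splitting on `_` is not enough. One workable sketch (hypothetical function name, not part of the dataset tooling) anchors on the project number, the first all-digit token, and on `difficulty`/`variant` always being the last two tokens:

```python
def parse_scenario_id(scenario_id):
    """Split a scenario ID into its named parts.

    Sketch only: assumes the project number is the first all-digit token
    and that difficulty and variant are always the final two tokens.
    """
    tokens = scenario_id.split("_")
    # First all-digit token marks the project number.
    num_idx = next(i for i, t in enumerate(tokens) if t.isdigit())
    return {
        "language": tokens[0],
        "domain": "_".join(tokens[1:num_idx - 1]),
        "complexity": tokens[num_idx - 1],
        "project_num": tokens[num_idx],
        "task_category": "_".join(tokens[num_idx + 1:-2]),
        "difficulty": tokens[-2],
        "variant": tokens[-1],
    }
```

On the example above this yields `language="python"`, `domain="api_gateway"`, `complexity="expert"`, `project_num="045"`, `task_category="bug_investigation"`, `difficulty="hard"`, `variant="01"`.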

## Extended Agent Scenarios (data/output/agent_scenarios/)

The `agent_scenarios/` folder contains extended versions of each scenario designed for multi-turn agent evaluation. These include:

- `scenario_id` - Matches the base scenario
- `conversation_phases` - Structured phases for agent interaction:
  1. **exploration** - Code exploration phase
  2. **analysis** - Deep analysis phase
  3. **implementation** - Implementation phase
  4. **documentation** - Documentation creation phase
- `dynamic_prompts` - Context-aware follow-up prompts
- `max_turns_in_phase` - Turn limits per phase
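The phase/turn structure suggests a simple driver loop. The dict shape below is an assumption made for illustration (the real field layout may differ); `step_fn` is a hypothetical callback standing in for one agent turn:

```python
def run_phases(agent_scenario, step_fn):
    """Iterate conversation phases, capping each at its turn limit.

    Sketch only: assumes `conversation_phases` is an ordered list of phase
    names and `max_turns_in_phase` maps phase name -> turn limit.
    `step_fn(phase, turn)` performs one turn and returns True to end the
    phase early.
    """
    transcript = []
    for phase in agent_scenario["conversation_phases"]:
        limit = agent_scenario["max_turns_in_phase"].get(phase, 1)
        for turn in range(limit):
            transcript.append((phase, turn))
            if step_fn(phase, turn):
                break  # phase finished before hitting its turn limit
    return transcript
```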

## Validation Test Suites (data/output/validation/test_suites/)

Each scenario has a corresponding test suite JSON with evaluation tests:

- **compilation** - Syntax validation, import resolution, type checking
- **unit** - Function signatures, error handling, input validation, output correctness
- **integration** - Module integration, database integration, API integration
- **performance** - Execution time, memory usage, scalability
- **security** - Injection prevention, input sanitization, access control

## Key Fields for Task Selection

For selecting high-complexity tasks that demonstrate MCP value:

1. **context_length** - Higher values indicate more complex projects requiring better context management
2. **metadata.files_count** - More files suggest cross-file reasoning requirements
3. **task_category** - Some categories inherently require more complex reasoning
4. **difficulty** - Expert/hard tasks are more challenging
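These criteria compose into a simple filter. The threshold defaults below are illustrative, not values from the dataset, and the function name is hypothetical:

```python
def select_tasks(scenarios, min_context=100_000, min_files=8,
                 difficulties=("hard", "expert")):
    """Pick high-complexity scenarios per the criteria above.

    Sketch only: `scenarios` is an iterable of parsed scenario dicts with
    the fields described in the format table; thresholds are illustrative.
    """
    return [
        s for s in scenarios
        if s["context_length"] >= min_context
        and s["metadata"]["files_count"] >= min_files
        and s["difficulty"] in difficulties
    ]
```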

## Notes for Adapter Implementation

1. **File Path Format**: Context file paths use `//` as separators and need normalization to `/`
2. **Ground Truth Format**: Varies by task category (string for analysis tasks, object for bug investigation)
3. **Language Parsing**: Extract from ID prefix (first `_`-separated token)
4. **Project Location**: Match the project from the scenario ID prefix (e.g., `c_api_gateway_easy_009`) to find its code in `generated/`
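Notes 1 and 4 can be sketched as two small helpers (names hypothetical):

```python
def normalize_context_path(path):
    """Note 1: collapse the dataset's `//` separators to `/`."""
    while "//" in path:
        path = path.replace("//", "/")
    return path

def project_dir(scenario_id):
    """Note 4: the project directory under generated/ is the scenario-ID
    prefix up to and including the project number (first all-digit token)."""
    tokens = scenario_id.split("_")
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            return "_".join(tokens[: i + 1])
    raise ValueError(f"no project number in {scenario_id!r}")
```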
