
Commit 479cd19

Update difficulty analysis explanation and v2 formula details

1 parent dcb1bac · commit 479cd19
2 files changed: +38 −4 lines

docs/WHITE_PAPER_REPORT_V2.md

Lines changed: 19 additions & 2 deletions
@@ -166,7 +166,7 @@ Tasks are drawn from established benchmarks and custom-authored challenges, then
 | Suite            | SDLC Phase               | Tasks | Difficulty Range | Languages                            |
 | ---------------- | ------------------------ | ----: | ---------------- | ------------------------------------ |
 | `ccb_understand` | Requirements & Discovery |    20 | hard             | C++, Go, Java, Python, TS            |
-| `ccb_design`     | Architecture & Design    |    20 | hard--very_hard  | C, C++, Go, Java, Python             |
+| `ccb_design`     | Architecture & Design    |    20 | hard--expert     | C, C++, Go, Java, Python             |
 | `ccb_fix`        | Bug Repair               |    25 | medium--hard     | C++, Go, Java, JS, Python, TS        |
 | `ccb_build`      | Feature & Refactoring    |    25 | medium--hard     | C#, C++, Go, Java, JS, Rust, TS      |
 | `ccb_test`       | Testing & QA             |    20 | medium--hard     | C, C#, C++, Go, Java, JS, Python, TS |

@@ -923,7 +923,24 @@ C tasks have the highest mean reward (0.801), driven by the Linux kernel fault l
 | Hard   | 136 | 0.638 | 0.611 | 89.0%  |
 | Expert |  13 | 0.738 | 0.728 | 100.0% |

-The counterintuitive result that "hard" tasks outperform "medium" tasks reflects that difficulty ratings were assigned based on expected human effort, not agent capability. Difficulty is a task-authoring metadata field (`task.toml` / selection registry `difficulty`) set from the anticipated human effort and coordination complexity of the scenario, rather than calibrated to current model behavior. Expert tasks (all Linux kernel fault localization) score highest because they are well-structured pattern-matching problems that agents handle effectively despite the large codebase scale.
+The `hard > medium` result here is primarily a composition effect in the paired direct-run subset, not evidence that the difficulty metadata is inverted. Medium has only 21 paired tasks and is concentrated in lower-performing slices (`ccb_fix`, `ccb_build`, and `ccb_test`), while hard has 136 paired tasks spread across several higher-performing suites (`ccb_document`, `ccb_understand`, `ccb_secure`, `ccb_design`, etc.). Expert remains highest because this bucket is small (13 tasks) and dominated by Linux fault-localization tasks that currently score well under the rubric verifier.
+
+Difficulty metadata is assigned by the deterministic v2 formula used in `scripts/rescore_difficulty.py`:
+
+```text
+difficulty_score = 0.40*size_score + 0.35*complexity_score + 0.25*ground_truth_depth_score
+label = expert  if score >= 0.86
+label = hard    if 0.62 <= score < 0.86
+label = medium  otherwise
+override: ccb_debug linux-* -> expert
+```
+
+Priors and task-type values used by the formula:
+
+- `size_score`: `0.70*context_norm + 0.30*files_norm`; priors if missing are `context=0.60`, `files=0.55`.
+- `complexity_score`: from `mcp_breakdown` when present (`0.50*cross_file_deps + 0.30*semantic_search_potential + 0.20*task_category_weight`), with priors `cross_file_deps=0.70`, `semantic_search_potential=0.70`.
+- `task_category_weight` defaults come from category lookup (examples: `documentation_generation=0.58`, `feature_implementation=0.76`, `cross_file_reasoning=0.84`, `fault_localization=0.92`, `dependency_chain_analysis=0.95`).
+- `ground_truth_depth_score`: base from `reward_type` lookup (e.g., `binary=0.35`, `test_ratio=0.45`, `checklist=0.78`, `ir_checklist=0.86`, `find_and_prove=0.90`) plus bonuses for ground-truth richness (`ground_truth.json` component count, `criteria.json`, `oracle_answer.json`).

 ### 11.5 Reward by Codebase Size
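For illustration, the scoring and labeling rules in the added text can be sketched in Python. This is a hypothetical reconstruction from the documented formula, not the actual code in `scripts/rescore_difficulty.py`; the function and parameter names are invented, and missing inputs fall back to the stated priors.

```python
# Hypothetical sketch of the documented v2 difficulty formula;
# names are invented and may differ from scripts/rescore_difficulty.py.

# Priors used when a signal is missing from task metadata.
CONTEXT_PRIOR, FILES_PRIOR = 0.60, 0.55
CROSS_FILE_PRIOR, SEMANTIC_PRIOR = 0.70, 0.70


def size_score(context_norm=None, files_norm=None):
    """0.70*context_norm + 0.30*files_norm, with priors for missing values."""
    c = CONTEXT_PRIOR if context_norm is None else context_norm
    f = FILES_PRIOR if files_norm is None else files_norm
    return 0.70 * c + 0.30 * f


def complexity_score(cross_file_deps=None, semantic_search=None, category_weight=0.76):
    """Weighted mcp_breakdown signals; the default category weight (0.76,
    feature_implementation) is just an illustrative choice."""
    d = CROSS_FILE_PRIOR if cross_file_deps is None else cross_file_deps
    s = SEMANTIC_PRIOR if semantic_search is None else semantic_search
    return 0.50 * d + 0.30 * s + 0.20 * category_weight


def difficulty_label(size, complexity, gt_depth, suite="", task_id=""):
    """Combine the three component scores and map the result to a label."""
    if suite == "ccb_debug" and task_id.startswith("linux-"):
        return "expert"  # hard override for Linux fault-localization tasks
    score = 0.40 * size + 0.35 * complexity + 0.25 * gt_depth
    if score >= 0.86:
        return "expert"
    if score >= 0.62:
        return "hard"
    return "medium"
```

For example, a task saturating all three components (1.0 each) scores 1.0 and labels `expert`, while one at 0.5 across the board scores 0.5 and labels `medium`, consistent with the 0.62/0.86 thresholds above.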

docs/technical_reports/TECHNICAL_REPORT_V1.md

Lines changed: 19 additions & 2 deletions
This file receives the identical change described for `docs/WHITE_PAPER_REPORT_V2.md` above (the same two hunks at lines 166 and 923).
