diff --git a/benchmarks/ccb_document/docgen-changelog-001/environment/Dockerfile b/benchmarks/ccb_document/docgen-changelog-001/environment/Dockerfile
Lines changed: 20 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-001/instruction.md b/benchmarks/ccb_document/docgen-changelog-001/instruction.md
Lines changed: 59 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-001/task.toml b/benchmarks/ccb_document/docgen-changelog-001/task.toml
Lines changed: 55 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-001/tests/ground_truth.json b/benchmarks/ccb_document/docgen-changelog-001/tests/ground_truth.json
Lines changed: 79 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-001/tests/test.sh b/benchmarks/ccb_document/docgen-changelog-001/tests/test.sh
Lines changed: 98 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-002/environment/Dockerfile b/benchmarks/ccb_document/docgen-changelog-002/environment/Dockerfile
Lines changed: 20 additions & 0 deletions
diff --git a/benchmarks/ccb_document/docgen-changelog-002/instruction.md b/benchmarks/ccb_document/docgen-changelog-002/instruction.md
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,20 @@
+FROM golang:1.22-bookworm
+
+WORKDIR /workspace
+
+RUN apt-get update && apt-get install -y \
+    git \
+    curl \
+    python3 \
+    npm \
+    ripgrep \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN npm install -g @anthropic-ai/claude-code
+
+RUN git clone --filter=blob:none --no-checkout https://github.com/hashicorp/terraform.git . && \
+    git checkout a3dc571150a7651a1a4a8b302342d26089c97795 && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent"
+
+RUN mkdir -p /app
@@ -0,0 +1,59 @@
+# Task: Generate Terraform Changelog
+
+**Repository:** hashicorp/terraform
+**Output:** Write your changelog to `/workspace/CHANGELOG.md`
+
+## Objective
+
+Generate a structured changelog entry for Terraform covering significant changes in the repository. Analyze the commit history, existing CHANGELOG.md, and source changes to produce a well-categorized changelog.
+
+## Scope
+
+Analyze the repository to identify:
+- **New features**: New commands, configuration options, provider capabilities
+- **Bug fixes**: Resolved issues with state management, plan/apply correctness, or CLI behavior
+- **Breaking changes**: Any changes that require user action or break backward compatibility
+- **Deprecations**: Features or behaviors being phased out
+- **Performance improvements**: Changes that improve plan/apply speed or memory usage
+
+Focus on changes visible in the `internal/`, `command/`, and `backend/` directories.
+
+## Output Format
+
+Write to `/workspace/CHANGELOG.md` using Terraform's established format:
+
+```markdown
+## [Unreleased] / next release
+
+### Breaking Changes
+
+- **[component]**: Description of breaking change and migration path
+
+### New Features
+
+- **[component]**: Description of new feature with usage example if applicable
+
+### Bug Fixes
+
+- **[component]**: Description of the bug and what was fixed
+
+### Performance Improvements
+
+- **[component]**: Description
+
+### Deprecations
+
+- **[component]**: What is deprecated and what to use instead
+```
+
+## Quality Bar
+
+- Every entry must reference a specific component (e.g., `backend/s3`, `command/apply`, `internal/states`)
+- Breaking changes must include a migration path or workaround
+- At least 3 entries per category that has changes
+- Do not fabricate entries — base everything on actual code/commit evidence
+
+## Anti-Requirements
+
+- Do not copy the existing CHANGELOG.md verbatim
+- Do not include changes that are not evidenced in the codebase
@@ -0,0 +1,55 @@
+version = "1.0"
+
+[metadata]
+name = "docgen-changelog-001"
+description = "Generate a structured changelog for Terraform v1.8.0 covering features, bug fixes, and breaking changes"
+license = "BUSL-1.1"
+
+[task]
+id = "docgen-changelog-001"
+repo = "hashicorp/terraform"
+category = "changelog_generation"
+language = "go"
+difficulty = "medium"
+time_limit_sec = 1200
+
+[task.metadata]
+complexity_dimension = "documentation_reasoning"
+doc_type = "changelog"
+target_module = "internal/,command/,backend/"
+output_path = "/workspace/CHANGELOG.md"
+
+[verification]
+type = "test"
+command = "bash /tests/test.sh"
+reward_type = "checklist"
+description = "Weighted checklist: change categorization accuracy (0.40) + completeness (0.30) + format compliance (0.30)"
+
+[environment]
+build_timeout_sec = 900.0
+
+[environment.setup_scripts]
+mcp_config = """#!/bin/bash
+if [ -n "$SOURCEGRAPH_ACCESS_TOKEN" ] && [ -n "$SOURCEGRAPH_URL" ]; then
+  echo "Setting up Sourcegraph MCP configuration..."
+  mkdir -p /root/.config/claude
+  cat > /root/.config/claude/mcp.json << MCPEOF
+{
+  "mcpServers": {
+    "sourcegraph": {
+      "command": "npx",
+      "args": ["-y", "@sourcegraph/mcp-server"],
+      "env": {
+        "SRC_ACCESS_TOKEN": "$SOURCEGRAPH_ACCESS_TOKEN",
+        "SOURCEGRAPH_URL": "$SOURCEGRAPH_URL"
+      }
+    }
+  }
+}
+MCPEOF
+  echo "OK MCP configuration created"
+else
+  echo "No Sourcegraph credentials provided, MCP disabled"
+fi
+exit 0
+"""
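The setup script above writes credentials into a JSON file from shell. One alternative, sketched here as an assumption rather than something this PR does, is to build the file with a JSON serializer so the values are substituted and escaped regardless of how the heredoc is quoted. `build_mcp_config` is a hypothetical helper and the URL is a placeholder.

```python
import json


def build_mcp_config(token: str, url: str) -> str:
    """Return mcp.json text with credential values substituted and JSON-escaped."""
    config = {
        "mcpServers": {
            "sourcegraph": {
                "command": "npx",
                "args": ["-y", "@sourcegraph/mcp-server"],
                "env": {"SRC_ACCESS_TOKEN": token, "SOURCEGRAPH_URL": url},
            }
        }
    }
    return json.dumps(config, indent=2)


# A token containing a double quote still yields valid JSON.
text = build_mcp_config('tok"en', "https://sourcegraph.example.com")
print(text)
```

In the real script the two arguments would come from `SOURCEGRAPH_ACCESS_TOKEN` and `SOURCEGRAPH_URL`, for example via `os.environ`.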
@@ -0,0 +1,79 @@
+{
+  "task_id": "docgen-changelog-001",
+  "description": "Terraform Changelog Generation",
+  "weights": {
+    "categorization": 0.4,
+    "completeness": 0.3,
+    "format": 0.3
+  },
+  "categorization": [
+    {
+      "id": "c1",
+      "description": "Features section present",
+      "patterns": [
+        "## New Features|### New Features|## Features|### Features|## Enhancements"
+      ],
+      "weight": 0.25
+    },
+    {
+      "id": "c2",
+      "description": "Bug fixes section present",
+      "patterns": [
+        "## Bug Fixes|### Bug Fixes|## Bugs|### Bugs"
+      ],
+      "weight": 0.25
+    },
+    {
+      "id": "c3",
+      "description": "Breaking changes section present",
+      "patterns": [
+        "## Breaking|### Breaking|BREAKING"
+      ],
+      "weight": 0.25
+    },
+    {
+      "id": "c4",
+      "description": "Component references in entries",
+      "patterns": [
+        "backend/|command/|internal/|provider"
+      ],
+      "weight": 0.25
+    }
+  ],
+  "completeness": [
+    {
+      "id": "co1",
+      "description": "At least 3 entries total",
+      "patterns": [
+        "^-\\s+|^\\*\\s+"
+      ],
+      "weight": 0.5
+    },
+    {
+      "id": "co2",
+      "description": "References to actual Terraform subsystems",
+      "patterns": [
+        "state|plan|apply|provider|backend|workspace|module"
+      ],
+      "weight": 0.5
+    }
+  ],
+  "format": [
+    {
+      "id": "f1",
+      "description": "Markdown heading structure used",
+      "patterns": [
+        "^##|^###"
+      ],
+      "weight": 0.5
+    },
+    {
+      "id": "f2",
+      "description": "Consistent entry format (dash or bullet)",
+      "patterns": [
+        "^- |^\\* "
+      ],
+      "weight": 0.5
+    }
+  ]
+}
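For reviewers checking the weights: the scorer in `tests/test.sh` normalizes item weights within each section, then multiplies the section result by the section weight. Below is a minimal standalone sketch of that arithmetic. The trimmed ground truth and the sample text are illustrative assumptions, not the real file, and the helper names are hypothetical.

```python
import re

# Hypothetical, trimmed-down ground truth in the same shape as ground_truth.json.
ground_truth = {
    "weights": {"categorization": 0.4, "completeness": 0.3, "format": 0.3},
    "categorization": [
        {"id": "c1", "patterns": ["### New Features"], "weight": 0.5},
        {"id": "c3", "patterns": ["### Breaking"], "weight": 0.5},
    ],
    "completeness": [
        {"id": "co2", "patterns": ["state|plan|apply"], "weight": 1.0},
    ],
    "format": [
        {"id": "f2", "patterns": [r"^- |^\* "], "weight": 1.0},
    ],
}

FLAGS = re.MULTILINE | re.IGNORECASE


def section_score(items, text):
    """Fraction of a section's item weight whose patterns match the text."""
    total = sum(item["weight"] for item in items)
    hit = sum(
        item["weight"]
        for item in items
        if any(re.search(p, text, FLAGS) for p in item["patterns"])
    )
    return hit / total if total else 0.0


def total_score(gt, text):
    """Weighted sum of section scores, as in the checklist reward."""
    return sum(w * section_score(gt.get(k, []), text) for k, w in gt["weights"].items())


sample = "### New Features\n\n- **command/apply**: faster plan rendering\n"
# categorization 0.5 (no Breaking section), completeness 1.0, format 1.0
print(round(total_score(ground_truth, sample), 2))  # 0.4*0.5 + 0.3 + 0.3 = 0.8
```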
@@ -0,0 +1,98 @@
+#!/bin/bash
+# Reward: checklist (0.0-1.0) — documentation quality score
+set -uo pipefail
+
+[ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ] && source /tests/sgonly_verifier_wrapper.sh
+
+mkdir -p /logs/verifier
+
+OUTPUT_FILE="/workspace/CHANGELOG.md"
+EXTRA_FILES=""
+
+# Fallback: search for any .md file the agent wrote
+if [ ! -f "$OUTPUT_FILE" ]; then
+    ALT=$(find /workspace -maxdepth 2 -name "*.md" ! -name "README*" ! -name "CONTRIBUTING*" 2>/dev/null | head -1)
+    OUTPUT_FILE="${ALT:-$OUTPUT_FILE}"
+fi
+
+if [ ! -f "$OUTPUT_FILE" ]; then
+    echo "No output file found at /workspace/CHANGELOG.md"
+    echo "0.0" > /logs/verifier/reward.txt
+    exit 0
+fi
+
+# Write scorer script
+cat > /tmp/_scorer.py << 'SCOREEOF'
+import sys, re, json, os
+
+output_path = sys.argv[1]
+extra_paths = sys.argv[2:-1]
+gt_path = sys.argv[-1]
+
+try:
+    content = open(output_path).read()
+    gt = json.load(open(gt_path))
+except Exception as e:
+    print(f"Error reading files: {e}", file=sys.stderr)
+    print("0.0", end="")
+    sys.exit(0)
+
+# Also read any extra source files (for inline doc tasks)
+extra_content = ""
+for ep in extra_paths:
+    if ep != gt_path and os.path.isfile(ep):
+        try:
+            extra_content += "\n" + open(ep).read()
+        except Exception:
+            pass
+
+full_content = content + "\n" + extra_content
+
+weights = gt.get("weights", {})
+
+def check_section(section_items):
+    if not section_items:
+        return 0.0
+    total_weight = sum(item.get("weight", 1.0 / len(section_items)) for item in section_items)
+    section_score = 0.0
+    for item in section_items:
+        item_weight = item.get("weight", 1.0 / len(section_items))
+        patterns = item.get("patterns", [])
+        matched = False
+        for p in patterns:
+            try:
+                if re.search(p, full_content, re.MULTILINE | re.IGNORECASE):
+                    matched = True
+                    break
+            except re.error:
+                if p.lower() in full_content.lower():
+                    matched = True
+                    break
+        if matched:
+            section_score += item_weight / total_weight if total_weight > 0 else 0.0
+            print(f"  PASS: {item['description']}", file=sys.stderr)
+        else:
+            print(f"  FAIL: {item['description']}", file=sys.stderr)
+    return section_score
+
+total_score = 0.0
+for section_key, section_weight in weights.items():
+    section_items = gt.get(section_key, [])
+    section_score = check_section(section_items)
+    contribution = section_score * float(section_weight)
+    total_score += contribution
+    print(f"Section '{section_key}': {section_score:.2f} * {section_weight} = {contribution:.3f}", file=sys.stderr)
+
+total_score = min(1.0, max(0.0, total_score))
+print(f"Total score: {total_score:.2f}", file=sys.stderr)
+print(f"{total_score:.2f}", end="")
+
+SCOREEOF
+
+# Run the scorer once: stdout is the score, stderr (per-item PASS/FAIL) goes to a log
+SCORE=$(python3 /tmp/_scorer.py "$OUTPUT_FILE" $EXTRA_FILES /tests/ground_truth.json 2>/logs/verifier/scorer.log)
+SCORE="${SCORE:-0.0}"
+
+echo "$SCORE" > /logs/verifier/reward.txt
+echo "Score: $SCORE"
+exit 0
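Note on item `co1` ("At least 3 entries total"): the scorer calls `re.search`, which only establishes that a pattern matches at least once, so a single bullet passes that item. If a minimum count is intended, the matches need to be counted. A minimal sketch, assuming a hypothetical helper `has_min_entries` that is not part of this PR:

```python
import re

# Same bullet pattern as co1 ("^-\s+|^\*\s+"), compiled for multi-line input.
ENTRY_PATTERN = re.compile(r"^(?:-|\*)\s+", re.MULTILINE)


def has_min_entries(text: str, minimum: int = 3) -> bool:
    """True when the text contains at least `minimum` bullet entries."""
    return len(ENTRY_PATTERN.findall(text)) >= minimum


one_entry = "- only entry\n"
three_entries = "- a\n- b\n* c\n"

# Presence check (what re.search gives) versus count check.
print(ENTRY_PATTERN.search(one_entry) is not None, has_min_entries(one_entry))
print(has_min_entries(three_entries))
```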
@@ -0,0 +1,20 @@
+FROM golang:1.22-bookworm
+
+WORKDIR /workspace
+
+RUN apt-get update && apt-get install -y \
+    git \
+    curl \
+    python3 \
+    npm \
+    ripgrep \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN npm install -g @anthropic-ai/claude-code
+
+RUN git clone --filter=blob:none --no-checkout https://github.com/flipt-io/flipt.git . && \
+    git checkout 3d5a345f94c2adc8a0eaa102c189c08ad4c0f8e8 && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent"
+
+RUN mkdir -p /app
@@ -0,0 +1,59 @@
+# Task: Generate Flipt Release Notes
+
+**Repository:** flipt-io/flipt
+**Output:** Write your release notes to `/workspace/RELEASE_NOTES.md`
+
+## Objective
+
+Generate comprehensive release notes for Flipt summarizing API changes. Flipt is an open-source feature flag management system with REST and gRPC APIs. Analyze the codebase to identify what changed in terms of the public API surface.
+
+## Scope
+
+Analyze the following areas for API changes:
+- `rpc/flipt/` — protobuf API definitions (gRPC surface)
+- `internal/server/` — REST API handlers
+- `internal/storage/` — storage backend changes
+- `ui/` — frontend changes if any significant ones exist
+- `CHANGELOG.md` in the repository (use as reference, not as copy)
+
+## Output Format
+
+Write to `/workspace/RELEASE_NOTES.md`:
+
+```markdown
+# Flipt Release Notes
+
+## Breaking Changes
+
+> These changes require action from users upgrading.
+
+- **[API/Storage/Config]**: Description + migration path
+
+## New Features
+
+- **[component]**: Description with API example if applicable
+
+## Deprecations
+
+- **[component]**: What is deprecated, what to use instead, when it will be removed
+
+## Bug Fixes
+
+- **[component]**: Description
+
+## Upgrade Guide
+
+Step-by-step instructions for users upgrading from the previous version.
+```
+
+## Quality Bar
+
+- Every breaking change must include a migration path
+- Every deprecation must specify the replacement
+- New features must reference the specific API endpoint or config option
+- The upgrade guide must be actionable (numbered steps)
+
+## Anti-Requirements
+
+- Do not copy the existing CHANGELOG.md
+- Do not fabricate API endpoints — verify in rpc/ or internal/server/