sourcegraph
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/environment/Dockerfile b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/environment/Dockerfile
Lines changed: 5 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/instruction.md b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/instruction.md
Lines changed: 73 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/task.toml b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/task.toml
Lines changed: 54 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/ground_truth.json b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/ground_truth.json
Lines changed: 8 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/test.sh b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/test.sh
Lines changed: 61 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/verifiers.py b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-201/tests/verifiers.py
Lines changed: 75 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-202/environment/Dockerfile b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-202/environment/Dockerfile
Lines changed: 5 additions & 0 deletions
diff --git a/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-202/instruction.md b/benchmarks/ccb_mcp_onboarding/ccx-onboard-search-202/instruction.md
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,5 @@
+FROM python:3.11-slim
+RUN apt-get update && apt-get install -y git curl && rm -rf /var/lib/apt/lists/*
+RUN pip install --no-cache-dir numpy
+WORKDIR /app
+RUN mkdir -p /logs/agent /logs/verifier
@@ -0,0 +1,73 @@
+        # RepoQA: Semantic Retrieval (SR-QA)
+
+        ## Task: Find the Function
+
+        You are searching a large codebase for a specific function based on its behavior.
+
+        **Repository**: kubernetes/kubernetes
+        **Language**: go
+
+        ## Function Description
+
+        ```
+        1. **Purpose**: Identifies which cluster nodes satisfy all scheduling filter plugins for a given workload, enabling the scheduler to narrow down placement candidates.
+2. **Input**: Takes a context, a framework handle providing filter plugins and parallelism settings, cycle state, a pod specification, a diagnosis collector for recording filter failures, and a pre-fetched list of all node information objects.
+3. **Output**: Returns a slice of node-info objects representing feasible placement targets, plus an error if any filter plugin returned a fatal error. Also populates the diagnosis object with per-node failure reasons as a side effect.
+4. **Procedure**:
+   - Computes the target number of feasible nodes to find, reducing to 1 if there are no extender filters and no scoring plugins.
+   - If no filter plugins are registered, returns the first N nodes starting from a round-robin offset.
+   - Otherwise, defines an inner closure that runs all filter plugins against each node in parallel, starting from the last scheduling cycle's offset to ensure fairness.
+   - Uses atomic counters to track how many feasible nodes have been found; cancels the parallel search early once the target count is reached.
+   - Records non-feasible node statuses into a result array under the parallel check, then copies them into the diagnosis object after all parallel work completes.
+   - Measures and reports the total Filter extension point latency via deferred metrics emission.
+        ```
+
+        ## Search Strategy
+
+        This function **cannot be found by searching for its name** because the name is not provided. You must:
+
+        1. **Understand the behavior** described above
+        2. **Search the codebase** to find functions matching this behavior
+        3. **Explore the code** using call graphs and references
+        4. **Narrow down** candidates until you find the exact function
+
+
+        ## Output Format
+
+        You MUST provide your answer as valid JSON and **SAVE IT TO A FILE**:
+
+        ```json
+        {
+          "function_path": "path/to/file.ext",
+          "function_name": "the_function_name",
+          "justification": "Why this function matches: describe the behavior you found"
+        }
+        ```
+
+        **CRITICAL**: You MUST save the JSON to `/app/solution.json`. This location is required for verification.
+
+        **Your final step MUST be to run this exact bash command:**
+
+        ```bash
+        cat > /app/solution.json << 'JSONEOF'
+        {
+          "function_path": "ACTUAL_PATH",
+          "function_name": "ACTUAL_NAME",
+          "justification": "ACTUAL_JUSTIFICATION_TEXT"
+        }
+        JSONEOF
+        ```
+
+        ## Notes
+
+        - The file path should be relative to repository root
+        - Function names are case-sensitive
+        - Provide your best match even if uncertain; explain your reasoning
+        - The justification is scored on how well it explains the function's behavior
+
+        ## Scoring
+
+        - **Perfect** (1.0): Correct path AND name
+        - **Good** (0.7-0.9): Correct path, similar name OR vice versa
+        - **Partial** (0.3-0.6): Close approximation
+        - **Incorrect** (0.0): Wrong function entirely
@@ -0,0 +1,54 @@
+version = "1.0"
+
+[metadata]
+name = "ccx-onboard-search-201"
+description = "Find a function in kubernetes/kubernetes (4M+ LOC) from a behavioral description"
+difficulty = "hard"
+category = "semantic-code-navigation"
+tags = ["ccb_mcp_onboarding", "go", "sr-qa", "large-repo", "repoqa"]
+language = "go"
+
+[task]
+id = "ccx-onboard-search-201"
+repo = "kubernetes/kubernetes"
+category = "ccb_mcp_onboarding"
+language = "go"
+difficulty = "hard"
+time_limit_sec = 1200
+
+[verification]
+type = "test"
+command = "bash /tests/test.sh"
+reward_type = "semantic_similarity"
+description = "Correct function retrieval similarity score"
+
+[environment]
+build_timeout_sec = 1800.0
+cpus = 2
+memory = "4G"
+storage = "10G"
+
+[environment.setup_scripts]
+mcp_config = """#!/bin/bash
+if [ -n "$SOURCEGRAPH_ACCESS_TOKEN" ] && [ -n "$SOURCEGRAPH_URL" ]; then
+  mkdir -p /root/.config/claude
+  cat > /root/.config/claude/mcp.json << MCPEOF
+{
+  "mcpServers": {
+    "sourcegraph": {
+      "command": "npx",
+      "args": ["-y", "@sourcegraph/mcp-server"],
+      "env": {
+        "SRC_ACCESS_TOKEN": "$SOURCEGRAPH_ACCESS_TOKEN",
+        "SOURCEGRAPH_URL": "$SOURCEGRAPH_URL"
+      }
+    }
+  }
+}
+MCPEOF
+  echo "MCP configuration created"
+else
+  echo "No Sourcegraph credentials provided, MCP disabled"
+fi
+exit 0
+"""
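The setup script above writes `mcp.json` only when both Sourcegraph variables are set. The following is a minimal Python sketch of that same condition, assuming the intent is for the credential values themselves (not the variable names) to end up in the file. The helper name `build_mcp_config` and the example URL are hypothetical; the JSON keys and the `@sourcegraph/mcp-server` package name are copied from the script.

```python
import json
from typing import Mapping, Optional


def build_mcp_config(env: Mapping[str, str]) -> Optional[dict]:
    """Return the MCP config dict, or None when credentials are missing."""
    token = env.get("SOURCEGRAPH_ACCESS_TOKEN")
    url = env.get("SOURCEGRAPH_URL")
    # Mirrors the shell test: both variables must be non-empty.
    if not token or not url:
        return None
    return {
        "mcpServers": {
            "sourcegraph": {
                "command": "npx",
                "args": ["-y", "@sourcegraph/mcp-server"],
                "env": {
                    "SRC_ACCESS_TOKEN": token,
                    "SOURCEGRAPH_URL": url,
                },
            }
        }
    }


# Hypothetical values, for illustration only.
example_env = {
    "SOURCEGRAPH_ACCESS_TOKEN": "example-token",
    "SOURCEGRAPH_URL": "https://sourcegraph.example.com",
}
config = build_mcp_config(example_env)
print(json.dumps(config, indent=2))
print(build_mcp_config({}))  # no credentials: MCP stays disabled
```

Building the dict and serializing it with `json.dumps` also escapes any quotes or backslashes in the values, which a raw heredoc would not do.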
@@ -0,0 +1,8 @@
+{
+  "function_id": "pkg/scheduler/schedule_one.go::findNodesThatPassFilters",
+  "canonical_path": "pkg/scheduler/schedule_one.go",
+  "canonical_name": "findNodesThatPassFilters",
+  "language": "go",
+  "nl_description": "1. **Purpose**: Identifies which cluster nodes satisfy all scheduling filter plugins for a given workload, enabling the scheduler to narrow down placement candidates.\n2. **Input**: Takes a context, a framework handle providing filter plugins and parallelism settings, cycle state, a pod specification, a diagnosis collector for recording filter failures, and a pre-fetched list of all node information objects.\n3. **Output**: Returns a slice of node-info objects representing feasible placement targets, plus an error if any filter plugin returned a fatal error. Also populates the diagnosis object with per-node failure reasons as a side effect.\n4. **Procedure**:\n   - Computes the target number of feasible nodes to find, reducing to 1 if there are no extender filters and no scoring plugins.\n   - If no filter plugins are registered, returns the first N nodes starting from a round-robin offset.\n   - Otherwise, defines an inner closure that runs all filter plugins against each node in parallel, starting from the last scheduling cycle's offset to ensure fairness.\n   - Uses atomic counters to track how many feasible nodes have been found; cancels the parallel search early once the target count is reached.\n   - Records non-feasible node statuses into a result array under the parallel check, then copies them into the diagnosis object after all parallel work completes.\n   - Measures and reports the total Filter extension point latency via deferred metrics emission.",
+  "task_variant": "sr-qa"
+}
@@ -0,0 +1,61 @@
+#!/bin/bash
+# RepoQA SR-QA Verification Script
+echo "Starting RepoQA verifier..." 1>&2
+cd /app || { echo "ERROR: Cannot cd to /app"; exit 1; }
+mkdir -p /logs/verifier
+
+if [ ! -f /tests/ground_truth.json ]; then
+    echo "ERROR: No ground_truth.json found at /tests/ground_truth.json"
+    echo '{"score": 0.0}' > /logs/verifier/reward.json
+    echo "0.0" > /logs/verifier/reward.txt
+    exit 0
+fi
+
+SOLUTION_FILE="/app/solution.json"
+if [ ! -f "$SOLUTION_FILE" ]; then
+    echo "ERROR: Agent did not create solution.json in /app/"
+    echo '{"score": 0.0}' > /logs/verifier/reward.json
+    echo "0.0" > /logs/verifier/reward.txt
+    exit 0
+fi
+
+cat > /tmp/verify.py << 'PYEOF'
+import json, sys, re
+sys.path.insert(0, "/tests")
+from verifiers import SemanticRetrievalQAVerifier
+
+try:
+    with open("/tests/ground_truth.json") as f:
+        ground_truth = json.load(f)
+    with open("/app/solution.json") as f:
+        raw = f.read()
+    matches = re.findall(r"```(?:json)?\s*\n(.*?)```", raw, re.DOTALL)
+    if matches:
+        raw = matches[-1].strip()
+    agent_output = json.loads(raw)
+
+    verifier = SemanticRetrievalQAVerifier(ground_truth)
+    result = verifier.verify(agent_output)
+    reward = {"score": float(result.correct_function)}
+
+    print(f"Correct Function: {result.correct_function:.2f}")
+    print(f"Correct Path: {result.correct_path:.2f}")
+    print(f"Justification: {result.justification_score:.2f}")
+    print(f"Details: {result.reasoning}")
+
+    with open("/logs/verifier/reward.json", "w") as f:
+        json.dump(reward, f, indent=2)
+    with open("/logs/verifier/reward.txt", "w") as f:
+        f.write(str(reward["score"]))
+except Exception as e:
+    import traceback
+    print(f"ERROR: {e}")
+    traceback.print_exc()
+    with open("/logs/verifier/reward.json", "w") as f:
+        json.dump({"score": 0.0}, f)
+    with open("/logs/verifier/reward.txt", "w") as f:
+        f.write("0.0")
+PYEOF
+
+python3 /tmp/verify.py 2>&1 | tee /logs/verifier/verify-debug.log
+exit 0
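The embedded `verify.py` accepts a `solution.json` that is either bare JSON or JSON wrapped in a Markdown code fence, taking the last fenced block when several are present. Here is a small standalone sketch of that extraction step. The function name `extract_solution` is hypothetical, and the fence string is assembled from a variable only so that this example contains no literal fence of its own.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly on purpose
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_solution(raw: str) -> dict:
    """Parse agent output, preferring the last fenced block if any exist."""
    matches = FENCE_RE.findall(raw)
    if matches:
        raw = matches[-1].strip()
    return json.loads(raw)


plain = '{"function_name": "f"}'
fenced = (
    "Here is my answer:\n"
    + FENCE + "json\n"
    + '{"function_name": "g"}\n'
    + FENCE + "\n"
)

print(extract_solution(plain)["function_name"])
print(extract_solution(fenced)["function_name"])
```

Input that is neither valid JSON nor fenced JSON raises `json.JSONDecodeError`, which the script's outer `except Exception` turns into a score of 0.0.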
@@ -0,0 +1,75 @@
+"""Verifiers for RepoQA SR-QA tasks. Scores agent function retrieval."""
+
+import json
+import re
+from dataclasses import dataclass
+from difflib import SequenceMatcher
+from pathlib import Path
+from typing import Any, Dict
+
+
+@dataclass
+class VerificationResult:
+    correct_function: float
+    correct_path: float
+    justification_score: float
+    reasoning: str = ""
+
+
+class SemanticRetrievalQAVerifier:
+    def __init__(self, ground_truth: Dict[str, Any]):
+        self.ground_truth = ground_truth
+
+    def verify(self, agent_output: Dict[str, Any]) -> VerificationResult:
+        try:
+            path = agent_output.get("function_path", "")
+            name = agent_output.get("function_name", "")
+            justification = agent_output.get("justification", "")
+        except (AttributeError, KeyError, TypeError) as e:
+            return VerificationResult(0.0, 0.0, 0.0, f"Invalid output: {e}")
+
+        canonical_path = self.ground_truth.get("canonical_path", "")
+        canonical_name = self.ground_truth.get("canonical_name", "")
+        nl_description = self.ground_truth.get("nl_description", "")
+
+        path_score = self._path_similarity(path, canonical_path)
+        name_score = self._name_similarity(name, canonical_name)
+
+        if path_score == 1.0 and name_score == 1.0:
+            function_score = 1.0
+        elif path_score == 1.0 and name_score > 0.7:
+            function_score = 0.8
+        elif path_score > 0.8 and name_score == 1.0:
+            function_score = 0.8
+        elif path_score > 0.5 and name_score > 0.5:
+            function_score = 0.3
+        else:
+            function_score = 0.0
+
+        justification_score = self._keyword_overlap(justification, nl_description)
+
+        reasoning = (
+            f"Path match: {path_score:.2f} (expected {canonical_path})\n"
+            f"Name match: {name_score:.2f} (expected {canonical_name})\n"
+            f"Justification keywords: {justification_score:.2f}"
+        )
+        return VerificationResult(function_score, path_score, justification_score, reasoning)
+
+    @staticmethod
+    def _path_similarity(p1: str, p2: str) -> float:
+        p1, p2 = Path(p1).as_posix(), Path(p2).as_posix()
+        return 1.0 if p1 == p2 else SequenceMatcher(None, p1, p2).ratio()
+
+    @staticmethod
+    def _name_similarity(n1: str, n2: str) -> float:
+        return 1.0 if n1 == n2 else SequenceMatcher(None, n1.lower(), n2.lower()).ratio()
+
+    @staticmethod
+    def _keyword_overlap(text1: str, text2: str) -> float:
+        if not text1 or not text2:
+            return 0.0
+        w1 = set(re.findall(r"\w+", text1.lower()))
+        w2 = set(re.findall(r"\w+", text2.lower()))
+        if not w1 or not w2:
+            return 0.0
+        return len(w1 & w2) / len(w1 | w2)
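To make the scoring tiers in `verify` easier to check by hand, here is a self-contained sketch that restates them outside the class. The helper names `name_similarity`, `function_score`, and `keyword_overlap` are hypothetical; the thresholds and the Jaccard formula are copied from the code above.

```python
import re
from difflib import SequenceMatcher


def name_similarity(n1: str, n2: str) -> float:
    """Exact match scores 1.0; otherwise compare lowercased strings."""
    return 1.0 if n1 == n2 else SequenceMatcher(None, n1.lower(), n2.lower()).ratio()


def function_score(path_score: float, name_score: float) -> float:
    """Map the two similarity scores onto the verifier's discrete tiers."""
    if path_score == 1.0 and name_score == 1.0:
        return 1.0
    if path_score == 1.0 and name_score > 0.7:
        return 0.8
    if path_score > 0.8 and name_score == 1.0:
        return 0.8
    if path_score > 0.5 and name_score > 0.5:
        return 0.3
    return 0.0


def keyword_overlap(text1: str, text2: str) -> float:
    """Jaccard similarity over lowercased word sets."""
    w1 = set(re.findall(r"\w+", text1.lower()))
    w2 = set(re.findall(r"\w+", text2.lower()))
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)


print(function_score(1.0, 1.0))   # exact path and name
print(function_score(1.0, 0.75))  # exact path, similar name
print(function_score(0.6, 0.6))   # close approximation
print(function_score(1.0, 0.4))   # name too far off
print(keyword_overlap("filter nodes", "filter plugins nodes"))
# A name differing only in case still scores 1.0, because the
# fallback comparison lowercases both sides.
print(name_similarity("FindNodesThatPassFilters", "findNodesThatPassFilters"))
```

Because the name comparison lowercases both strings before computing the ratio, a name that differs only in letter case receives the full name score, even though the task instructions describe function names as case-sensitive.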
@@ -0,0 +1,5 @@
+FROM python:3.11-slim
+RUN apt-get update && apt-get install -y git curl && rm -rf /var/lib/apt/lists/*
+RUN pip install --no-cache-dir numpy
+WORKDIR /app
+RUN mkdir -p /logs/agent /logs/verifier
@@ -0,0 +1,74 @@
+        # RepoQA: Semantic Retrieval (SR-QA)
+
+        ## Task: Find the Function
+
+        You are searching a large codebase for a specific function based on its behavior.
+
+        **Repository**: kubernetes/kubernetes
+        **Language**: go
+
+        ## Function Description
+
+        ```
+        1. **Purpose**: Performs a single synchronization cycle of the node eviction manager, evaluating current resource usage against configured thresholds and, if necessary, selecting and terminating one workload to relieve resource pressure.
+2. **Input**: Operates as a method on the eviction manager, receiving a context, a list of active workloads (pods), a function to retrieve resource usage statistics, and a function to check if a pod has been cleaned up. It implicitly reads node summary statistics from the summary provider.
+3. **Output**: Returns a slice of pods that were evicted during this cycle (at most one) and an error. As side effects, it updates internal state: the set of met thresholds, node condition timestamps, and observation history.
+4. **Procedure**:
+   - Refreshes memory threshold notifiers from the latest statistics summary.
+   - Computes signal observations (e.g., memory available, disk available) and determines which thresholds are currently met, both ignoring and respecting grace periods.
+   - Tracks when each threshold was first observed and when each node condition was last observed, applying a transition period before declaring conditions active.
+   - Filters thresholds to only those whose grace periods are fully met and whose stats have been updated since the last sync.
+   - Checks for local storage eviction violations first (pod-level disk usage); if any pods are evicted there, returns early.
+   - Sorts remaining thresholds by eviction priority, identifies the highest-priority reclaimable resource, and first attempts node-level reclamation (e.g., garbage-collecting images or containers).
+   - If node-level reclamation is insufficient, ranks all active pods using a signal-specific ranking function, then iterates through ranked pods and evicts the first one that can be killed.
+        ```
+
+        ## Search Strategy
+
+        This function **cannot be found by searching for its name** because the name is not provided. You must:
+
+        1. **Understand the behavior** described above
+        2. **Search the codebase** to find functions matching this behavior
+        3. **Explore the code** using call graphs and references
+        4. **Narrow down** candidates until you find the exact function
+
+
+        ## Output Format
+
+        You MUST provide your answer as valid JSON and **SAVE IT TO A FILE**:
+
+        ```json
+        {
+          "function_path": "path/to/file.ext",
+          "function_name": "the_function_name",
+          "justification": "Why this function matches: describe the behavior you found"
+        }
+        ```
+
+        **CRITICAL**: You MUST save the JSON to `/app/solution.json`. This location is required for verification.
+
+        **Your final step MUST be to run this exact bash command:**
+
+        ```bash
+        cat > /app/solution.json << 'JSONEOF'
+        {
+          "function_path": "ACTUAL_PATH",
+          "function_name": "ACTUAL_NAME",
+          "justification": "ACTUAL_JUSTIFICATION_TEXT"
+        }
+        JSONEOF
+        ```
+
+        ## Notes
+
+        - The file path should be relative to repository root
+        - Function names are case-sensitive
+        - Provide your best match even if uncertain; explain your reasoning
+        - The justification is scored on how well it explains the function's behavior
+
+        ## Scoring
+
+        - **Perfect** (1.0): Correct path AND name
+        - **Good** (0.7-0.9): Correct path, similar name OR vice versa
+        - **Partial** (0.3-0.6): Close approximation
+        - **Incorrect** (0.0): Wrong function entirely