[Example] browse comp plus #377
Conversation
Summary of Changes

Hello @garyzhang99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, self-contained training example for the BrowseComp-Plus dataset, focused on developing ReAct-based search agents. It provides all the necessary components: a new workflow that integrates local search and an LLM-as-judge for evaluation, a configuration file, and a utility script for dataset preparation. The example demonstrates how to train agents that reason and act with search tools to answer questions, with clear setup and execution instructions.
Code Review
This pull request adds a new training example for the BrowseComp-Plus dataset, which is a great addition. The implementation includes a new workflow, data preparation script, configuration, and documentation. The code is generally well-structured and follows good practices. I've identified a few areas for improvement, mainly concerning robustness and consistency. My key feedback includes addressing a hardcoded tokenizer path, fixing unsafe string formatting that could break JSON parsing, and ensuring consistency in user-facing instructions. Please see the detailed comments for specific suggestions.
def __init__(
    self,
    searcher,
    snippet_max_tokens: int | None = None,
    k: int = 5,
    include_get_document: bool = True,
):
    self.searcher = searcher
    self.snippet_max_tokens = snippet_max_tokens
    self.k = k
    self.include_get_document = include_get_document

    self.tokenizer = None
    if snippet_max_tokens and snippet_max_tokens > 0:
        try:
            from transformers import AutoTokenizer

            self.tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
        except Exception:
            self.tokenizer = None
The tokenizer path is hardcoded to Qwen/Qwen3-0.6B, which may not match the agent model specified in the configuration (e.g., Qwen/Qwen3-4B-Instruct-2507). Using a mismatched tokenizer for truncation can lead to incorrect behavior and is not robust.
To fix this, you should pass the model's path from the workflow to the SearchToolHandler and use it to load the correct tokenizer. You'll need to update the __init__ method here and then pass self.model_name when instantiating SearchToolHandler in _init_searcher.
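To see why the tokenizer choice matters, consider a rough sketch of the snippet truncation the handler presumably performs. The helper name `truncate_snippet` and its whitespace fallback are illustrative assumptions, not code from this PR: a mismatched tokenizer segments the same text into a different number of tokens, so the truncation boundary lands in the wrong place relative to what the agent model actually sees.

```python
def truncate_snippet(text: str, max_tokens: int, tokenizer=None) -> str:
    """Truncate `text` to at most `max_tokens` tokens.

    Hypothetical sketch: falls back to whitespace tokenization when no
    tokenizer is available, mirroring the `self.tokenizer = None` fallback.
    """
    if tokenizer is None:
        words = text.split()
        if len(words) <= max_tokens:
            return text
        return " ".join(words[:max_tokens]) + " ..."
    # With a real tokenizer, truncate on token ids and decode back to text.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    return tokenizer.decode(token_ids[:max_tokens]) + " ..."
```

Because two tokenizers rarely agree on token counts, truncating with `Qwen/Qwen3-0.6B` while serving `Qwen/Qwen3-4B-Instruct-2507` can silently over- or under-truncate snippets, which is the robustness concern above.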
def __init__(
    self,
    searcher,
    snippet_max_tokens: int | None = None,
    k: int = 5,
    include_get_document: bool = True,
    tokenizer_path: str | None = None,
):
    self.searcher = searcher
    self.snippet_max_tokens = snippet_max_tokens
    self.k = k
    self.include_get_document = include_get_document
    self.tokenizer = None
    if snippet_max_tokens and snippet_max_tokens > 0:
        try:
            from transformers import AutoTokenizer

            path_to_load = tokenizer_path or "Qwen/Qwen3-0.6B"
            self.tokenizer = AutoTokenizer.from_pretrained(path_to_load)
        except Exception:
            self.tokenizer = None

judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
  "extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
  "ground_truth": "{correct_answer}",
  "reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
  "is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
The correct_answer is directly embedded into the JSON structure of the judge_prompt using an f-string. If correct_answer contains special characters like double quotes, it will break the JSON format in the prompt, likely causing the judge model to fail or produce incorrect output.
To prevent this, you should use json.dumps() to safely serialize the correct_answer into a valid JSON string within the prompt.
judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
  "extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
  "ground_truth": {json.dumps(correct_answer)},
  "reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
  "is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
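The failure mode the review describes can be shown with a short standalone demo. The `correct_answer` value below is made up for illustration; note that the fix also requires `import json` at the top of the workflow module if it is not already there:

```python
import json

# A ground-truth answer containing a double quote, as might plausibly
# appear in the dataset (the value itself is invented for this demo).
correct_answer = 'The song "Imagine" (1971)'

# Naive f-string embedding, as in the original code, yields invalid JSON:
naive = f'{{"ground_truth": "{correct_answer}"}}'
naive_broke = False
try:
    json.loads(naive)
except json.JSONDecodeError:
    naive_broke = True  # inner quotes terminate the string early

# json.dumps() escapes the quotes, so the template stays parseable:
safe = f'{{"ground_truth": {json.dumps(correct_answer)}}}'
parsed = json.loads(safe)
```

Here `naive_broke` ends up `True` while `parsed["ground_truth"]` round-trips the original value, which is exactly why the suggestion serializes `correct_answer` with `json.dumps()` instead of interpolating it inside quotes.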
trinity/common/workflows/envs/browse_comp_plus/bcp_simple_react_workflow.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description
As the title says.
Checklist
Please check the following items before the code is ready for review.