
Conversation

@garyzhang99
Collaborator

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @garyzhang99, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, self-contained training example for the BrowseComp-Plus dataset, focusing on developing ReAct-based search agents. It provides all necessary components, including a new workflow that integrates local search capabilities and an LLM-as-judge for evaluation, a configuration file, and a utility script for dataset preparation. The example aims to demonstrate how to train agents that can reason and act using search tools to answer questions, with clear setup and execution instructions.

Highlights

  • New BrowseComp-Plus Training Example: Introduces a complete example for training a ReAct search agent on the BrowseComp-Plus dataset.
  • ReAct Agent Workflow: Adds BCPSimpleToolReActWorkflow which leverages OpenAI function calling for local search and document retrieval.
  • LLM-as-Judge Evaluation: Implements an LLM-as-judge mechanism for evaluating agent responses and generating reward signals, using a structured JSON output format.
  • Data Preparation Script: Includes a Python script to convert the BrowseComp-Plus dataset into the Trinity-RFT format, facilitating train/test splits.
  • Comprehensive Configuration: Provides a detailed YAML configuration (bcp_config.yaml) for setting up the training environment, including model paths, searcher types, and evaluation parameters.
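
The LLM-as-judge output format described in the highlights (a JSON object with an is_correct boolean, shown in full in the review comments below) lends itself to a simple reward parser. The following is a minimal sketch, not taken from the PR; parse_judge_output is a hypothetical helper, and the markdown-fence stripping is a common defensive measure, not a documented behavior of the judge model:

```python
import json


def parse_judge_output(raw: str) -> float:
    """Turn the judge model's JSON verdict into a scalar reward.

    Returns 1.0 when the judge marked the answer correct, 0.0
    otherwise -- including when the judge output is malformed.
    """
    try:
        # Some models wrap JSON in markdown fences; strip them first.
        cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
        verdict = json.loads(cleaned)
        return 1.0 if verdict.get("is_correct") is True else 0.0
    except (json.JSONDecodeError, AttributeError):
        # Non-JSON output or a JSON value that is not an object.
        return 0.0
```

Defaulting to 0.0 on malformed output keeps the reward signal well-defined even when the judge model ignores the "JSON only" instruction.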
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist
Contributor
Code Review

This pull request adds a new training example for the BrowseComp-Plus dataset, which is a great addition. The implementation includes a new workflow, data preparation script, configuration, and documentation. The code is generally well-structured and follows good practices. I've identified a few areas for improvement, mainly concerning robustness and consistency. My key feedback includes addressing a hardcoded tokenizer path, fixing unsafe string formatting that could break JSON parsing, and ensuring consistency in user-facing instructions. Please see the detailed comments for specific suggestions.

Comment on lines +31 to +50
def __init__(
    self,
    searcher,
    snippet_max_tokens: int | None = None,
    k: int = 5,
    include_get_document: bool = True,
):
    self.searcher = searcher
    self.snippet_max_tokens = snippet_max_tokens
    self.k = k
    self.include_get_document = include_get_document

    self.tokenizer = None
    if snippet_max_tokens and snippet_max_tokens > 0:
        try:
            from transformers import AutoTokenizer

            self.tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
        except Exception:
            self.tokenizer = None
Severity: high

The tokenizer path is hardcoded to Qwen/Qwen3-0.6B, which may not match the agent model specified in the configuration (e.g., Qwen/Qwen3-4B-Instruct-2507). Using a mismatched tokenizer for truncation can lead to incorrect behavior and is not robust.

To fix this, you should pass the model's path from the workflow to the SearchToolHandler and use it to load the correct tokenizer. You'll need to update the __init__ method here and then pass self.model_name when instantiating SearchToolHandler in _init_searcher.

    def __init__(
        self,
        searcher,
        snippet_max_tokens: int | None = None,
        k: int = 5,
        include_get_document: bool = True,
        tokenizer_path: str | None = None,
    ):
        self.searcher = searcher
        self.snippet_max_tokens = snippet_max_tokens
        self.k = k
        self.include_get_document = include_get_document

        self.tokenizer = None
        if snippet_max_tokens and snippet_max_tokens > 0:
            try:
                from transformers import AutoTokenizer

                path_to_load = tokenizer_path or "Qwen/Qwen3-0.6B"
                self.tokenizer = AutoTokenizer.from_pretrained(path_to_load)
            except Exception:
                self.tokenizer = None

Comment on lines +523 to +561
judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
"extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
"ground_truth": "{correct_answer}",
"reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
"is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
Severity: high

The correct_answer is directly embedded into the JSON structure of the judge_prompt using an f-string. If correct_answer contains special characters like double quotes, it will break the JSON format in the prompt, likely causing the judge model to fail or produce incorrect output.

To prevent this, you should use json.dumps() to safely serialize the correct_answer into a valid JSON string within the prompt.

        judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
  "extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
  "ground_truth": {json.dumps(correct_answer)},
  "reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
  "is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
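
The escaping behavior the suggestion relies on can be checked with a small standalone snippet (not from the PR): naively interpolating a string containing double quotes into a JSON template produces invalid JSON, while json.dumps() escapes it correctly.

```python
import json

correct_answer = 'He said "42"'

# Naive f-string embedding breaks the JSON when the answer contains
# double quotes:
naive = f'{{"ground_truth": "{correct_answer}"}}'
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# json.dumps() escapes the inner quotes, so the template stays valid
# JSON and round-trips the original string:
safe = f'{{"ground_truth": {json.dumps(correct_answer)}}}'
parsed = json.loads(safe)
```

Note that json.dumps() emits the surrounding quotes itself, which is why the suggested fix drops the quote characters around the placeholder.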

garyzhang99 and others added 3 commits November 13, 2025 17:26
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@hiyuchang hiyuchang changed the title Add training on browse comp plus as an training example [Example] browse comp plus Nov 26, 2025