[Example] browse comp plus #377
Conversation
Summary of Changes

Hello @garyzhang99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, self-contained training example for the BrowseComp-Plus dataset, focused on developing ReAct-based search agents. It provides all the necessary components: a new workflow that integrates local search and an LLM-as-judge for evaluation, a configuration file, and a utility script for dataset preparation. The example demonstrates how to train agents that reason and act with search tools to answer questions, with clear setup and execution instructions.
Code Review
This pull request adds a new training example for the BrowseComp-Plus dataset, which is a great addition. The implementation includes a new workflow, data preparation script, configuration, and documentation. The code is generally well-structured and follows good practices. I've identified a few areas for improvement, mainly concerning robustness and consistency. My key feedback includes addressing a hardcoded tokenizer path, fixing unsafe string formatting that could break JSON parsing, and ensuring consistency in user-facing instructions. Please see the detailed comments for specific suggestions.
def __init__(
    self,
    searcher,
    snippet_max_tokens: int | None = None,
    k: int = 5,
    include_get_document: bool = True,
):
    self.searcher = searcher
    self.snippet_max_tokens = snippet_max_tokens
    self.k = k
    self.include_get_document = include_get_document

    self.tokenizer = None
    if snippet_max_tokens and snippet_max_tokens > 0:
        try:
            from transformers import AutoTokenizer

            self.tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
        except Exception:
            self.tokenizer = None
The tokenizer path is hardcoded to Qwen/Qwen3-0.6B, which may not match the agent model specified in the configuration (e.g., Qwen/Qwen3-4B-Instruct-2507). Using a mismatched tokenizer for truncation can lead to incorrect behavior and is not robust.
To fix this, you should pass the model's path from the workflow to the SearchToolHandler and use it to load the correct tokenizer. You'll need to update the __init__ method here and then pass self.model_name when instantiating SearchToolHandler in _init_searcher.
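To see why the tokenizer choice matters, consider a rough sketch of the snippet truncation the handler presumably performs. The helper name `truncate_snippet` and its whitespace fallback are illustrative assumptions, not code from this PR: a mismatched tokenizer segments the same text into a different number of tokens, so the truncation boundary lands in the wrong place relative to what the agent model actually sees.

```python
def truncate_snippet(text: str, max_tokens: int, tokenizer=None) -> str:
    """Truncate `text` to at most `max_tokens` tokens.

    Hypothetical sketch: falls back to whitespace tokenization when no
    tokenizer is available, mirroring the `self.tokenizer = None` fallback.
    """
    if tokenizer is None:
        words = text.split()
        if len(words) <= max_tokens:
            return text
        return " ".join(words[:max_tokens]) + " ..."
    # With a real tokenizer, truncate on token ids and decode back to text.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    return tokenizer.decode(token_ids[:max_tokens]) + " ..."
```

Because two tokenizers rarely agree on token counts, truncating with `Qwen/Qwen3-0.6B` while serving `Qwen/Qwen3-4B-Instruct-2507` can silently over- or under-truncate snippets, which is the robustness concern above.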
def __init__(
    self,
    searcher,
    snippet_max_tokens: int | None = None,
    k: int = 5,
    include_get_document: bool = True,
    tokenizer_path: str | None = None,
):
    self.searcher = searcher
    self.snippet_max_tokens = snippet_max_tokens
    self.k = k
    self.include_get_document = include_get_document
    self.tokenizer = None
    if snippet_max_tokens and snippet_max_tokens > 0:
        try:
            from transformers import AutoTokenizer

            path_to_load = tokenizer_path or "Qwen/Qwen3-0.6B"
            self.tokenizer = AutoTokenizer.from_pretrained(path_to_load)
        except Exception:
            self.tokenizer = None

judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
  "extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
  "ground_truth": "{correct_answer}",
  "reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
  "is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
The correct_answer is directly embedded into the JSON structure of the judge_prompt using an f-string. If correct_answer contains special characters like double quotes, it will break the JSON format in the prompt, likely causing the judge model to fail or produce incorrect output.
To prevent this, you should use json.dumps() to safely serialize the correct_answer into a valid JSON string within the prompt.
judge_prompt = f"""You are an expert judge tasked with evaluating whether an agent's response to a question is correct.

**Question**:
<question>
{question}
</question>

**Ground Truth Answer**:
<correct_answer>
{correct_answer}
</correct_answer>

**Agent's Response**:
<response>
{final_answer}
</response>

Your task is to determine if the agent's response is correct based on the ground truth answer. Be strict and precise in your judgment.

**Evaluation Criteria**:
1. Extract the final answer from the agent's response
2. Compare it with the ground truth answer
3. The agent's answer is correct ONLY if it is semantically equivalent to the ground truth
4. Allow for minor variations in phrasing, but the core information must match exactly
5. For numerical answers, allow small rounding differences (within 1% or 0.1 units)
6. If the agent's response contains additional correct information beyond the ground truth, it can still be marked as correct
7. If the agent's response is ambiguous, contradictory, or contains incorrect information, mark it as incorrect
8. If the agent did not provide a clear final answer, mark it as incorrect

**Output Format**: You MUST respond with a valid JSON object in the following format (no additional text):

{{
  "extracted_answer": "The exact answer extracted from the agent's response, or null if no answer was provided",
  "ground_truth": {json.dumps(correct_answer)},
  "reasoning": "Brief explanation of why the extracted answer is correct or incorrect compared to the ground truth. If no answer was provided in the agent's response, mark it as incorrect.",
  "is_correct": true or false
}}

Respond ONLY with the JSON object, no additional text before or after."""
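The failure mode the review describes can be shown with a short standalone demo. The `correct_answer` value below is made up for illustration; note that the fix also requires `import json` at the top of the workflow module if it is not already there:

```python
import json

# A ground-truth answer containing a double quote, as might plausibly
# appear in the dataset (the value itself is invented for this demo).
correct_answer = 'The song "Imagine" (1971)'

# Naive f-string embedding, as in the original code, yields invalid JSON:
naive = f'{{"ground_truth": "{correct_answer}"}}'
naive_broke = False
try:
    json.loads(naive)
except json.JSONDecodeError:
    naive_broke = True  # inner quotes terminate the string early

# json.dumps() escapes the quotes, so the template stays parseable:
safe = f'{{"ground_truth": {json.dumps(correct_answer)}}}'
parsed = json.loads(safe)
```

Here `naive_broke` ends up `True` while `parsed["ground_truth"]` round-trips the original value, which is exactly why the suggestion serializes `correct_answer` with `json.dumps()` instead of interpolating it inside quotes.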
trinity/common/workflows/envs/browse_comp_plus/bcp_simple_react_workflow.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description
As the title says.
Checklist
Please check the following items before the code is ready for review.