📃 Paper • 🏆 Leaderboard • 🤗 Data (Coming Soon) • 🔤 English | 中文
Figure 1. Motivation for the WebRetriever benchmark. WebRetriever addresses key limitations of prior work along three dimensions: dataset scale and diversity, automated evaluation reliability, and deployment-oriented evaluation protocols.
A large-scale, comprehensive benchmark for realistic web agent evaluation:
We curate 1,500 tasks across 800 real websites, spanning diverse domains and user intents. Compared with prior benchmarks, WebRetriever offers substantially greater scale, diversity, and coverage, enabling more comprehensive and representative evaluation of web agents in realistic online environments.
A general and high-precision automated evaluation method:
We propose NavEval, an automated evaluation method that reaches approximately 90% agreement with human judgments in large-scale experiments, enabling practical and reliable assessment of web agent performance at scale and in real time.
Comprehensive evaluation framework:
We propose three complementary evaluation protocols to systematically assess web agents, explicitly disentangling navigation success from answer correctness and characterizing behavioral reliability under injected operational knowledge, thereby providing diagnostic signals missing from prior benchmarks.
Table 1. Comparison between WebRetriever and related benchmarks. Intent-Type: task intent type (G: general, P: professional, G&P: both); Setting: the evaluation environment configuration; Online: whether evaluation against live, real-world websites is supported; Interactive: whether the environment allows interaction; Websites: number of websites; Eval-Tasks: number of evaluation tasks.
Figure 2. Workflow of NavEval. Compared with existing methods, NavEval applies rule-based filtering to extract fine-grained intermediate signals, which an LLM then reasons over jointly with the task description, action trajectory, and final screenshot to determine task success, enabling robust evaluation with higher human agreement.
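Conceptually, NavEval is a two-stage judge: rule-based filtering first distills intermediate signals from the raw trajectory, and an LLM then reasons over those signals together with the task, the actions, and the final screenshot. The Python sketch below illustrates that flow only; the `Trajectory` fields, the specific signal rules, the prompt wording, and the `llm_judge` callable are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str                  # natural-language task description
    actions: list[str]         # serialized action trajectory (click, type, ...)
    urls: list[str]            # URL observed after each step
    final_screenshot: bytes    # screenshot of the final page

def extract_signals(traj: Trajectory) -> dict:
    """Rule-based filtering: derive fine-grained intermediate signals
    from the trajectory before any LLM reasoning (assumed rules)."""
    domains = {u.split("/")[2] for u in traj.urls if "://" in u}
    return {
        "final_url": traj.urls[-1] if traj.urls else None,
        "num_steps": len(traj.actions),
        "distinct_domains": sorted(domains),
        "repeated_actions": len(traj.actions) != len(set(traj.actions)),
    }

def naveval_judge(traj: Trajectory, llm_judge) -> bool:
    """Joint reasoning: the LLM judge sees the task, the action trajectory,
    the extracted signals, and the final screenshot, then returns a verdict."""
    signals = extract_signals(traj)
    prompt = (
        f"Task: {traj.task}\n"
        f"Actions: {traj.actions}\n"
        f"Intermediate signals: {signals}\n"
        "Based on the attached final screenshot, did the agent complete the task? "
        "Answer 'success' or 'failure' with a one-sentence justification."
    )
    # llm_judge is any multimodal LLM client exposed as a callable (placeholder).
    verdict = llm_judge(prompt, image=traj.final_screenshot)
    return verdict.strip().lower().startswith("success")
```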
Figure 3. Workflow of the semi-automated pipeline for constructing operational documentation in Protocol II. The process integrates automated exploration, evaluation, manual refinement, and LLM-based generation to produce high-quality operational documentation.
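The four stages in Figure 3 compose into a single pass per website. The sketch below wires them together; each stage is injected as a callable (`explore`, `is_valid`, `refine`, `generate_doc`) because these names are placeholders for the pipeline stages named in the caption, not the actual tooling.

```python
from typing import Callable, Iterable

def build_operational_doc(
    website: str,
    seed_tasks: Iterable[str],
    explore: Callable[[str, str], dict],        # automated exploration -> raw trace
    is_valid: Callable[[dict], bool],           # automated evaluation of each trace
    refine: Callable[[dict], dict],             # manual refinement by an annotator
    generate_doc: Callable[[str, list], str],   # LLM-based documentation generation
) -> str:
    """Chain the four stages for one website: explore with seed tasks, keep only
    traces that pass evaluation, refine them manually, then let an LLM turn the
    refined traces into operational documentation."""
    traces = [explore(website, task) for task in seed_tasks]
    validated = [trace for trace in traces if is_valid(trace)]
    refined = [refine(trace) for trace in validated]
    return generate_doc(website, refined)
```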
We design three complementary evaluation protocols for comprehensive assessment:
Protocol I evaluates basic navigation ability to reach target pages.
Protocol II assesses navigation performance when provided with operational knowledge.
Protocol III measures end-to-end task completion by jointly evaluating navigation and information extraction, avoiding the limitation of equating page arrival with task success (see the scoring sketch below).
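To make the protocol split concrete, the sketch below shows how a single trial could be scored under each protocol. The `Protocol` enum, `agent`, `judge`, and `answer_checker` names are assumptions for illustration; only the distinction between the three protocols mirrors the list above.

```python
from enum import Enum

class Protocol(Enum):
    I = "navigation"            # reach the target page
    II = "navigation_with_doc"  # reach the target page, operational documentation injected
    III = "end_to_end"          # navigation plus correct extracted information

def run_and_score(agent, task, protocol, judge, answer_checker=None, operational_doc=None):
    """Score one task attempt under one protocol; returns True on success.
    answer_checker is required for Protocol III."""
    context = {"instruction": task.description}
    if protocol is Protocol.II and operational_doc is not None:
        context["operational_doc"] = operational_doc   # injected knowledge (Protocol II)
    trajectory = agent.run(**context)
    navigated = judge(task, trajectory)                # NavEval-style navigation check
    if protocol is Protocol.III:
        # Page arrival alone is not enough: the extracted answer must also be correct.
        return navigated and answer_checker(task, trajectory.extracted_answer)
    return navigated
```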
Table 2. Task Success Rate (SR) of web agent trajectories on WebRetriever across the three proposed evaluation protocols, assessed using NavEval and human annotation, respectively. All values are reported as percentages (%).
Table 3. Human Agreement Rate (AR) of web agent trajectories on WebRetriever across automated evaluation methods with different LLM-as-a-Judge models. Avg AR denotes the average human agreement rate. All values are reported as percentages (%).
Table 4. Human Agreement Rate (AR) of web agent trajectories on Online-Mind2Web across automated evaluation methods with different LLM-as-a-Judge models. Avg AR denotes the average human agreement rate. All values are reported as percentages (%).
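For reference, the sketch below computes the two metrics reported above as percentages over per-trajectory boolean labels: Success Rate (SR) over judged outcomes and Human Agreement Rate (AR) between an automated judge and human annotation. The aggregation used for Avg AR is an assumption here (mean AR across judge models).

```python
def success_rate(labels: list[bool]) -> float:
    """SR: percentage of trajectories judged successful."""
    return 100.0 * sum(labels) / len(labels)

def agreement_rate(auto_labels: list[bool], human_labels: list[bool]) -> float:
    """AR: percentage of trajectories where the automated judge matches the human verdict."""
    assert len(auto_labels) == len(human_labels) and human_labels
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return 100.0 * matches / len(human_labels)

def average_agreement_rate(per_judge: dict[str, list[bool]], human_labels: list[bool]) -> float:
    """Avg AR (assumed aggregation): mean AR across different LLM-as-a-Judge models."""
    rates = [agreement_rate(labels, human_labels) for labels in per_judge.values()]
    return sum(rates) / len(rates)
```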