Web-Bench Introduction
The new benchmark Web-Bench is designed to evaluate the coding capabilities of LLMs on real-world web projects 🌐. The paper, code, and dataset have been released on arXiv, GitHub, and HuggingFace, along with a public Leaderboard.
- Structure: Web-Bench consists of 50 independently developed real-world web projects 🌐, covering common business scenarios such as e-commerce, blogs, spreadsheets, drawing boards, and mini-games. These projects include core Web Standards (DOM, BOM, CSS, SVG, Canvas, ECMAScript, TypeScript, etc.) and popular Frameworks (React, Vue, Redux, Tailwind, Next.js, Prisma, etc.).
- Evaluation: Various LLMs are tested using our custom benchmark agent, and results are verified using Playwright E2E test cases. The highest Pass@1 (one-shot success rate) is only 25.1% (achieved by Claude-3-7-Sonnet), meaning the LLM can complete roughly the first 5 of the 20 tasks on average. The best Pass@2 (re-run with error context) reaches only 35.3%. (A sketch of how these metrics can be aggregated follows this list.)
- Complexity: From a frontend engineering perspective, these projects are of low to medium complexity. Clearly, there is still significant room for improvement 💪🏻 in LLM coding for real-world projects. The original motivation for this work was to combine frontend engineering experience to build a practical and meaningful benchmark for this subfield of LLM programming.
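As a minimal, hedged sketch (not the official evaluator; the aggregation below is an assumption based on the description above, where 20 sequentially dependent tasks are attempted in order and an unresolved failure ends the run), Pass@1 and Pass@2 could be computed like this:

```ts
// Minimal sketch of Pass@1 / Pass@2 aggregation -- NOT the official Web-Bench evaluator.
// Assumptions: each project has 20 sequentially dependent tasks; a task that still fails
// after the allowed attempts ends the run; attempt 2 re-runs a task with its error context.

type TaskOutcome = 'pass@1' | 'pass@2' | 'fail'; // first try, retry with error context, or never passed
type ProjectRun = TaskOutcome[];                  // outcomes in task order, ending at the first 'fail'

const TASKS_PER_PROJECT = 20;

function passRate(runs: ProjectRun[], k: 1 | 2): number {
  const totalTasks = runs.length * TASKS_PER_PROJECT;
  let solved = 0;
  for (const run of runs) {
    for (const outcome of run) {
      if (outcome === 'fail') break;              // later tasks depend on this one and are never reached
      if (outcome === 'pass@1' || k === 2) solved++;
    }
  }
  return solved / totalTasks;
}

// Example: a model that solves only the first 5 tasks of every project scores 5/20 = 25% Pass@1.
```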
In the second half of 2024, after extensively reviewing papers on LLM code benchmarks, we identified several common issues and opportunities:
Aspect | Problem with Existing Benchmarks | Web-Bench Approach |
---|---|---|
Team | In most leading benchmarks, the majority of researchers are professional academics or graduate students, with limited hands-on experience in real-world industry projects. This may explain why nearly all benchmarks are built from competition problems or existing open-source projects — partly due to the availability of ready-made data. | In contrast, we are a team of frontend developers with practical engineering experience, giving us a relative advantage in constructing project-level benchmarks, particularly for classic browser–server web projects. |
Benchmark Structure | Among existing benchmarks, SWE-Bench is closest to our goal. It sources data from GitHub pull requests, and the dataset consists of a collection of discrete tasks. | By manually constructing projects, we can focus more explicitly on capabilities required in real-world products. The evaluation is designed with progressive difficulty, similar to a level-based game experience. At the same time, we aim to cover the standards and frameworks commonly used in web development, enabling a more comprehensive assessment of LLM web coding capabilities. |
Benchmark Saturation | Earlier benchmarks (before 2022) are gradually becoming saturated, which weakens their ability to evaluate and guide LLM development. For example, HumanEval has reached a Pass@1 of 99.4%, and MBPP is at 94.2%. | On our benchmark agent, the current SOTA (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, which is significantly lower than SWE-Bench’s 65.4% (Verified) and 33.8% (Full), leaving ample headroom for future progress. Benchmarks built around software engineering tasks with moderate complexity can better distinguish between models’ capabilities. |
The application of large language models (LLMs) to coding is evolving rapidly: from code assistants, to autonomous coding agents, to generating complete projects from natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLM development. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5 to 10 years of experience, and their scale and complexity make each one a significant challenge; on average, a single project takes a senior engineer 4 to 8 hours to complete. On our benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower than SWE-Bench's Verified (65.4%) and Full (33.8%) scores, indicating that Web-Bench is harder and leaves more headroom. Finally, we discuss that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and that LLMs require optimization tailored to them.
The Web-Bench dataset contains 50 projects, each consisting of 20 tasks with sequential dependencies.
- Dataset: a collection of projects. Due to the evaluation cost, we plan to provide a Lite version in the future.
- Project: a single item in the dataset.
- Task: includes a description and several end-to-end (E2E) test cases. On average, there are 3.6 test cases per task and 72.4 per project.
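As a rough illustration (the type and field names below are hypothetical, not the actual dataset schema), the structure can be sketched as:

```ts
// Hypothetical types sketching the dataset structure; the real schema may differ.
interface Task {
  id: string;                // e.g. "task-1" ... "task-20"
  description: string;       // natural-language requirement given to the LLM
  e2eTests: string[];        // Playwright spec files verifying this task (~3.6 cases per task)
}

interface Project {
  name: string;              // e.g. "calculator"
  technologies: string[];    // Web Standards and/or Frameworks the project covers
  tasks: Task[];             // 20 tasks with sequential dependencies (~72.4 cases per project)
}

type Dataset = Project[];    // 50 stable projects; a Lite subset is planned
```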
Task-2 depends on the execution result (the `main` element) of Task-1:

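As a hedged illustration (the task wording, selectors, and test code below are hypothetical, not taken from the dataset), Task-1 might ask for a `main` element and Task-2 might render content inside it, so Task-2's checks can only pass once Task-1's result exists:

```ts
// Hypothetical Playwright E2E checks illustrating the sequential dependency between tasks.
import { test, expect } from '@playwright/test';

// Task-1 (illustrative): "Render an empty <main> element on the page."
test('task-1: main element exists', async ({ page }) => {
  await page.goto('/');
  await expect(page.locator('main')).toBeAttached();
});

// Task-2 (illustrative): "Add a header row inside the main element."
// The selector is scoped to <main>, so this test depends on Task-1's execution result.
test('task-2: header row rendered inside main', async ({ page }) => {
  await page.goto('/');
  await expect(page.locator('main .header-row')).toBeVisible();
});
```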
Use the Web-Bench evaluation tool to calibrate a project, that is, to check the project for design issues:
- Task issues, such as ambiguous descriptions, loopholes, or obvious conflicts with test cases.
- Test-case issues, such as checks that are not rigorous enough or that are too strict (sketched below).
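For instance, a check can be too loose (it passes even when the feature is broken) or too strict (it rejects correct but slightly different implementations). The hypothetical Playwright assertions below sketch both extremes and a calibrated middle ground; the selectors and values are illustrative only:

```ts
// Hypothetical test-case rigor issues -- not taken from the Web-Bench dataset.
import { test, expect } from '@playwright/test';

test('too loose: passes even if the button does nothing', async ({ page }) => {
  await page.goto('/');
  await expect(page.locator('#add-btn')).toBeVisible(); // checks existence only, not behavior
});

test('too strict: couples the check to an exact pixel width', async ({ page }) => {
  await page.goto('/');
  const box = await page.locator('#add-btn').boundingBox();
  expect(box?.width).toBe(120); // any reasonable styling variation makes this fail
});

test('calibrated: checks the behavior the task actually asks for', async ({ page }) => {
  await page.goto('/');
  await page.locator('#add-btn').click();
  await expect(page.locator('#result')).toHaveText('3'); // verifies the feature itself
});
```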
Calibrated projects are "stable projects", and 50 such stable projects have been open-sourced to date.
Below are canonical solutions (human-written code) for some of these stable projects. From the videos, it is also evident that these projects are roughly at the scale of individual modules within larger human-developed systems. Nonetheless, even at this level, they remain highly challenging for current LLMs. As LLM capabilities continue to improve, Web-Bench should also evolve accordingly.
(Demo videos of canonical solutions for several stable projects.)

Web-Agent is an instance of the agent described in "Evaluator" Step 3. It is the interaction module between Web-Bench and the LLM: its main responsibilities are to build prompts, call the LLM API, and extract files from the LLM response.
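A heavily simplified sketch of that loop is shown below. It is not the real Web-Agent code: the prompt wording, the OpenAI-compatible endpoint, and the "file path followed by a fenced code block" response convention are all assumptions made for illustration.

```ts
// Simplified agent-loop sketch; NOT the actual Web-Agent implementation.
// The endpoint, prompt format, and response convention are assumptions.
interface LLMConfig { apiUrl: string; apiKey: string; model: string; }

// 1) Build a prompt from the task description and the project's current files.
function buildPrompt(taskDescription: string, files: Record<string, string>): string {
  const context = Object.entries(files)
    .map(([path, code]) => `// ${path}\n${code}`)
    .join('\n\n');
  return [
    'You are implementing a web project task by task.',
    `Current files:\n${context}`,
    `Task:\n${taskDescription}`,
    'Return every changed file as a fenced code block preceded by its path.',
  ].join('\n\n');
}

// 2) Request the LLM API (assumes an OpenAI-compatible chat endpoint).
async function requestLLM(cfg: LLMConfig, prompt: string): Promise<string> {
  const res = await fetch(cfg.apiUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${cfg.apiKey}` },
    body: JSON.stringify({ model: cfg.model, messages: [{ role: 'user', content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// 3) Extract "path + fenced code block" pairs from the response and return them as files.
function extractFiles(response: string): Record<string, string> {
  const FENCE = '`'.repeat(3); // avoids writing a literal fence inside this example
  const pattern = new RegExp(`^(\\S+\\.\\w+)\\s*\\n${FENCE}[a-z]*\\n([\\s\\S]*?)${FENCE}`, 'gm');
  const files: Record<string, string> = {};
  for (const match of response.matchAll(pattern)) {
    files[match[1]] = match[2];
  }
  return files;
}
```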

Take the calculator project as an example:

After completing all 20 tasks, the human-written code is as follows:
calculator-final.mp4

eval-calculator-3x-720.mp4
As shown in the following figure, HumanEval and MBPP are essentially saturated, and APPS and EvalPlus are approaching saturation. The SOTA on Web-Bench is 25.1%, lower than on the SWE-bench Full and Verified sets, leaving more room to differentiate models.

'claude-3-7-sonnet-20250219-thinking' achieves the highest Pass@1 and Pass@2 (using Best-of-5 sampling, as used throughout).
- From the Pass@1 distribution, Claude-series models exhibit the highest performance, followed by GPT, Doubao, DeepSeek, LLAMA, Gemini, Qwen, and others.
- From the Pass@2 distribution, Claude-series models exhibit the highest performance, followed by Doubao, DeepSeek, GPT, Gemini, LLAMA, Mistral, and others.


The Pass@2 distribution across LLM series indicates that newer or larger models generally perform better on Web-Bench, which is consistent with the commonly observed scaling law.


Regardless of whether Pass@1 or Pass@2 is used, closed-source models (shown in red) also appear to have an advantage.

For a more in-depth analysis, such as the working details of the Evaluator and the benchmark Agent, or a discussion of future directions for coding benchmarks, please refer to the full paper: https://arxiv.org/abs/2505.07473
- Evaluate LLM or Agent
- Download Dataset
- Browse the Leaderboard and submit evaluation results
- Explore more Web Standards and Frameworks
- Explore more business scenarios
- Explore more software engineering domains, e.g. backend, BI
- Improve Project Construction Efficiency