An LLM-based Browser Agent that automatically executes web tasks from natural-language user instructions.
This project is not about building a “perfect Browser Agent.” Instead, it is meant to validate a research hypothesis:
In the Browser Agent setting, can multi-agent collaboration plus input-compression strategies match or even surpass the efficiency of the current open-source SOTA solution, Browser Use?
Current conclusion: this direction is feasible, and some experimental metrics already show strong potential. That said, Browser Use still has clear advantages in model training and engineering maturity.
- Natural-language-driven automation: provide a task description, and the agent explores pages and completes the task automatically.
- Multi-agent collaborative scheduling: a task-scheduling-based collaboration mechanism where models of different sizes cooperate in parallel/serial modes to balance execution efficiency and decision accuracy.
- Hierarchical compression mechanism: a layered compression and pruning pipeline for redundant Web GUI input tokens, reducing token usage by roughly one order of magnitude.
- Observable and traceable execution: automatic saving of logs, process screenshots, structured outputs (JSON), and execution reports (Markdown + Gantt chart).
Evaluation model setup:
```
vlm_primary_service   = qwen3.5-397b-a17b
llm_primary_service   = qwen3.5-397b-a17b
vlm_secondary_service = qwen3-vl-2b-instruct
llm_secondary_service = qwen3-4b
```
On the Online-Mind2Web benchmark, the project achieved:
| Difficulty | Success Rate | Evaluation Coverage |
|---|---|---|
| Easy | 62/78 = 79.49% | 78/80 = 97.50% |
| Medium | 105/142 = 73.94% | 142/143 = 99.30% |
| Hard | 53/77 = 68.83% | 77/77 = 100.00% |
| Total | 220/297 = 74.07% | 297/300 = 99.00% |
Note: A small number of unfinished samples were mainly caused by environment factors (e.g., invalid web pages or Playwright runtime issues), not because the tasks themselves were unsolvable.
Token consumption per task (including both successful and failed runs):
| Difficulty | Primary Model Tokens (in / out) | Secondary Model Tokens (in / out) |
|---|---|---|
| Easy | 32806.75 / 870.71 | 53607.44 / 1599.94 |
| Medium | 61208.80 / 1624.87 | 99963.57 / 2883.44 |
| Hard | 82921.47 / 2123.79 | 136125.65 / 4176.77 |
| Overall | 59296.13 / 1554.09 | 97028.27 / 2877.38 |
Average latency per interaction is 11.17s. About 7.78s comes from LLM response time; the remainder is mainly spent on post-action page transitions and network loading.
The system has 6 agents with distinct responsibilities:
- Planning: understand current state and produce the next target.
- Act: execute browser actions (click, type, scroll, navigate, etc.).
- Observation: judge outcomes based on page changes before/after actions.
- Extraction: extract task-relevant key information from web pages.
- Feedback: assess and report task completion progress.
- Refinement: compress historical task context.
Core loop: Planning -> Act -> Observation (ReAct loop)
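The core loop can be sketched as a toy Python program. Everything below is an illustrative stand-in, not the project's actual API: the real agents call LLMs and drive a browser (see `agent/agent.py`), while here each agent is a stub so the control flow is visible.

```python
from dataclasses import dataclass

# Toy sketch of the Planning -> Act -> Observation (ReAct) loop.
# All functions are illustrative placeholders for the six LLM-backed agents.

@dataclass
class Feedback:
    done: bool

def planning_agent(task, state, history):
    return f"sub-goal {len(history) + 1} toward: {task}"

def act_agent(plan, state):
    return {"type": "click", "target": plan}  # concrete browser action

def execute(action):
    return {"ok": True}  # would drive Playwright in the real system

def observation_agent(state, result):
    return "effective" if result["ok"] else "no-op"

def feedback_agent(task, step):
    return Feedback(done=step >= 3)  # pretend the task finishes after 3 steps

def refinement_agent(history):
    return history[-2:]  # keep only recent steps (context compression)

def run_task(task, max_steps=10):
    history = []
    for step in range(1, max_steps + 1):
        state = "<compressed GUI representation>"
        plan = planning_agent(task, state, history)
        result = execute(act_agent(plan, state))
        history.append((plan, observation_agent(state, result)))
        history = refinement_agent(history)
        if feedback_agent(task, step).done:
            return "task finished"
    return None

print(run_task("find a tutorial video"))  # task finished
```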
In real web pages, the DOM tree is usually very large and contains many elements irrelevant to the current task. Feeding all of it to the model at once often causes:
- Higher TTFT (time to first token): longer input leads to slower startup.
- Worse reasoning quality: excess noise makes it easier for the model to drift away from the goal.
- Rule-based GUI filtering and compression
  - Filter GUI elements by visibility, interactivity, etc.
  - Compress the GUI representation while preserving key information as much as possible.
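A minimal sketch of what rule-based filtering and compression might look like. The rules and the element dictionary format here are assumptions for illustration; the project's actual heuristics live in `agent/dom.py` and may differ.

```python
# Assumed rule set: keep only visible elements that are interactive by tag
# or explicitly marked clickable, then render each as one short line.

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def filter_elements(elements):
    """Keep only visible, interactive elements."""
    return [
        e for e in elements
        if e.get("visible") and (e["tag"] in INTERACTIVE_TAGS or e.get("clickable"))
    ]

def compress(elements):
    """Render each element as one short line: index, tag, trimmed text."""
    return "\n".join(
        f"[{i}] <{e['tag']}> {e.get('text', '')[:40]}"
        for i, e in enumerate(filter_elements(elements))
    )

page = [
    {"tag": "div", "visible": True, "text": "decorative banner"},
    {"tag": "button", "visible": True, "text": "Search"},
    {"tag": "input", "visible": False, "text": ""},
    {"tag": "a", "visible": True, "text": "Tutorials", "clickable": True},
]
print(compress(page))
# [0] <button> Search
# [1] <a> Tutorials
```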
- Task-relevance filtering with a 2B small model
  - Before each interaction, a 2B model filters out GUI elements irrelevant to the current task.
  - The extra overhead is only about 0.6s.
  - On average, this saves about 4K input tokens per interaction for the large model and significantly reduces noise.
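One way to implement such a filter is to hand the small model a numbered element list and ask for the relevant indices. The prompt wording and the `call_small_model` callable are assumptions for this sketch (stubbed here with a lambda); the project wires this into its own LLM layer in `agent/llm.py`.

```python
# Sketch of task-relevance filtering with a small model.
# `call_small_model` is an assumed callable: prompt string -> reply string.

def relevance_prompt(task, numbered_elements):
    return (
        f"Task: {task}\n"
        f"GUI elements:\n{numbered_elements}\n"
        "Return the indices of elements relevant to the task, comma-separated."
    )

def filter_by_relevance(task, elements, call_small_model):
    listing = "\n".join(f"[{i}] {e}" for i, e in enumerate(elements))
    reply = call_small_model(relevance_prompt(task, listing))
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [e for i, e in enumerate(elements) if i in keep]

elements = ["<a> Login", "<input> search box", "<button> Search", "<a> Ads"]
# Lambda stands in for the 2B model, pretending it answered "1, 2":
picked = filter_by_relevance("find a tutorial video", elements, lambda prompt: "1, 2")
print(picked)  # ['<input> search box', '<button> Search']
```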
Filtering visualization:
In one Browser Agent interaction, the model often needs to do several things at once:
- Evaluate whether the previous action was effective
- Reason about the current task state
- Plan the next step
- Generate the action instruction
Packing all objectives into a single response is overly complex and can increase reasoning errors.
- Split by complexity into lighter subtasks
  - Large model (A17B) handles critical logic-reasoning subtasks (e.g., Planning).
  - Small model (2B) handles summarization-style subtasks (e.g., Observation).
- Keep only necessary context for each subtask
  - Some context overlap remains across subtasks.
  - But input/output tokens for the large model decrease substantially while effectiveness is maintained.
- Parallel scheduling for subtasks
  - Parallelizable subtasks run concurrently.
  - In some scenarios, downstream scheduling can be triggered as soon as key fields are available from upstream outputs.
  - This further reduces interaction latency.
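The scheduling pattern can be sketched with `asyncio`: independent subtasks run concurrently via `asyncio.gather`, while a dependent subtask awaits only the field it needs. The subtask names and delays below are illustrative, not the project's actual scheduler.

```python
import asyncio

# Sketch of parallel subtask scheduling. sleep() stands in for LLM calls.

async def observation():
    await asyncio.sleep(0.05)  # small-model call
    return "previous action was effective"

async def extraction():
    await asyncio.sleep(0.05)  # small-model call, independent of observation
    return "extracted: video title"

async def planning(obs):
    await asyncio.sleep(0.1)  # large-model call, depends on observation output
    return f"next step (given: {obs})"

async def one_interaction():
    # Observation and Extraction are independent -> run concurrently.
    obs, info = await asyncio.gather(observation(), extraction())
    # Planning needs the observation result -> scheduled as soon as it arrives.
    return await planning(obs), info

plan, info = asyncio.run(one_interaction())
print(plan)
```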
The parallelization effect can be seen directly in the generated Gantt chart (gantt.png).
Use uv to install Python dependencies:

```
uv sync
```

Activate the virtual environment:
```
# linux/macOS
source .venv/bin/activate
# Windows
.venv\Scripts\activate
```

Install the Playwright browsers:

```
playwright install
```

Copy and edit the configuration file:

```
cp env.example.json env.json
```

Fill in your model service settings in env.json:

- `vlm_primary_service` / `vlm_secondary_service`: vision-language model (VLM) config names.
- `llm_primary_service` / `llm_secondary_service`: language model (LLM) config names.
- `primary` denotes the larger model for key reasoning and decisions; `secondary` denotes the smaller model for summarization and auxiliary processing.
- Set each service's `api_key`, `base_url`, `model`, and `temperature`.
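For orientation, a filled-in env.json might look like the following. The exact nesting and field values here are assumptions based on the setting names above; match your file to the structure shipped in env.example.json.

```json
{
  "vlm_primary_service": {
    "api_key": "sk-...",
    "base_url": "https://your-provider.example/v1",
    "model": "qwen3.5-397b-a17b",
    "temperature": 0.0
  },
  "llm_secondary_service": {
    "api_key": "sk-...",
    "base_url": "https://your-provider.example/v1",
    "model": "qwen3-4b",
    "temperature": 0.0
  }
}
```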
```
python main.py \
  --out-dir output/test \
  --task "Find a tutorial video on Bilibili about deploying Qwen large models on a laptop" \
  --start-url "https://www.bilibili.com/"
```

`demo.py` provides a directly runnable example that is convenient for customization and debugging.
For each task run, files are generated under the directory specified by --out-dir, including:
- `log.log`: runtime logs
- `result.json`: structured execution result
- `report.md`: human-readable execution report
- `gantt.png`: timeline Gantt chart
- Stage screenshots: before/after actions, observation, planning, etc.
```
.
├── agent/
│   ├── agent.py         # Main execution loop: planning/act/observation/extraction/feedback
│   ├── action.py        # Browser action definitions and execution
│   ├── dom.py           # DOM parsing, clustering, and compression logic
│   ├── llm.py           # Multi-model invocation wrapper and token accounting
│   ├── config.py        # Configuration initialization
│   └── record.py        # Execution record schema
├── main.py              # CLI entry
├── demo.py              # Runnable example
├── env.example.json     # Configuration template
└── pyproject.toml       # Dependency configuration
```