REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

🏆 Hosted Tasks & Live Leaderboard (realevals.xyz)

Build, evaluate, and benchmark AI agents on deterministic, high-fidelity web replicas of real-world apps.

REAL is a toolkit for building and evaluating browser-based AI agents in realistic, reproducible environments.

It powers realevals.xyz — a public leaderboard and evaluation platform for agents navigating complex web apps, including replicas of Amazon, DoorDash, Airbnb, and more.

Train and benchmark agents with robust, standardized tasks
Use plug-and-play LLMs or your own agent logic
Evaluate capabilities via deterministic simulations, ensuring scientific reproducibility

TL;DR: Go from “idea” to “benchmarked agent” in under a minute.

🛠️ Installation

# Install the SDK
git clone https://github.com/agi-inc/REAL.git

cd REAL

pip install -e ./

# Install Playwright browser dependencies
playwright install --force

# Set your LLM API key (for evaluation)
export OPENAI_API_KEY="your-api-key"

✅ Supports OpenAI, Anthropic, OpenRouter, and custom models.
On Apple Silicon, run brew install --cask playwright first.

⏱️ 60-second Quick Start

Minimal agent benchmarking example:

from agisdk import REAL

harness = REAL.harness(
    model="gpt-4o",                 # Any LLM tag or custom agent
    task_type="omnizon",            # Amazon-like store
    headless=False                  # Watch it operate!
)

print(harness.run())

More agent examples are in the example folder.

🔥 Features

Fully deterministic replicas of top real-world web apps (Amazon, Uber, Gmail, Airbnb, etc.)
Robust agent API: observations, actions, memory, error handling
Leaderboard integration (realevals.xyz)
Plug in your own agents or models
Supports multiple providers and custom models
Parallelized evaluation

Running Custom Agents

See example/README.md and sample agents:

example/starter.py — basic agent setup
example/custom.py — custom agent logic
example/nova.py — browser-based custom agent (e.g. Amazon NovaAct)
example/hackable.py — highly configurable agent shell

🌐 Available Tasks

The task suite covers realistic user flows in modern web apps such as:

App Replica	Task Prefix	Example Use Case
🛒 Amazon	`webclones.omnizon-*`	Buy a laptop, find a gift
🍔 DoorDash	`webclones.dashdish-*`	Order dinner
✈️ United	`webclones.fly-unified-*`	Book a flight
🏡 Airbnb	`webclones.staynb-*`	Reserve accommodation
📅 Google Calendar	`webclones.gocalendar-*`	Schedule a meeting
📬 Gmail	`webclones.gomail-*`	Compose an email
🍽️ OpenTable	`webclones.opendining-*`	Book a restaurant
👔 LinkedIn	`webclones.networkin-*`	Accept a connection
🚗 Uber	`webclones.udriver-*`	Book a ride
💼 UpWork	`webclones.topwork-*`	Find a freelance gig
🏠 Zillow	`webclones.zilloft-*`	Browse houses

All tasks use human-written goals to stress-test agent behavior.
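
The Harness Configuration section pairs task_name="webclones.omnizon-1" with task_type="omnizon" and task_id=1, which suggests that task names follow a webclones.<type>-<id> pattern. The sketch below converts between the two forms under that assumption. The helper names are hypothetical and are not part of the SDK.

```python
from typing import Tuple

PREFIX = "webclones."  # prefix taken from the task table above


def make_task_name(task_type: str, task_id: int) -> str:
    """Build a full task name, e.g. ("omnizon", 1) -> "webclones.omnizon-1"."""
    return f"{PREFIX}{task_type}-{task_id}"


def split_task_name(task_name: str) -> Tuple[str, int]:
    """Split a full task name into (task_type, task_id).

    Splits on the last hyphen so that types containing a hyphen,
    such as "fly-unified", stay intact.
    """
    if not task_name.startswith(PREFIX):
        raise ValueError(f"expected a name starting with {PREFIX!r}: {task_name!r}")
    task_type, _, raw_id = task_name[len(PREFIX):].rpartition("-")
    if not task_type or not raw_id.isdigit():
        raise ValueError(f"expected '<type>-<id>' after the prefix: {task_name!r}")
    return task_type, int(raw_id)


print(make_task_name("omnizon", 1))
print(split_task_name("webclones.fly-unified-3"))
```

Splitting on the last hyphen, rather than the first, is what keeps multi-word task types usable.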

🔑 API Keys

For Anthropic:

export ANTHROPIC_API_KEY="your-anthropic-api-key"

Other providers are supported as well.
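
A missing key usually shows up only once the first model call fails, so it can help to check the environment before starting a run. This is a minimal sketch using only the two variable names shown in this README. The function name is hypothetical, and which keys a given run needs depends on the model you choose.

```python
import os
from typing import List, Mapping


def missing_keys(required: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Variable names as documented above; adjust the list to the provider you use.
missing = missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
if missing:
    print("Not set:", ", ".join(missing))
```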

👁️ Observation Structure

Agents receive structured observations, including:

{
  'chat_messages': [...],
  'goal': "...",
  'goal_object': [...],
  'open_pages_urls': [...],
  'active_page_index': 0,
  'url': "...",
  'screenshot': np.array(...),
  'dom_object': {...},
  'axtree_object': {...},
  'extra_element_properties': {...},
  'focused_element_bid': "...",
  'last_action': "...",
  'last_action_error': "...",
  'elapsed_time': 0.0,
  'browser': {...}
}
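
To illustrate how an agent might consume this structure, the sketch below condenses an observation into a short text summary suitable for a prompt. It relies only on the keys listed above. The meaning of each field is inferred from its name, and summarize_observation is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict


def summarize_observation(obs: Dict[str, Any]) -> str:
    """Condense an observation dict into a few human-readable lines."""
    lines = [
        f"goal: {obs.get('goal', '')}",
        f"url: {obs.get('url', '')}",
    ]
    urls = obs.get("open_pages_urls", [])
    lines.append(f"tabs: {len(urls)} open, active index {obs.get('active_page_index', 0)}")
    # Only mention the previous action and its error when they are non-empty.
    if obs.get("last_action"):
        lines.append(f"last action: {obs['last_action']}")
    if obs.get("last_action_error"):
        lines.append(f"last action error: {obs['last_action_error']}")
    return "\n".join(lines)


# Hand-written stand-in for a real observation (values are made up).
example_obs = {
    "goal": "Buy a laptop",
    "url": "https://example.com/cart",
    "open_pages_urls": ["https://example.com/cart"],
    "active_page_index": 0,
    "last_action": "click('42')",
    "last_action_error": "",
}
print(summarize_observation(example_obs))
```

Large fields such as screenshot, dom_object, and axtree_object are left out here on purpose, since they typically need their own encoding step before being passed to a model.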

🎯 Actions

Agents specify actions as function-call strings:

"goto('https://www.google.com')"
"go_back()"
"go_forward()"
"click('element_id')"
"fill('input_id', 'your text')"
"press('Enter')"
"send_msg_to_user('I found the answer: $42.99')"
"report_infeasible('The requested item is out of stock')"
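
Because actions are plain strings, quoting mistakes are easy to make when text contains apostrophes. The sketch below builds and parses action strings with the standard library, assuming the strings follow Python call syntax as in the examples above. The function names and the ALLOWED set are illustrative and cover only the actions listed in this README. Whether the harness accepts double-quoted arguments is not confirmed here.

```python
import ast
from typing import Any, List, Tuple

# Action names taken from the examples above; the real action set may be larger.
ALLOWED = {
    "goto", "go_back", "go_forward", "click", "fill", "press",
    "send_msg_to_user", "report_infeasible",
}


def format_action(name: str, *args: Any) -> str:
    """Build an action string, letting repr() handle quoting and escaping."""
    if name not in ALLOWED:
        raise ValueError(f"unknown action: {name!r}")
    return f"{name}({', '.join(repr(a) for a in args)})"


def parse_action(action: str) -> Tuple[str, List[Any]]:
    """Parse an action string into (name, positional args) without executing it."""
    node = ast.parse(action, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    if node.func.id not in ALLOWED:
        raise ValueError(f"unknown action: {node.func.id!r}")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    return node.func.id, [ast.literal_eval(arg) for arg in node.args]


print(format_action("fill", "input_id", "your text"))
print(parse_action("press('Enter')"))
```

Parsing with ast rather than eval means a malformed or unexpected model output raises an error instead of running code.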

⚙️ Harness Configuration

Available arguments for REAL.harness:

REAL.harness(
    model="gpt-4o",                   # or other model/provider
    agentargs=MyAgentArgs(),          # custom agent config
    task_name="webclones.omnizon-1",  # specific task (optional)
    task_type="omnizon",              # task category
    task_id=1,
    headless=False,                   # False shows the browser window
    max_steps=25,
    browser_dimensions=(1280, 720),
    use_html=False,
    use_axtree=True,
    use_screenshot=True,
    leaderboard=False,
    run_id="my_unique_id",
    parallel=False,
    num_workers=4,
    use_cache=True,
    cache_only=False,
    force_refresh=False,
    results_dir="./results"
)
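
With this many keyword arguments, a misspelled name is an easy mistake. One option is to check argument names against the list above before calling the harness, as sketched here. DOCUMENTED_ARGS simply mirrors the names in this section and harness_kwargs is a hypothetical helper. Neither comes from the SDK, and the SDK may accept arguments not listed here.

```python
from typing import Any, Dict

# Argument names copied from the listing above.
DOCUMENTED_ARGS = {
    "model", "agentargs", "task_name", "task_type", "task_id", "headless",
    "max_steps", "browser_dimensions", "use_html", "use_axtree",
    "use_screenshot", "leaderboard", "run_id", "parallel", "num_workers",
    "use_cache", "cache_only", "force_refresh", "results_dir",
}


def harness_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return the overrides as a dict, rejecting names not documented above."""
    unknown = sorted(set(overrides) - DOCUMENTED_ARGS)
    if unknown:
        raise TypeError(f"undocumented harness argument(s): {', '.join(unknown)}")
    return dict(overrides)


# Example: a parallel run over the omnizon tasks, written to ./results.
kwargs = harness_kwargs(
    model="gpt-4o",
    task_type="omnizon",
    headless=True,
    parallel=True,
    num_workers=4,
    results_dir="./results",
)
print(kwargs)

# With the SDK installed, these would be passed on as in the Quick Start:
#   from agisdk import REAL
#   print(REAL.harness(**kwargs).run())
```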

⚖️ Disclaimer

Fair Use Notice:
This repository and benchmark provide deterministic, non-commercial simulations of real-world websites for research, development, and evaluation of autonomous agents.
All website replicas are built for academic benchmarking and do not contain proprietary code, content, or branding of the original services.
All trademarks and trade names belong to their respective owners. If you are a rights holder and have concerns, please contact us.

The code and evaluation suite in this repo will remain frozen.
