🏆 Hosted Tasks & Live Leaderboard (realevals.xyz)
Build, evaluate, and benchmark AI agents on deterministic, high-fidelity web replicas of real-world apps.
REAL is a toolkit for building and evaluating browser-based AI agents in realistic, reproducible environments.
It powers realevals.xyz — a public leaderboard and evaluation platform for agents navigating complex web apps, including replicas of Amazon, DoorDash, Airbnb, and more.
- Train and benchmark agents with robust, standardized tasks
- Use plug-and-play LLMs or your own agent logic
- Evaluate capabilities via deterministic simulations, ensuring scientific reproducibility
TL;DR: Go from “idea” to “benchmarked agent” in under a minute.
# Install the SDK
git clone https://github.com/agi-inc/REAL.git
cd REAL
pip install -e ./
# Install Playwright browser dependencies
playwright install --force
# Set your LLM API key (for evaluation)
export OPENAI_API_KEY="your-api-key"
✅ Supports OpenAI, Anthropic, OpenRouter, and custom models.
On Apple Silicon, run brew install --cask playwright first.
Minimal agent benchmarking example:
from agisdk import REAL
harness = REAL.harness(
    model="gpt-4o",        # Any LLM tag or custom agent
    task_type="omnizon",   # Amazon-like store
    headless=False         # Watch it operate!
)
print(harness.run())
More agent examples in the example folder.
- Fully deterministic replicas of top real-world web apps (Amazon, Uber, Gmail, Airbnb, etc.)
- Robust agent API: observations, actions, memory, error handling
- Leaderboard integration (realevals.xyz)
- Plug in your own agents or models
- Supports multiple providers and custom models
- Parallelized evaluation
See example/README.md and the sample agents:
- example/starter.py: basic agent setup
- example/custom.py: custom agent logic
- example/nova.py: browser-based custom agent (e.g. Amazon NovaAct)
- example/hackable.py: highly configurable agent shell
Task suite covers realistic user flows in modern web apps such as:
| App Replica | Task Prefix | Example Use Case |
|---|---|---|
| 🛒 Amazon | webclones.omnizon-* | Buy a laptop, find a gift |
| 🍔 DoorDash | webclones.dashdish-* | Order dinner |
|  | webclones.fly-unified-* | Book a flight |
| 🏡 Airbnb | webclones.staynb-* | Reserve accommodation |
| 📅 Google Calendar | webclones.gocalendar-* | Schedule a meeting |
| 📬 Gmail | webclones.gomail-* | Compose an email |
| 🍽️ OpenTable | webclones.opendining-* | Book a restaurant |
|  | webclones.networkin-* | Accept a connection |
| 🚗 Uber | webclones.udriver-* | Book a ride |
| 💼 UpWork | webclones.topwork-* | Find a freelance gig |
| 🏠 Zillow | webclones.zilloft-* | Browse houses |
All tasks use human-written goals to stress-test agent behavior.
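The task prefixes in the table, together with the webclones.omnizon-1 value shown for task_name in the harness arguments, suggest a webclones.<task_type>-<number> naming pattern. As a minimal sketch, assuming every task name has that shape, the hypothetical helper split_task_name below (not part of the SDK) recovers the task type and numeric id. It splits on the last hyphen because a task type such as fly-unified contains a hyphen itself.

```python
from typing import Tuple

PREFIX = "webclones."  # namespace shared by every task prefix in the table


def split_task_name(task_name: str) -> Tuple[str, int]:
    """Split 'webclones.<task_type>-<number>' into (task_type, number).

    Hypothetical helper, not an SDK function. It assumes the numeric id
    follows the last hyphen of the name.
    """
    if not task_name.startswith(PREFIX):
        raise ValueError(f"expected a name starting with {PREFIX!r}: {task_name!r}")
    task_type, sep, number = task_name[len(PREFIX):].rpartition("-")
    if not sep or not task_type or not number.isdecimal():
        raise ValueError(f"expected '<task_type>-<number>': {task_name!r}")
    return task_type, int(number)


print(split_task_name("webclones.omnizon-1"))
print(split_task_name("webclones.fly-unified-3"))
```

The returned pair lines up with the task_type and task_id arguments of REAL.harness, though whether the SDK derives them this way is an assumption.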
For Anthropic:
export ANTHROPIC_API_KEY="your-anthropic-api-key"
Other providers are supported as well.
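A run that reaches the model without a key set will fail late, so it can help to check the environment first. The sketch below is one way to do that; the helper missing_api_keys and the PROVIDER_KEYS mapping are hypothetical names, and only the two environment variables named in this README are covered.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Environment variable names taken from the setup instructions in this README.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_keys(
    providers: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the variables that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]


missing = missing_api_keys(["openai", "anthropic"])
if missing:
    print("Set these variables before running the harness:", ", ".join(missing))
```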
Agents receive structured observations, including:
{
    'chat_messages': [...],
    'goal': "...",
    'goal_object': [...],
    'open_pages_urls': [...],
    'active_page_index': 0,
    'url': "...",
    'screenshot': np.array(...),
    'dom_object': {...},
    'axtree_object': {...},
    'extra_element_properties': {...},
    'focused_element_bid': "...",
    'last_action': "...",
    'last_action_error': "...",
    'elapsed_time': 0.0,
    'browser': {...}
}
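To illustrate how an agent might consume these fields, here is a small sketch that condenses an observation into a text summary, for example as part of an LLM prompt. The function summarize_observation is a hypothetical helper, and the sample observation is hand-written with made-up values; only the key names come from the structure above.

```python
from typing import Any, Dict


def summarize_observation(obs: Dict[str, Any]) -> str:
    """Build a short text summary from a few observation fields."""
    lines = [
        f"Goal: {obs['goal']}",
        f"URL: {obs['url']}",
        f"Open tabs: {len(obs['open_pages_urls'])} "
        f"(active index {obs['active_page_index']})",
    ]
    # Report the previous step only when there is something to report.
    if obs.get("last_action"):
        lines.append(f"Last action: {obs['last_action']}")
    if obs.get("last_action_error"):
        lines.append(f"Last action error: {obs['last_action_error']}")
    return "\n".join(lines)


# Hand-written stand-in for a real observation; values are illustrative only.
example_obs = {
    "goal": "Buy a laptop",
    "url": "https://example.com/",
    "open_pages_urls": ["https://example.com/"],
    "active_page_index": 0,
    "last_action": "click('element_id')",
    "last_action_error": "",
}
print(summarize_observation(example_obs))
```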
Agents specify actions as function-call strings:
"goto('https://www.google.com')"
"go_back()"
"go_forward()"
"click('element_id')"
"fill('input_id', 'your text')"
"press('Enter')"
"send_msg_to_user('I found the answer: $42.99')"
"report_infeasible('The requested item is out of stock')"
Available arguments for REAL.harness:
REAL.harness(
    model="gpt-4o",                   # or other model/provider
    agentargs=MyAgentArgs(),          # custom agent config
    task_name="webclones.omnizon-1",  # specific task (optional)
    task_type="omnizon",              # task category
    task_id=1,
    headless=False,                   # GUI
    max_steps=25,
    browser_dimensions=(1280, 720),
    use_html=False,
    use_axtree=True,
    use_screenshot=True,
    leaderboard=False,
    run_id="my_unique_id",
    parallel=False,
    num_workers=4,
    use_cache=True,
    cache_only=False,
    force_refresh=False,
    results_dir="./results"
)
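As a rough sketch of how these arguments might be combined, the hypothetical helper sweep_kwargs below assembles keyword arguments for a headless run over one task category. It uses only argument names from the listing above; the chosen values, and the assumption that parallel=True together with num_workers controls concurrency, are illustrative rather than documented behavior.

```python
from typing import Any, Dict


def sweep_kwargs(
    model: str,
    task_type: str,
    num_workers: int = 4,
    results_dir: str = "./results",
) -> Dict[str, Any]:
    """Assemble keyword arguments for a headless run over one task category."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {
        "model": model,
        "task_type": task_type,
        "headless": True,             # run without a visible browser window
        "parallel": num_workers > 1,  # assumed to enable concurrent workers
        "num_workers": num_workers,
        "results_dir": results_dir,
    }


kwargs = sweep_kwargs("gpt-4o", "omnizon")
print(kwargs)
```

With the SDK installed, the dictionary could be passed along as REAL.harness(**kwargs), followed by harness.run() as in the minimal example.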
Fair Use Notice:
This repository and benchmark provide deterministic, non-commercial simulations of real-world websites for research, development, and evaluation of autonomous agents.
All website replicas are built for academic benchmarking and do not contain proprietary code, content, or branding of the original services.
All trademarks and trade names belong to their respective owners. If you are a rights holder and have concerns, please contact us.
Code and evaluation suite on this repo will remain frozen