> [!NOTE]
> Code for the paper *A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments*.
Environments built for people are increasingly operated by a new class of economic actors: LLM-powered software agents making decisions on our behalf. These decisions range from our purchases to travel plans to medical treatment selection. Current evaluations of these agents largely focus on task competence, but we argue for a deeper assessment: how these agents choose when faced with realistic decisions. We introduce ABxLab, a framework for systematically probing agentic choice through controlled manipulations of option attributes and persuasive cues. We apply this to a realistic web-based shopping environment, where we vary prices, ratings, and psychological nudges, all of which are factors long known to shape human choice.
- 📊 Configurable A/B testing with interventions and preferences for any website
- 🤖 Support for multiple LLM agents through various providers (e.g. LiteLLM)
- 🛒 E-commerce shopping task environment with realistic browsing scenarios
- 🔍 AgentXray visualization tool for debugging agent behavior
- ⚙️ Hydra-based configuration system for reproducible experiments
- Python 3.11 or 3.12
- Node.js and npm (for Playwright)
- R (optional, for statistical analysis)
```bash
conda create -n abxlab python=3.11
conda activate abxlab
```

Using pip:
```bash
pip install -r requirements.txt
cd ./agentlab && pip install -e . && cd ..
```

Or using uv (recommended for faster installs):
```bash
pip install uv
uv pip install -r requirements.txt
cd ./agentlab && uv pip install -e . && cd ..
```

Playwright is required for browser automation:
```bash
playwright install
```

A few scripts use DSPy, but it conflicts with hydra-ray-launcher, so install it separately if you need it:
```bash
pip install dspy==2.6.27
```

Create a `.env` file in the project root with the following configuration:
> [!IMPORTANT]
> Due to AgentLab and BrowserGym dependencies, you must set all these endpoints to avoid runtime errors. We don't host WebArena environments, but you can deploy them following these instructions.
```bash
# Base URL for web agent environments
BASE_WEB_AGENT_URL="<YOUR_SERVER_URL>"

# Primary endpoints (can point to different environments)
SHOPPING="${BASE_WEB_AGENT_URL}"
SHOPPING_ADMIN="${BASE_WEB_AGENT_URL}"
REDDIT="${BASE_WEB_AGENT_URL}"
GITLAB="${BASE_WEB_AGENT_URL}"
WIKIPEDIA="${BASE_WEB_AGENT_URL}"
MAP="${BASE_WEB_AGENT_URL}"
HOMEPAGE="${BASE_WEB_AGENT_URL}"

# Synced WA_-prefixed vars (required by BrowserGym)
WA_SHOPPING="${SHOPPING}"
WA_SHOPPING_ADMIN="${SHOPPING_ADMIN}"
WA_REDDIT="${REDDIT}"
WA_GITLAB="${GITLAB}"
WA_WIKIPEDIA="${WIKIPEDIA}"
WA_MAP="${MAP}"
WA_HOMEPAGE="${HOMEPAGE}"

# Results will be saved in this directory
AGENTLAB_EXP_ROOT="results"

# LLM API keys (add only the ones you plan to use)
OPENAI_API_KEY="<YOUR_OPENAI_KEY>"
ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_KEY>"
GEMINI_API_KEY="<YOUR_GEMINI_KEY>"
AWS_REGION_NAME="<AWS_REGION>"
AWS_ACCESS_KEY_ID="<AWS_ACCESS_KEY_ID>"
AWS_SECRET_ACCESS_KEY="<YOUR_AWS_KEY>"
```

> [!NOTE]
> Results are saved to `AGENTLAB_EXP_ROOT`, defined in the `.env` above.
> [!TIP]
> The results contain the raw data. You can adapt `scripts/collect_results.py`, which transforms results into a CSV file that is easier to analyze.
The main configuration file `conf/config.yaml` defines the `abxlab_url` used throughout the codebase. By default, it points to one of the variables defined in `.env` above, but you can replace it.
```yaml
env:
  abxlab_url: ${oc.env:WA_SHOPPING}
```

The easiest way to run ABxLab is with a configuration file like the example in `ABxLab/conf/task/test/basic.yaml`. This works out of the box if you set `BASE_WEB_AGENT_URL="https://www.amazon.com/"` in `.env`, and it's easy to adapt!
- `start_urls`: defines the URLs the agent will see. In this example, the homepage.
- `intent_template`: defines the goal of the task. In this example, searching for a "toy papaya".
- `choices`: defines all intervention functions. In this case, it includes a nudge below the product title.
- `eval`: defines the stopping condition. In this example, once a product is added to the cart.
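For orientation, here is a stripped-down sketch of such a config. The four top-level keys come from the list above, but the values and sub-fields are illustrative assumptions rather than the actual contents of `basic.yaml`:

```yaml
# Illustrative only; see conf/task/test/basic.yaml for the real config
start_urls:
  - "${oc.env:WA_SHOPPING}"           # page(s) the agent starts on, here the shop homepage
intent_template: Search for a toy papaya and add it to your cart.
choices: []                            # intervention functions; basic.yaml adds a nudge below the product title
eval:
  stop_condition: product_added_to_cart   # hypothetical name for the stopping condition
```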
```bash
python run.py task=test/basic
```

You can visualize the results with AgentXray.
There are other useful ways of running experiments. For example, you can also run any of the experiments in conf/ as:
```bash
python run.py +experiment-regular=exp10
```

For more elaborate experiments, you can generate all configurations programmatically. In the shopping environment, the script `scripts/generate_experiments.py` generates experiment configurations in `--exp-dir` from the data in `tasks/`:
```bash
python scripts/generate_experiments.py --match-price --match-review-count --products=tasks/product_pairs-matched-ratings.csv --exp-dir conf/experiment
```

Then, you can run all of these experiments with multirun:
```bash
# Define the range of experiment IDs you want to run (e.g., exp0 through expN)
N=100
# Brace expansion does not work with a variable, so expand via eval
EXPS=$(eval echo exp{0..$N} | tr ' ' ',')
python run.py --multirun "+experiment=${EXPS}"
```

> [!WARNING]
> Multirun can generate very large files because AgentLab prints out all uncommitted files in the directory. Consider including them in `.gitignore` to avoid these issues.
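For example, assuming you keep `AGENTLAB_EXP_ROOT="results"` as configured in `.env` above, an entry like the following keeps the experiment outputs out of version control:

```gitignore
# Experiment outputs (AGENTLAB_EXP_ROOT) stay out of git
results/
```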
ABxLab uses Hydra for configuration management. You can override any configuration parameter from the command line.
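For instance, several overrides can be combined in a single invocation; a minimal sketch, reusing the `env.abxlab_url` path from `conf/config.yaml` and the task and agent names shown elsewhere in this README:

```bash
# Pick the agent, the task, and the environment URL in one run
python run.py agent=gpt-5 task=test/basic env.abxlab_url="https://www.amazon.com/"
```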
Supported models and providers are in conf/agent/, which can be easily extended. We use LiteLLM by default, but you can find more details below.
```bash
# Use GPT 5
python run.py agent=gpt-5

# Use Claude 4.5 Sonnet
python run.py agent=claude-4-5-sonnet

# Use Gemini 2.5 Pro
python run.py agent=gemini-2-5-pro

# Use DeepSeek R1
python run.py agent=deepseek-r1
```

```
ABxLab/
├── abxlab/                  # Core ABxLab modules
│   ├── actions.py           # Custom agent action definitions
│   ├── browser.py           # Custom browser env to execute intervention functions
│   ├── evaluators.py        # Custom task evaluation logic
│   ├── task.py              # Custom task definitions
│   └── choices/             # Intervention functions for each environment
├── agentlab/                # Modified version of AgentLab
├── analysis/                # R scripts for statistical analysis
├── conf/                    # Hydra configuration files
│   ├── agent/               # Agent configurations (GPT, Claude, etc.)
│   ├── benchmark/           # Benchmark configurations
│   ├── task/                # Task configurations
│   └── config.yaml          # Main config file
├── scripts/                 # Scripts for generating experiments, collecting results, etc.
│   ├── generate_experiments.py
│   ├── collect_results.py
│   └── ...
├── tasks/                   # Data for generating experiments
│   ├── products.csv
│   ├── interventions.csv
│   └── ...
├── run.py                   # Main experiment runner
└── requirements.txt         # Python dependencies
```
You can create new tasks in `conf/task/` and use `shopping.yaml` or `test/basic.yaml` as inspiration. Most of this logic is inherited from WebArena, so we refer the reader there for details. We modify it with:

- `entrypoint: abxlab.task.ABxLabShopTask`: a custom class where we run logic that we always need for the shopping environment. Otherwise, you can use its parent, `entrypoint: abxlab.task.ABxLabTask`.
- `config.choices`: a placeholder, which you can copy and paste. Your configs (e.g. `conf/task/test`) should inherit this config, and you can replace `choices` with either an empty list (no interventions needed) or a list of functions following the details below.

We also define our own `config.eval`, which is a stopping condition.
ABxLab allows you to define a set of intervention functions in the configurations. If an agent visits a matching URL, then all functions get executed sequentially. Each function receives the HTML (by default) and a set of arguments defined in the configuration file. The field nudge can be used as an identifier to recognize during analysis. You can see an example here.
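As a rough illustration of that contract, here is a minimal sketch of what such a function could look like; the name, signature, and selector are assumptions for illustration, and the real functions live in `abxlab/choices/`:

```python
# Illustrative intervention function; not the repository's actual implementation.
from bs4 import BeautifulSoup


def add_nudge_below_title(html: str, text: str = "Bestseller!", **kwargs) -> str:
    """Insert a persuasive nudge right below the product title.

    Receives the page HTML plus arguments from the task config (e.g. the nudge
    text) and returns the modified HTML that the agent will observe.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")        # hypothetical selector for the product title
    if title is not None:
        nudge = soup.new_tag("span")
        nudge["class"] = "abxlab-nudge"  # marker that can double as the nudge identifier
        nudge.string = text
        title.insert_after(nudge)
    return str(soup)
```

In the task config, `choices` would then reference such a function by name, along with its arguments and a `nudge` identifier so the intervention can be recognized during analysis.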
The ABxLab benchmark can be used as is in most cases. It's worth noting that this is where we define the high-level actions available to agents, which we customized to remove unnecessary actions available in BrowserGym.
You can see the agent's default flags here. More details are in AgentLab, but this is where you decide whether to use thinking, memory, pruned HTML, accessibility trees, etc.
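As a hedged sketch of what such overrides can look like (the flag names below follow AgentLab's generic agent flags, but the exact YAML paths in this repository's `conf/agent/` may differ):

```yaml
# Illustrative flag overrides; check conf/agent/ and the AgentLab docs for the exact schema
flags:
  use_thinking: true     # emit an explicit reasoning step before acting
  use_memory: false      # no memory carried across steps
  obs:
    use_html: true       # observe pruned HTML
    use_ax_tree: false   # skip the accessibility tree
```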
We included support for LiteLLM and set it by default in all agent configurations in conf/agent/. However, there are other options available like OpenRouter that you can see here.
AgentXray is a Gradio-based visualization tool by AgentLab for debugging and analyzing agent behavior.
[Demo video: AgentXray.demo.mov]
Export the environment variable to specify the path for the results, and then launch AgentXray:

```bash
export AGENTLAB_EXP_ROOT=./results
agentlab-xray
```

Reach out to us! We have hundreds of GBs of data.
If you use ABxLab in your research, please cite the following paper:
```bibtex
@article{cherep2025framework,
  title={A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments},
  author={Manuel Cherep and Chengtian Ma and Abigail Xu and Maya Shaked and Pattie Maes and Nikhil Singh},
  year={2025},
  url={https://arxiv.org/abs/2509.25609},
}
```

Research reported in this publication was supported by an Amazon Research Award, Fall 2024. We also received funding from SK Telecom in partnership with the MIT Generative AI Impact Consortium (MGAIC). Experiments conducted in this paper were generously supported via API credits provided by OpenAI, Anthropic, and Google. MC is supported by a fellowship from "la Caixa" Foundation (ID 100010434) with code LCF/BQ/EU23/12010079.
This project builds on AgentLab and BrowserGym, for which we are thankful.