Synthetic Data Agent - Kedro Project

A Kedro-based AI browser automation project that generates synthetic data through intelligent web interactions. This project combines AI agents with web browser automation to create realistic test data and perform automated tasks on websites.

🎯 Project Overview

The Synthetic Data Agent is designed to:

  • Automate web browser interactions using AI-powered decision making
  • Generate synthetic test data through real web workflows
  • Record and replay browser sessions for testing purposes
  • Provide a web-based dashboard for managing agent tasks and recordings
  • Support both AI-driven (LLM) and script-based automation modes

๐Ÿ—๏ธ Project Structure

synthetic-data-agent/
├── conf/                          # Configuration files
│   ├── base/
│   │   ├── catalog.yml           # Data catalog definitions
│   │   └── parameters.yml        # Default parameters
│   └── local/
│       ├── credentials.yml       # API credentials (not in git)
│       └── README.md             # Configuration instructions
├── data/                         # Data layers following Kedro conventions
│   ├── 01_raw/
│   │   └── test_scripts/         # Input test scripts (JSON format)
│   └── 08_reporting/
│       ├── recordings/           # Browser session recordings (.vbrec files)
│       └── metadata/             # Recording metadata (JSON files)
├── src/synthetic_data_agent/     # Source code
│   ├── __init__.py
│   ├── __main__.py              # CLI entry point
│   ├── pipelines/               # Kedro pipelines
│   │   └── data_generation/     # Main pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py      # Pipeline definition
│   │       ├── nodes.py         # Pipeline nodes
│   │       ├── browser_agent.py # AI browser agent implementation
│   │       ├── browser_analyzer.py # Session analysis
│   │       ├── browser_recorder.py # Session recording
│   │       └── browser_replayer.py # Session replay functionality
│   └── settings.py              # Kedro settings
├── templates/                   # HTML templates
│   ├── index.html              # Dashboard HTML
│   └── replayer.html           # Replay viewer HTML
├── main.py                     # FastAPI web server
├── pyproject.toml             # Project configuration
├── requirements.txt           # Python dependencies
├── .env                       # Environment variables (not in git)
├── .gitignore                 # Git ignore rules
└── README.md                  # This file

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • pip for package management
  • Chrome/Chromium browser (for Playwright)
  • Azure OpenAI API access (for AI-driven mode)

Installation

  1. Navigate to the project directory:

    cd /path/to/synthetic-data-agent
  2. Install the project in development mode:

    pip install -e .

    โš ๏ธ Important: This step is critical for proper imports to work.

  3. Install additional dependencies (optional):

    pip install -r requirements.txt

    ⚠️ Important: Skip this step if the project has no local requirements.txt file.

  4. Install Playwright browsers:

    playwright install
  5. Set up environment variables: Create a .env file in the project root:

    # Disable Kedro telemetry
    KEDRO_DISABLE_TELEMETRY=true
    DO_NOT_TRACK=1
    
    # Azure OpenAI credentials
    AZURE_OPENAI_API_KEY=your_actual_api_key_here
    AZURE_OPENAI_RESOURCE_NAME=your_azure_resource_name
    AZURE_OPENAI_DEPLOYMENT_NAME=your_model_deployment_name
    AZURE_OPENAI_API_VERSION=your_azure_api_version
  6. Configure API credentials: Create conf/local/credentials.yml:

    azure_openai:
      api_key: ${oc.env:AZURE_OPENAI_API_KEY}
      resource_name: ${oc.env:AZURE_OPENAI_RESOURCE_NAME}
      deployment_name: ${oc.env:AZURE_OPENAI_DEPLOYMENT_NAME}
      api_version: ${oc.env:AZURE_OPENAI_API_VERSION}
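Kedro resolves the ${oc.env:...} references at run time, so the .env values must be present in the process environment. One minimal way to guarantee that, assuming python-dotenv is installed (src/synthetic_data_agent/settings.py may already do something equivalent):

# settings.py -- load .env into the environment before Kedro's
# OmegaConfigLoader resolves ${oc.env:...} references in credentials.yml.
from pathlib import Path
from dotenv import load_dotenv

# Walk up from src/synthetic_data_agent/settings.py to the project root.
load_dotenv(Path(__file__).resolve().parents[2] / ".env")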

🎮 Usage

Option 1: Web Dashboard (Recommended)

Start the FastAPI web server:

python main.py

Then open http://localhost:8000 in your browser to access the dashboard where you can:

  • Submit new agent tasks with custom URLs and descriptions
  • View and download recordings with metadata
  • Replay browser sessions using the integrated viewer
  • Browse and manage test scripts
  • Monitor agent performance and analysis results

Option 2: Debug Mode

You can also run the web server in debug mode with auto-reload (Kedro must still be installed):

# FastAPI debug mode
uvicorn main:app --reload

Option 3: Direct Pipeline Execution

Run the Kedro pipeline directly using command line:

Use default parameters from conf/base/parameters.yml:

kedro run

Override parameters at runtime:

kedro run --params="agent_params.url=https://example.com,agent_params.mode=llm,agent_params.task=Find the 'More information...' link and click it then indicate you are done"

Note: kedro splits --params on commas, so keep the task value comma-free on the command line; longer task descriptions belong in conf/base/parameters.yml.

Run with custom configuration:

# First, update conf/base/parameters.yml with your desired task
# Then run:
kedro run

Note: kedro run executes the pipeline directly without starting a web server. It uses parameters from the configuration files and is ideal for:

  • Automated/batch processing
  • CI/CD integration
  • Command-line scripting
  • Testing with fixed parameters

For interactive development, use Option 1 (Web Dashboard) instead.
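If you need to trigger runs from Python rather than the CLI (for example from a scheduler or test harness), here is a minimal sketch using Kedro's session API, with parameter names taken from this README:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path.cwd()  # run from the project root
bootstrap_project(project_path)

overrides = {"agent_params": {"url": "https://example.com", "mode": "llm"}}
with KedroSession.create(project_path=project_path, extra_params=overrides) as session:
    session.run()  # same effect as `kedro run`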

🔧 Configuration

Agent Parameters

Configure agent behavior in conf/base/parameters.yml:

agent_params:
  task: "Your automation task description"
  url: "https://target-website.com"
  maxRetries: 15              # Number of retry attempts
  mode: "llm"                 # "llm" for AI-driven, "script" for predefined
  headless: false             # true for headless browser operation
  scriptName: null            # filename in test_scripts/ for script mode

Data Catalog Configuration

The data catalog in conf/base/catalog.yml defines data sources and outputs:

  • test_scripts: Input JSON test scripts (PartitionedDataset)
  • agent_recordings: Output browser session recordings as .vbrec files
  • agent_metadata: Recording metadata and analysis results as JSON
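The authoritative definitions live in conf/base/catalog.yml; the sketch below shows one plausible shape, where the dataset types are assumptions based on the file formats described above:

test_scripts:
  type: partitions.PartitionedDataset
  path: data/01_raw/test_scripts
  dataset:
    type: json.JSONDataset

agent_recordings:
  type: partitions.PartitionedDataset
  path: data/08_reporting/recordings
  dataset:
    type: text.TextDataset
  filename_suffix: ".vbrec"

agent_metadata:
  type: partitions.PartitionedDataset
  path: data/08_reporting/metadata
  dataset:
    type: json.JSONDataset
  filename_suffix: ".json"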

Operating Modes

1. LLM Mode (AI-Driven)

  • Agent analyzes the current page state
  • Sends simplified HTML to Azure OpenAI
  • Receives and executes action sequences
  • Records all interactions for later analysis
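Under the hood this loop amounts to a chat-completions call against the configured Azure OpenAI deployment. A minimal sketch, where the prompt text and placeholder values are illustrative rather than the project's actual prompts:

from openai import AzureOpenAI

# Placeholders for illustration; real values come from credentials.yml.
task = "Find the 'More information...' link and click it"
simplified_html = "<html>...</html>"  # stripped-down snapshot of the current page

client = AzureOpenAI(
    api_key="your_api_key",
    api_version="2024-08-01-preview",
    azure_endpoint="https://your_resource_name.openai.azure.com",
)

response = client.chat.completions.create(
    model="your_deployment_name",  # Azure expects the deployment name here
    messages=[
        {"role": "system",
         "content": "You are a browser automation agent. Reply with a JSON list of actions."},
        {"role": "user",
         "content": f"Task: {task}\n\nPage HTML (simplified):\n{simplified_html}"},
    ],
)
actions_text = response.choices[0].message.content  # parsed and executed by the agent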

2. Script Mode (Predefined Actions)

  • Follows a JSON script of predefined actions
  • Useful for regression testing and consistent workflows
  • Scripts are stored in data/01_raw/test_scripts/
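The action schema is defined by browser_agent.py; purely as an illustration, a script file in data/01_raw/test_scripts/ might look like:

{
  "name": "example_flow",
  "actions": [
    {"type": "goto", "url": "https://example.com/login"},
    {"type": "fill", "selector": "#username", "value": "test_user"},
    {"type": "fill", "selector": "#password", "value": "not_a_real_password"},
    {"type": "click", "selector": "button[type=submit]"}
  ]
}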

🧪 Development

Key Components

Core Classes

  • AIBrowserAgent: Main orchestrator for browser automation
  • AIAgentBrowserRecorder: Captures DOM events using rrweb
  • AIAgentAnalyzer: Evaluates agent performance using AI
  • AIAgentBrowserReplay: Replays recorded sessions

Pipeline Architecture

  • Node: run_browser_agent - Executes the agent and returns outputs
  • Inputs: Parameters and API configuration
  • Outputs: Session recordings and metadata
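In Kedro terms that maps to a one-node pipeline. A sketch of what pipeline.py plausibly contains, with input and output names inferred from the catalog and troubleshooting sections of this README:

from kedro.pipeline import Pipeline, node

from .nodes import run_browser_agent

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=run_browser_agent,
                inputs=["params:agent_params", "credentials:azure_openai"],
                outputs=["agent_recordings", "agent_metadata"],
                name="run_browser_agent",
            )
        ]
    )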

Adding New Features

  1. New Action Types: Extend the action handlers in browser_agent.py (see the sketch after this list)
  2. New Analysis Metrics: Modify the analyzer prompts in browser_analyzer.py
  3. New Data Sources: Add datasets to conf/base/catalog.yml
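As an illustration of the first point, a new action type might be wired in roughly like this; the handler names and dispatch dict below are hypothetical, so check how browser_agent.py actually routes actions:

# Hypothetical dispatch table for agent actions (Playwright async API);
# browser_agent.py's real routing may differ.
async def handle_click(page, action):
    await page.click(action["selector"])

async def handle_scroll(page, action):
    # New action type: scroll by a pixel delta (default 600).
    await page.mouse.wheel(0, action.get("delta_y", 600))

ACTION_HANDLERS = {
    "click": handle_click,
    "scroll": handle_scroll,
}

async def execute_action(page, action):
    await ACTION_HANDLERS[action["type"]](page, action)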

📊 Data Flow

  1. Input: User provides task description and target URL
  2. Agent Initialization: Creates browser context and loads AI models
  3. Task Execution:
    • LLM Mode: AI analyzes page → generates actions → executes → repeats
    • Script Mode: Follows predefined action sequence
  4. Recording: All DOM events captured via rrweb
  5. Analysis: AI evaluates performance against original task
  6. Output:
    • .vbrec file containing session recording
    • .json file containing metadata and analysis

🔌 API Endpoints

The FastAPI server provides these REST endpoints:

  • GET /: Dashboard homepage with task submission form
  • POST /start-agent: Start a new agent task
    {
      "task": "Navigate to the login page and sign in",
      "url": "https://example.com",
      "mode": "llm",
      "headless": false
    }
  • GET /recordings: List all recordings with pagination and metadata
  • POST /replay: Replay a specific recording in browser
  • GET /download/{filename}: Download recording and metadata as ZIP
  • GET /test-scripts: List available test scripts
  • GET /test-scripts/{filename}: Retrieve specific test script
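For example, starting a task from Python with the documented request body (the response schema is not documented here, so treat the final print as illustrative):

import requests

resp = requests.post(
    "http://localhost:8000/start-agent",
    json={
        "task": "Navigate to the login page and sign in",
        "url": "https://example.com",
        "mode": "llm",
        "headless": False,
    },
)
resp.raise_for_status()
print(resp.json())  # response shape depends on the server implementation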

๐Ÿณ Environment Variables Reference

# Kedro Configuration
KEDRO_DISABLE_TELEMETRY=true
DO_NOT_TRACK=1

# Azure OpenAI Service
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_RESOURCE_NAME=your_resource_name
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4  # or your deployment
AZURE_OPENAI_API_VERSION=2024-08-01-preview

🚨 Troubleshooting

Installation Issues

ModuleNotFoundError: No module named 'synthetic_data_agent'

# Solution: Install in development mode
pip install -e .
# On externally managed Python installs (e.g. Debian/Ubuntu system Python),
# you may additionally need --break-system-packages.

Playwright browsers not found

# Solution: Install browser binaries
playwright install

kedro-datasets not found

# Solution: Install the datasets package
pip install kedro-datasets

Configuration Issues

Interpolation key 'AZURE_OPENAI_API_KEY' not found

  • Check that .env file exists in project root
  • Verify environment variables don't have quotes around values
  • Ensure python-dotenv is installed

Pipeline input 'credentials:azure_openai' not found

  • Verify conf/local/credentials.yml exists and has correct structure
  • Check that environment variables are properly loaded

Runtime Issues

Agent gets stuck or fails

  • Try running in non-headless mode: headless: false
  • Reduce maxRetries for faster debugging
  • Check console for errors

Recording playback fails

  • Ensure HTML templates are not being ignored by Git
  • Verify templates/replayer.html exists and is properly formatted
  • Check console for rrweb-player loading errors

📚 Additional Resources

Project References

  • Configuration Guide: See conf/README.md for detailed setup instructions
  • API Reference: All endpoints documented with OpenAPI at /docs
  • Data Catalog: Detailed dataset definitions in conf/base/catalog.yml

๐Ÿ“ License

[Add your license information here]

🆘 Support

For issues and questions:

  1. Check troubleshooting section above for common problems
  2. Review configuration files in conf/ directory
  3. Enable debug logging for detailed error information
  4. Open an issue in the repository with error logs and steps to reproduce

โš ๏ธ Important Notes:

  • Environment variables in .env should not have quotes
  • HTML templates need Git ignore exceptions to be tracked
  • Agent requires internet access for AI API calls

Note: This project is actively developed. If imports fail after pulling changes (for example to pyproject.toml or the package layout), re-run pip install -e . to refresh the editable install.
