This guide explains how to create robust AI agents in Python by combining OpenAI's Agents SDK with Web Unlocker API to retrieve and process data from websites.
- What Is OpenAI Agents SDK?
- Major Challenges with This AI Agent Approach
- Integrating Agents SDK with a Web Unlocker API
- Step #1: Project Setup
- Step #2: Install the Project's Dependencies and Get Started
- Step #3: Set Up Environment Variables Reading
- Step #4: Set Up OpenAI Agents SDK
- Step #5: Set Up Web Unlocker API
- Step #6: Create the Web Page Content Extraction Function
- Step #7: Define the Data Models
- Step #8: Initialize the Agent logic
- Step #9: Implement the Execution Loop
- Step #10: Put It All Together
- Step #11: Test the AI Agent
The OpenAI Agents SDK is an open-source Python library created by OpenAI. It enables developers to build agent-based AI applications in a straightforward, efficient, and production-ready manner. This library represents a refined version of OpenAI's earlier experimental project called Swarm.
The OpenAI Agents SDK provides several essential components with minimal abstraction:
- Agents: LLMs coupled with specific instructions and tools to execute tasks
- Handoffs: Enabling agents to transfer tasks to other agents when necessary
- Guardrails: To verify agent inputs to ensure they conform to expected formats or requirements
These core elements, combined with Python's versatility, facilitate the creation of sophisticated interactions between agents and tools.
The SDK also features built-in tracing capabilities, allowing you to visualize, troubleshoot, and assess your agent workflows. It even supports model fine-tuning for your particular use cases.
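Before diving into the full project, here's what the SDK's core loop looks like in its simplest form. This is a minimal sketch based on the SDK's basic usage pattern (the agent name and prompt are illustrative):

from agents import Agent, Runner

# Define a bare-bones agent: a name plus an instruction string
agent = Agent(
    name="Assistant",
    instructions="You are a helpful assistant.",
)

# Run it synchronously and print the model's final answer
result = Runner.run_sync(agent, "Write a haiku about web scraping.")
print(result.final_output)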
Most AI agents aim to automate operations on web pages, whether extracting content or interacting with page elements. Essentially, they need to programmatically navigate the Web.
Beyond potential misinterpretations from the AI model itself, the most significant obstacle these agents encounter is dealing with websites' defensive mechanisms. This occurs because many sites implement anti-bot and anti-scraping technologies that can restrict or misdirect AI agents. This is particularly relevant today, as anti-AI CAPTCHAs and sophisticated bot detection systems become increasingly prevalent.
To overcome these obstacles, you need to enhance your agent's web navigation capabilities by integrating it with a solution like Bright Data's Web Unlocker API. This tool works with any HTTP client or solution that connects to the Internet (including AI agents), serving as a web-unlocking gateway. It provides clean, unblocked HTML from any webpage. No more CAPTCHAs, IP restrictions, or inaccessible content.
In this guided section, you'll discover how to integrate the OpenAI Agents SDK with Bright Data's Web Unlocker API to construct an AI agent capable of:
- Creating summaries of text from any web page
- Obtaining structured product information from e-commerce websites
- Collecting key details from news articles
To accomplish this, the agent will instruct the OpenAI Agents SDK to utilize the Web Unlocker API as a mechanism for obtaining the content of any web page. Once the content is acquired, the agent will apply AI logic to extract and format the data as required for each task.
Disclaimer:
The three use cases mentioned above are merely examples. The methodology presented here can be extended to numerous other scenarios by customizing the agent's behavior.
Follow these instructions to develop an AI scraping agent in Python using the OpenAI Agents SDK and Bright Data's Web Unlocker API for optimal performance.
Before starting this tutorial, ensure you have the following:
- Python 3.9 or higher installed on your computer
- An active Bright Data account
- An active OpenAI account
- A fundamental understanding of HTTP requests
- Some familiarity with Pydantic models
- A general understanding of AI agent functionality
First, verify that Python 3 is installed on your system. If not, download Python and follow the installation instructions for your operating system.
Launch your terminal and create a new directory for your scraping agent project:
mkdir openai-sdk-agent
The openai-sdk-agent directory will house all the code for your Python-based, Agents SDK-powered agent.
Move into the project directory and establish a virtual environment:
cd openai-sdk-agent
python -m venv venv
Open the project directory in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are excellent options.
Within the openai-sdk-agent directory, create a new Python file named agent.py. Your directory structure should now appear as follows:

openai-sdk-agent/
├── venv/
└── agent.py
Currently, agent.py is an empty Python script, but it will soon contain the desired AI agent logic.
In the IDE's terminal, activate the virtual environment. On Linux or macOS, execute this command:

source venv/bin/activate

Similarly, on Windows, run:

venv\Scripts\activate
This project utilizes the following Python libraries:
- openai-agents: The OpenAI Agents SDK, used for creating AI agents in Python.
- requests: For connecting to Bright Data's Web Unlocker API and retrieving the HTML content of a web page for the AI agent to process. Learn more in our guide on mastering the Python Requests library.
- pydantic: For defining structured output models, allowing the agent to return data in a clear and validated format.
- markdownify: For converting raw HTML content into clean Markdown. (We'll explain the benefits of this shortly.)
- python-dotenv: For loading environment variables from a .env file. This is where we'll store credentials for OpenAI and Bright Data.
In an activated virtual environment, install them all with:
pip install requests pydantic openai-agents markdownify python-dotenv
Now, set up agent.py with these imports and async boilerplate code:
import asyncio
from agents import Agent, RunResult, Runner, function_tool
import requests
from pydantic import BaseModel
from markdownify import markdownify as md
from dotenv import load_dotenv

# AI agent logic...

async def run():
    # Call the async AI agent logic...
    pass

if __name__ == "__main__":
    asyncio.run(run())
Create a .env file in your project directory. This file will store your environment variables, such as API keys and secret tokens.

To load the environment variables from the .env file, use load_dotenv() from the dotenv package:

load_dotenv()

You can now access specific environment variables using os.getenv() like this:

os.getenv("ENV_NAME")

Remember to import os from the Python standard library:

import os
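For reference, here's how those pieces fit together at the top of a script (ENV_NAME is just a placeholder):

import os
from dotenv import load_dotenv

# Read key-value pairs from the .env file into the process environment
load_dotenv()

# Retrieve a single variable; returns None if it isn't defined
env_value = os.getenv("ENV_NAME")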
You need a valid OpenAI API key to use the OpenAI Agents SDK. If you haven't generated one yet, follow OpenAI's official guide to create your API key.
After obtaining it, add the key to your .env file like this:

OPENAI_API_KEY="<YOUR_OPENAI_KEY>"

Make sure to replace the <YOUR_OPENAI_KEY> placeholder with your actual key.

No additional setup is necessary, as the openai-agents SDK is designed to automatically retrieve the API key from the OPENAI_API_KEY environment variable.
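If you'd rather not rely on the key being picked up implicitly, the SDK also exposes a helper to set it programmatically. A minimal sketch, assuming the key is still stored in .env:

import os
from dotenv import load_dotenv
from agents import set_default_openai_key

load_dotenv()

# Hand the key to the Agents SDK explicitly instead of relying on implicit pickup
set_default_openai_key(os.getenv("OPENAI_API_KEY"))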
If you don't already have one, create a Bright Data account. Otherwise, simply log in.
Next, consult Bright Data's official Web Unlocker documentation to obtain your API token. Alternatively, follow these steps.
In your Bright Data "User Dashboard" page, select the "Get proxy products" option:
In the products table, find the row labeled "unblocker" and click on it:
On the "unlocker" page, copy your API token using the clipboard icon:
Also, verify that the toggle in the top-right corner is switched to "On," indicating that the Web Unlocker product is active.
Under the "Configuration" tab, make sure these options are enabled for optimal effectiveness:
In the .env file, add this environment variable:
BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN="<YOUR_BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN>"
Replace the placeholder with your actual API token.
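Before wiring the API into the agent, you can optionally verify that your token and zone work with a quick standalone request. This sketch uses the same endpoint and payload the agent will rely on later (example.com is just a test target):

import os
import requests
from dotenv import load_dotenv

load_dotenv()

# Call the Web Unlocker API directly with a simple test page
response = requests.post(
    "https://api.brightdata.com/request",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.getenv('BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN')}",
    },
    json={"zone": "unblocker", "url": "https://example.com", "format": "raw"},
)

# A 200 status with HTML in the body means the token and zone are configured correctly
print(response.status_code)
print(response.text[:200])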
Create a get_page_content() function that:

- Reads the BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN environment variable
- Uses requests to send a request to Bright Data's Web Unlocker API with the provided URL
- Retrieves the raw HTML returned by the API
- Converts the HTML to Markdown and returns it

Implement the above logic as follows:
@function_tool
def get_page_content(url: str) -> str:
    """
    Retrieves the HTML content of a given web page using Bright Data's Web Unlocker API,
    bypassing anti-bot protections. The response is converted from raw HTML to Markdown
    for easier and cheaper processing.

    Args:
        url (str): The URL of the web page to scrape.

    Returns:
        str: The Markdown-formatted content of the requested page.
    """
    # Read Bright Data's Web Unlocker API token from the envs
    BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN = os.getenv("BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN")

    # Configure the Web Unlocker API call
    api_url = "https://api.brightdata.com/request"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN}"
    }
    data = {
        "zone": "unblocker",
        "url": url,
        "format": "raw"
    }

    # Make the call to Web Unlocker to retrieve the unblocked HTML of the target page
    response = requests.post(api_url, headers=headers, data=json.dumps(data))

    # Extract the raw HTML response
    html = response.text

    # Convert the HTML to Markdown and return it
    markdown_text = md(html)
    return markdown_text
Note 1: The function must be annotated with the @function_tool decorator. This special decorator informs the OpenAI Agents SDK that the function can be used as a tool by an agent to perform specific actions. In this case, the function serves as the "engine" the agent can utilize to retrieve the content of the web page it will process. Also note that the function relies on json.dumps(), so make sure the json standard-library module is imported (we'll add import json in Step 9).

Note 2: The get_page_content() function must explicitly declare the input types. If you omit them, you'll encounter an error like: Error getting response: Error code: 400 - {'error': {'message': "Invalid schema for function 'get_page_content': In context=('properties', 'url'), schema must have a 'type' key."}}
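For clarity, here's the kind of untyped tool signature that triggers that error, next to the corrected one (a hypothetical minimal variant, not part of the project code):

from agents import function_tool

# Broken: without the type hint, the SDK cannot build a valid JSON schema
# for the tool's parameters, and the OpenAI API rejects it with a 400 error
@function_tool
def fetch_page(url):
    return ""

# Correct: the explicit `url: str` annotation gives the schema its 'type' key
@function_tool
def fetch_page_typed(url: str) -> str:
    return ""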
Converting raw HTML to Markdown improves performance efficiency and cost-effectiveness because HTML is highly verbose and often contains unnecessary elements like scripts, styles, and metadata. AI agents don't need this content. If your agent only requires the essentials like text, links, and images, Markdown provides a much cleaner and more compact representation.
Specifically, the HTML-to-Markdown transformation can reduce the input size by up to 99%, saving both:
- Tokens, which reduces costs when using OpenAI models
- Processing time, since models operate faster on smaller inputs
For more insights, read the article "Why Are the New AI Agents Choosing Markdown Over HTML?"
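To see the conversion in action, here's a tiny self-contained example of the markdownify call used above (the HTML fragment is made up for illustration):

from markdownify import markdownify as md

# A verbose HTML fragment with wrapper markup the agent doesn't need
html = '<div class="wrapper"><h1>Title</h1><p>Hello, <a href="https://example.com">world</a>!</p></div>'

# The Markdown output keeps the heading, text, and link, and drops the wrapper markup
print(md(html))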
For proper operation, OpenAI SDK agents require Pydantic models to define the expected structure of their output data. Remember that the agent we're building can return one of three possible outputs:
- A summary of the page
- Product information
- News article information
Let's define three corresponding Pydantic models:
class Summary(BaseModel):
    summary: str

class Product(BaseModel):
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    ratings: Optional[int] = None
    rating_score: Optional[float] = None

class News(BaseModel):
    title: str
    subtitle: Optional[str] = None
    authors: Optional[List[str]] = None
    text: str
    publication_date: Optional[str] = None
Note: Using Optional makes your agent more versatile and general-purpose. Not all pages will include every piece of data defined in the schema, so this flexibility helps prevent errors when fields are missing.

Don't forget to import Optional and List from typing:

from typing import Optional, List
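As a quick illustration of how these models behave, you can instantiate one directly and serialize it (the values below are made up, and the snippet assumes the Product model defined above):

# Hypothetical values, just to show validation and serialization
product = Product(name="PlayStation 5", price=499.0, currency="USD")

# Unset optional fields fall back to None instead of raising a validation error
print(product.model_dump())
# {'name': 'PlayStation 5', 'price': 499.0, 'currency': 'USD', 'ratings': None, 'rating_score': None}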
Use the Agent class from the openai-agents SDK to define the three specialized agents:
summarization_agent = Agent(
    name="Text Summarization Agent",
    instructions="You are a content summarization agent that summarizes the input text.",
    tools=[get_page_content],
    output_type=Summary,
)

product_info_agent = Agent(
    name="Product Information Agent",
    instructions="You are a product parsing agent that extracts product details from text.",
    tools=[get_page_content],
    output_type=Product,
)

news_info_agent = Agent(
    name="News Information Agent",
    instructions="You are a news parsing agent that extracts relevant news details from text.",
    tools=[get_page_content],
    output_type=News,
)
Each agent:

- Contains a clear instruction string that describes its intended function. The OpenAI Agents SDK uses this to guide the agent's behavior.
- Uses get_page_content() as a tool to retrieve the input data (i.e., the content of the web page).
- Returns its output in one of the Pydantic models (Summary, Product, or News) defined earlier.
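Before wiring up routing, you can sanity-check a single specialized agent on its own. Here's a minimal sketch (the URL is a placeholder, and the snippet assumes the summarization_agent defined above):

import asyncio
from agents import Runner

async def test_summarizer():
    # The agent calls get_page_content() as a tool, then returns a Summary model
    result = await Runner.run(
        summarization_agent,
        input="Summarize this page: https://example.com",
    )
    print(result.final_output.summary)

asyncio.run(test_summarizer())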
To automatically direct user requests to the appropriate specialized agent, define a higher-level agent:
routing_agent = Agent(
    name="Routing Agent",
    instructions=(
        "You are a high-level decision-making agent. Based on the user's request, "
        "hand off the task to the appropriate agent."
    ),
    handoffs=[summarization_agent, product_info_agent, news_info_agent],
)
This is the agent you'll query in your run() function to drive the AI agent logic.
In the run() function, add this loop to launch your AI agent logic:
# Keep iterating until the user types "exit"
while True:
    # Read the user's request
    request = input("Your request -> ")

    # Stop the execution if the user types "exit"
    if request.lower() in ["exit"]:
        print("Exiting the agent...")
        break

    # Read the page URL to operate on
    url = input("Page URL -> ")

    # Route the user's request to the right agent
    output = await Runner.run(routing_agent, input=f"{request} {url}")

    # Convert the agent's output to a JSON string
    json_output = json.dumps(output.final_output.model_dump(), indent=4)
    print(f"Output -> \n{json_output}\n\n")
This loop continuously monitors for user input and processes each request by routing it to the appropriate agent (summary, product, or news). It combines the user's query with the target URL, executes the logic, and then displays the structured result in JSON format using the json module. Import it with:

import json
Your agent.py file should now contain:
import asyncio
from agents import Agent, RunResult, Runner, function_tool
import requests
from pydantic import BaseModel
from markdownify import markdownify as md
from dotenv import load_dotenv
import os
from typing import Optional, List
import json

# Load the environment variables from the .env file
load_dotenv()

# Define the Pydantic output models for your AI agent
class Summary(BaseModel):
    summary: str

class Product(BaseModel):
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    ratings: Optional[int] = None
    rating_score: Optional[float] = None

class News(BaseModel):
    title: str
    subtitle: Optional[str] = None
    authors: Optional[List[str]] = None
    text: str
    publication_date: Optional[str] = None

@function_tool
def get_page_content(url: str) -> str:
    """
    Retrieves the HTML content of a given web page using Bright Data's Web Unlocker API,
    bypassing anti-bot protections. The response is converted from raw HTML to Markdown
    for easier and cheaper processing.

    Args:
        url (str): The URL of the web page to scrape.

    Returns:
        str: The Markdown-formatted content of the requested page.
    """
    # Read Bright Data's Web Unlocker API token from the envs
    BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN = os.getenv("BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN")

    # Configure the Web Unlocker API call
    api_url = "https://api.brightdata.com/request"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {BRIGHT_DATA_WEB_UNLOCKER_API_TOKEN}"
    }
    data = {
        "zone": "unblocker",
        "url": url,
        "format": "raw"
    }

    # Make the call to Web Unlocker to retrieve the unblocked HTML of the target page
    response = requests.post(api_url, headers=headers, data=json.dumps(data))

    # Extract the raw HTML response
    html = response.text

    # Convert the HTML to Markdown and return it
    markdown_text = md(html)
    return markdown_text

# Define the individual OpenAI agents
summarization_agent = Agent(
    name="Text Summarization Agent",
    instructions="You are a content summarization agent that summarizes the input text.",
    tools=[get_page_content],
    output_type=Summary,
)
product_info_agent = Agent(
    name="Product Information Agent",
    instructions="You are a product parsing agent that extracts product details from text.",
    tools=[get_page_content],
    output_type=Product,
)
news_info_agent = Agent(
    name="News Information Agent",
    instructions="You are a news parsing agent that extracts relevant news details from text.",
    tools=[get_page_content],
    output_type=News,
)

# Define a high-level routing agent that delegates tasks to the appropriate specialized agent
routing_agent = Agent(
    name="Routing Agent",
    instructions=(
        "You are a high-level decision-making agent. Based on the user's request, "
        "hand off the task to the appropriate agent."
    ),
    handoffs=[summarization_agent, product_info_agent, news_info_agent],
)

async def run():
    # Keep iterating until the user types "exit"
    while True:
        # Read the user's request
        request = input("Your request -> ")

        # Stop the execution if the user types "exit"
        if request.lower() in ["exit"]:
            print("Exiting the agent...")
            break

        # Read the page URL to operate on
        url = input("Page URL -> ")

        # Route the user's request to the right agent
        output = await Runner.run(routing_agent, input=f"{request} {url}")

        # Convert the agent's output to a JSON string
        json_output = json.dumps(output.final_output.model_dump(), indent=4)
        print(f"Output -> \n{json_output}\n\n")

if __name__ == "__main__":
    asyncio.run(run())
To launch your AI agent, execute:
python agent.py
Let's say you want to summarize the content from Bright Data's AI services hub. Enter a summarization request followed by the page's URL, and the agent will hand the task off to the summarization agent and print a JSON-formatted Summary result.
Now, imagine you want to extract product data from an Amazon product page, such as a PS5 listing. Typically, Amazon's CAPTCHA and anti-bot systems would block such a request. With the Web Unlocker API, your AI agent can access and analyze the page without being blocked. The output will be something like:
{
    "name": "PlayStation\u00ae5 console (slim)",
    "price": 499.0,
    "currency": "USD",
    "ratings": 6321,
    "rating_score": 4.7
}
Finally, suppose you want to get structured news information from a Yahoo News article:
Accomplish this with the following input:
Your request -> Give me news info
Page URL -> https://www.yahoo.com/news/pope-francis-dies-88-080859417.html
The result will be:
{
    "title": "Pope Francis Dies at 88",
    "subtitle": null,
    "authors": [
        "Nick Vivarelli",
        "Wilson Chapman"
    ],
    "text": "Pope Francis, the 266th Catholic Church leader who tried to position the church to be more inclusive, died on Easter Monday, Vatican officials confirmed. He was 88. (omitted for brevity...)",
    "publication_date": "Mon, April 21, 2025 at 8:08 AM UTC"
}
Combining the OpenAI SDK with Bright Data's Web Unlocker API enables you to develop AI agents that can reliably operate on virtually any web page. This is just one example of how Bright Data's products and services can support advanced AI integrations.
Explore our complete range of AI products: autonomous AI agents, vertical AI apps, foundation models, multimodal AI, data providers, data packages, and more.
Create a Bright Data account and try all our products and services for AI agent development today!