This repository contains quick start examples for integrating AI-powered voice agents into VideoSDK meetings using different LLM providers (OpenAI, Google Gemini LiveAPI, and AWS Nova Sonic). Featured: a complete Agent to Agent (A2A) multi-agent system implementation, plus support for virtual avatars (realistic, lip-synced avatars that mirror speech in real time and give your AI agents a visual, human-like presence).
The VideoSDK AI Agent framework is a Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This framework serves as a real-time bridge between AI models (like OpenAI, Google Gemini LiveAPI, and AWS) and your users, facilitating seamless voice and media interactions.
The framework offers two distinct approaches to building AI agents:
- Integrated Real-time Pipelines: Use providers like Google Gemini Live API for end-to-end, low-latency conversational AI with built-in STT, LLM, and TTS capabilities.
- Cascading Pipelines: Build custom AI agents by mixing and matching different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). This approach gives you complete control over your agent's architecture, allowing you to optimize for cost, performance, language support, or specific use cases. (A sketch contrasting the two styles follows below.)
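As a rough sketch of the difference, the snippet below constructs one pipeline of each kind. The class names and import paths (`RealTimePipeline`, `CascadingPipeline`, and the provider plugins) are assumptions based on this repo's examples; check the example folders for the exact, working imports.

```python
# Sketch only: import paths, class names, and constructor arguments are
# assumptions -- see the example folders in this repo for working code.
from videosdk.agents import CascadingPipeline, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# 1. Integrated real-time pipeline: one provider handles STT + LLM + TTS.
realtime_pipeline = RealTimePipeline(model=GeminiRealtime())

# 2. Cascading pipeline: pick a provider per stage.
cascading_pipeline = CascadingPipeline(
    stt=DeepgramSTT(),    # Speech-to-Text
    llm=OpenAILLM(),      # Large Language Model
    tts=ElevenLabsTTS(),  # Text-to-Speech
)
```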
- Your Backend: Hosts the Worker and Agent Job that powers the AI agents
- VideoSDK Cloud: Manages the meeting rooms where agents and users interact in real time
- Client SDK: Applications on user devices (web, mobile, or SIP) that connect to VideoSDK meetings
- Voice-Enabled AI Agents: Integrate AI agents that can speak and listen in real-time meetings
- Multiple LLM Providers: Support for OpenAI, Google Gemini LiveAPI, and AWS Nova Sonic
- Modular & Flexible Pipelines: Choose between integrated real-time pipelines or build your own with the `CascadingPipeline` to mix and match STT, LLM, and TTS providers
- 🤝 Agent to Agent (A2A) Communication: Enable specialized agents to collaborate and share domain expertise
- Function Tools: Enable your agents with capabilities like retrieving data or performing actions
- Real-time Communication: Seamless integration with VideoSDK's real-time communication platform
- Vision Support: Direct video input from VideoSDK rooms to Gemini Live by setting `vision=True` in the session context; a sketch follows this list. (Note: Vision is exclusively supported with Gemini models via the Gemini Live API)
- Virtual Avatar: Enhance your AI agents with realistic, lip-synced virtual avatars using the Simli integration. Create more engaging and interactive experiences. (Works with both RealtimePipeline and CascadingPipeline approaches)
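Vision, for instance, might be enabled like this when creating the session. Everything here except the `vision=True` flag (the `AgentSession` name and context keys) is an assumption, so follow the Gemini LiveAPI example for the exact shape.

```python
# Sketch only: the session/context shape is an assumption. The doc-backed
# part is vision=True, which streams room video frames to Gemini Live.
session = AgentSession(
    agent=my_agent,                  # your Agent subclass instance
    pipeline=gemini_live_pipeline,   # vision requires a Gemini Live pipeline
    context={
        "meetingId": meeting_id,     # room the agent should join
        "name": "AI Assistant",
        "vision": True,              # enable direct video input from the room
    },
)
```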
The CascadingPipeline approach is particularly powerful for:
- Cost Optimization: Mix premium and cost-effective services (e.g., use Deepgram for STT, OpenAI for LLM, and a budget TTS provider)
- Multi-language Support: Use specialized STT providers for different languages while keeping the same LLM
- Performance Tuning: Choose the fastest provider for each component based on your requirements
- Compliance & Regional Requirements: Use specific providers that meet your regulatory or data residency needs
- Custom Processing: Add your own logic between STT and LLM processing through `ConversationFlow` (sketched after this list)
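As an illustration of that custom processing, a flow subclass might enrich the transcript before the LLM sees it. The hook names (`run`, `process_with_llm`) are assumptions, and `retrieve_docs` is a hypothetical helper; the Conversation Flow documentation has the real interface.

```python
# Sketch only: hook names are assumptions; see the Conversation Flow docs.
from videosdk.agents import ConversationFlow  # import path is an assumption

def retrieve_docs(query: str) -> str:
    # Stand-in for your own retrieval logic (vector search, memory, etc.).
    return "..."

class RAGConversationFlow(ConversationFlow):
    async def run(self, transcript: str):
        # Runs between STT output and the LLM call.
        enriched = f"Relevant context:\n{retrieve_docs(transcript)}\n\nUser said: {transcript}"
        async for chunk in self.process_with_llm(enriched):
            yield chunk  # stream the LLM response onward to TTS
```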
The SDK is built around several core components that work together to create powerful AI agents:
- Agent: The base class for defining your agent's identity, including its instructions, tools (functions), and connections to external services via MCP.
- Pipeline: Manages the real-time flow of audio and data between the user and the AI models. The SDK offers two types of pipelines:
  - `RealtimePipeline`: An all-in-one pipeline for providers like Google Gemini Live, optimized for low-latency, conversational AI.
  - `CascadingPipeline`: A modular pipeline that gives you the flexibility to mix and match different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). This allows you to tailor your agent's stack for cost, performance, or specific language needs. See our Cascading Pipeline example to learn more.
- Conversation Flow: An inheritable class that works with the `CascadingPipeline` to let you define custom turn-taking logic, preprocess transcripts, and integrate memory or Retrieval-Augmented Generation (RAG) before the LLM is called.
- Agent Session: Manages the agent's lifecycle within a VideoSDK meeting, bringing together the agent, pipeline, and conversation flow to create a seamless interactive experience. (A minimal example follows this list.)
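Put together, a minimal agent might look like the sketch below. The `on_enter` hook and `session.say` call follow the pattern used across this repo's examples, but verify them against your installed SDK version.

```python
# Sketch only: based on the pattern in this repo's examples.
from videosdk.agents import Agent, AgentSession  # import path is an assumption

class GreeterAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a friendly meeting assistant. Keep replies brief.",
        )

    async def on_enter(self):
        # Called when the agent joins the room.
        await self.session.say("Hi! How can I help you today?")

session = AgentSession(
    agent=GreeterAgent(),
    pipeline=realtime_pipeline,  # either pipeline type shown earlier
)
```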
Our featured A2A implementation enables seamless collaboration between specialized AI agents, similar to Google's A2A protocol. This allows different agents to communicate, share knowledge, and coordinate responses based on their unique capabilities.
- Agent Registration: Agents register themselves with an `AgentCard` containing their capabilities and domain expertise (see the sketch after this list)
- Client Query: Client sends a query to the main agent
- Agent Discovery: Main agent discovers relevant specialist agents using agent cards
- Query Forwarding: Main agent forwards specialized queries to appropriate agents
- Response Chain: Specialist agents process queries and respond back to the main agent
- Client Response: Main agent formats and delivers the final response to the client
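A hedged sketch of step 1, registration, is below. The `AgentCard` fields and the `register_a2a` method are assumptions drawn from the description above; the A2A README documents the real API.

```python
# Sketch only: field and method names are assumptions; see the A2A README
# for the actual registration API (and whether it must be awaited).
from videosdk.agents import AgentCard  # import path is an assumption

loan_agent.register_a2a(AgentCard(
    id="loan_specialist",
    name="Loan Specialist Agent",
    domain="loan",                        # used for query routing
    capabilities=["loan rates", "loan eligibility"],
    description="Text-based specialist for personal loan queries",
))
```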
When a user asks about loan rates, the Customer Service Agent (with audio capabilities) automatically forwards the query to the Loan Agent (text-based specialist), receives the expert response, and relays it back to the user - all within a single conversation flow.
```
Client → "I want to know about personal loan rates"
        ↓
Customer Service Agent → Discovers Loan Specialist Agent
        ↓
Customer Service Agent → Forwards loan query to Loan Specialist
        ↓
Loan Specialist → Processes query and responds back (text format)
        ↓
Customer Service Agent → Relays response to client (audio format)
```
- Multi-Modal Communication: Audio agents for user interaction, text agents for specialized processing
- Domain Specialization: Customer service agents coordinate with loan specialists, tech support, financial advisors
- Intelligent Query Routing: Automatic detection and forwarding of domain-specific queries
- Real-Time Collaboration: Agents communicate seamlessly without user intervention
For detailed A2A implementation, see the A2A README.
Before you begin, ensure you have:
- Python 3.12 or higher
- A VideoSDK authentication token (generate from app.videosdk.live)
- A VideoSDK meeting ID (you can generate one using the Create Room API)
- API key for your chosen LLM provider (OpenAI, Google Gemini LiveAPI, or AWS)
- Client-side implementation with any VideoSDK SDK
For the fastest setup, install all dependencies at once using the provided requirements file:
```bash
# 1. Clone this repository
git clone https://github.com/videosdk-live/agents-quickstart

# 2. Navigate to the project directory
cd agents-quickstart

# 3. Create and activate a virtual environment with Python 3.12 or higher
# On macOS/Linux
python3.12 -m venv venv
source venv/bin/activate
# On Windows
python -m venv venv
venv\Scripts\activate

# 4. Install all dependencies from requirements.txt
pip install -r requirements.txt
```

Alternatively, you can install packages individually:
- Clone this repository:

```bash
git clone https://github.com/videosdk-live/agents-quickstart
```

- Create and activate a virtual environment with Python 3.12 or higher:

```bash
# On macOS/Linux
python3.12 -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate
```

- Install the base package:

```bash
pip install videosdk-agents
```

- Then navigate to your choice of example available:
- 🤝 Agent to Agent (A2A) Multi-Agent System – Featured
- 🎭 Virtual Avatar Examples – With Simli Integration
- OpenAI Agent
- Google Gemini LiveAPI Agent
- Cascading Pipeline Agent
- AWS Nova Sonic Agent
- 🔌 MCP Server Examples
All agent examples include Model Context Protocol (MCP) support for connecting to external data sources and tools:
- Local MCP Servers: Use `MCPServerStdio` for development and testing
- Remote MCP Services: Use `MCPServerHTTP` for production integrations
- Multiple Servers: Connect to various data sources simultaneously (sketched below)
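An agent might attach both kinds of servers at once, along the lines of this sketch; the import path and constructor arguments are assumptions, so use the MCP Server README for working parameters.

```python
# Sketch only: constructor arguments are assumptions; see the MCP Server
# README for tested examples. MyAgent is your own Agent subclass.
from videosdk.agents import MCPServerStdio, MCPServerHTTP  # assumed path

agent = MyAgent(
    mcp_servers=[
        # Local server launched as a subprocess (development and testing)
        MCPServerStdio(command="python", args=["./mcp_stdio_server.py"]),
        # Remote service reached over HTTP (production)
        MCPServerHTTP(url="https://your-mcp-service.example.com/mcp"),
    ],
)
```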
For detailed MCP integration examples, see the MCP Server README.
It's recommended to use environment variables for secure storage of API keys and tokens. Create a .env file in your project root:
```
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
```
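Your agent code can then load the token at startup, for example with python-dotenv (one option among many):

```python
# Load variables from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()
auth_token = os.getenv("VIDEOSDK_AUTH_TOKEN")
```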
Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:

```bash
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json"
```

For more details on the Create Room API, refer to the VideoSDK documentation.
After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting:
When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.
All quickstart examples are configured to run in playground mode by default (`playground=True`). When you run an agent, a direct link to the VideoSDK Playground is printed in your console. Open this link in your browser to interact with your agent without needing a separate client application.
```
Agent started in playground mode
Interact with agent here at:
https://playground.videosdk.live?token=...&meetingId=...
```
```
agents-quickstart/
│
├── A2A/                             # Featured: Complete A2A multi-agent system
│   ├── agents/
│   │   ├── customer_agent.py        # Voice-enabled customer service agent
│   │   ├── loan_agent.py            # Text-based loan specialist agent
│   │   └── README.md                # Detailed A2A implementation guide
│   ├── session_manager.py           # Session and pipeline management
│   ├── main.py                      # A2A system entry point
│   └── README.md                    # A2A overview and setup
│
├── Virtual Avatar/                  # Simli virtual avatar integration examples
│   ├── simli_cascading_example.py   # Cascading pipeline with Simli avatar
│   ├── simli_realtime_example.py    # Realtime pipeline with Simli avatar
│   └── README.md                    # Virtual avatar setup and configuration
│
├── OpenAI/                          # OpenAI-based agent examples
├── Google Gemini (LiveAPI)/         # Google Gemini LiveAPI examples
├── Cascading Pipeline/              # Example of a modular pipeline
├── AWS Nova Sonic/                  # AWS Nova Sonic examples
├── MCP Server/                      # Model Context Protocol examples
├── requirements.txt                 # All dependencies
└── README.md                        # This file
```
For more information about VideoSDK AI Agents:
- Official Documentation
- AI Voice Agent Quick Start Guide
- Core Components Overview
- Cascading Pipeline Documentation
- Conversation Flow Documentation
- MCP Integration
- A2A Integration Documentation
- Virtual Avatar
🤝 Join our Discord community for support and discussions.
Made with ❤️ by the VideoSDK Team