
🚀 VideoSDK AI Agent Quick Start

This repository contains quick start examples for integrating AI-powered voice agents into VideoSDK meetings using different LLM providers (OpenAI, Google Gemini LiveAPI, and AWS Nova Sonic). Featured: a complete Agent to Agent (A2A) multi-agent system implementation, plus virtual avatar support for realistic, lip-synced avatars that mirror speech in real time and give your AI agents a visual, human-like presence.

What are VideoSDK AI Agents?

The VideoSDK AI Agent framework is a Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This framework serves as a real-time bridge between AI models (like OpenAI, Google Gemini LiveAPI, and AWS Nova Sonic) and your users, facilitating seamless voice and media interactions.

The framework offers two distinct approaches to building AI agents:

  1. Integrated Real-time Pipelines: Use providers like Google Gemini Live API for end-to-end, low-latency conversational AI with built-in STT, LLM, and TTS capabilities.

  2. Cascading Pipelines: Build custom AI agents by mixing and matching different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). This approach gives you complete control over your agent's architecture, allowing you to optimize for cost, performance, language support, or specific use cases.
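
To make the first approach concrete, here is a minimal sketch of a realtime agent. Import paths, the plugin module, the model ID, and constructor arguments are assumptions modeled on this repository's examples and may differ from the installed SDK, so treat it as illustrative rather than exact:

# Minimal sketch of an integrated realtime agent (assumed imports/signatures;
# see the Google Gemini (LiveAPI) example in this repo for the exact code).
from videosdk.agents import Agent, AgentSession, RealtimePipeline
from videosdk.plugins.google import GeminiRealtime  # assumed plugin module

class GreeterAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a friendly voice assistant.")

    async def on_enter(self):
        # Speak as soon as the agent joins the room.
        await self.session.say("Hello! How can I help you today?")

# Model ID is illustrative; use whichever Gemini Live model the example specifies.
pipeline = RealtimePipeline(model=GeminiRealtime(model="gemini-2.0-flash-live-001"))
session = AgentSession(agent=GreeterAgent(), pipeline=pipeline)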

Architecture Overview

  • Your Backend: Hosts the Worker and Agent Job that powers the AI agents
  • VideoSDK Cloud: Manages the meeting rooms where agents and users interact in real time
  • Client SDK: Applications on user devices (web, mobile, or SIP) that connect to VideoSDK meetings

✨ Key Features

  • Voice-Enabled AI Agents: Integrate AI agents that can speak and listen in real-time meetings
  • Multiple LLM Providers: Support for OpenAI, Google Gemini LiveAPI, and AWS Nova Sonic
  • Modular & Flexible Pipelines: Choose between integrated real-time pipelines or build your own with the CascadingPipeline to mix and match STT, LLM, and TTS providers
  • πŸ€– Agent to Agent (A2A) Communication: Enable specialized agents to collaborate and share domain expertise
  • Function Tools: Enable your agents with capabilities like retrieving data or performing actions
  • Real-time Communication: Seamless integration with VideoSDK's real-time communication platform
  • Vision Support: Direct video input from VideoSDK rooms to Gemini Live by setting vision=True in the session context.(Note: Vision is exclusively supported with Gemini models via the Gemini Live API)
  • Virtual Avatar: Enhance your AI agents with realistic, lip-synced virtual avatars using the Simli integration. Create more engaging and interactive experiences.(Works with both RealtimePipeline and CascadingPipeline approaches)

🔧 Why Choose Cascading Pipeline?

The CascadingPipeline approach is particularly powerful for:

  • Cost Optimization: Mix premium and cost-effective services (e.g., use Deepgram for STT, OpenAI for LLM, and a budget TTS provider)
  • Multi-language Support: Use specialized STT providers for different languages while keeping the same LLM
  • Performance Tuning: Choose the fastest provider for each component based on your requirements
  • Compliance & Regional Requirements: Use specific providers that meet your regulatory or data residency needs
  • Custom Processing: Add your own logic between STT and LLM processing through ConversationFlow

🧠 Core Components

The SDK is built around several core components that work together to create powerful AI agents:

  • Agent: The base class for defining your agent's identity, including its instructions, tools (functions), and connections to external services via MCP.
  • Pipeline: Manages the real-time flow of audio and data between the user and the AI models. The SDK offers two types of pipelines:
    • RealtimePipeline: An all-in-one pipeline for providers like Google Gemini Live, optimized for low-latency, conversational AI.
    • CascadingPipeline: A modular pipeline that gives you the flexibility to mix and match different providers for Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS). This allows you to tailor your agent's stack for cost, performance, or specific language needs. See our Cascading Pipeline example to learn more.
  • Conversation Flow: An inheritable class that works with the CascadingPipeline to let you define custom turn-taking logic, preprocess transcripts, and integrate memory or Retrieval-Augmented Generation (RAG) before the LLM is called.
  • Agent Session: Manages the agent's lifecycle within a VideoSDK meeting, bringing together the agent, pipeline, and conversation flow to create a seamless interactive experience.

🤖 Agent to Agent (A2A) Multi-Agent System

Our featured A2A implementation enables seamless collaboration between specialized AI agents, similar to Google's A2A protocol. This allows different agents to communicate, share knowledge, and coordinate responses based on their unique capabilities.

How A2A Works

  1. Agent Registration: Agents register themselves with an AgentCard containing their capabilities and domain expertise
  2. Client Query: Client sends a query to the main agent
  3. Agent Discovery: Main agent discovers relevant specialist agents using agent cards
  4. Query Forwarding: Main agent forwards specialized queries to appropriate agents
  5. Response Chain: Specialist agents process queries and respond back to the main agent
  6. Client Response: Main agent formats and delivers the final response to the client
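
For steps 1 and 3, registration and discovery revolve around the AgentCard. The sketch below assumes field names and a register_a2a method modeled on this repo's A2A example; the A2A README has the actual code:

# Sketch: registering a specialist agent for A2A discovery (assumed API).
import asyncio

from videosdk.agents import Agent, AgentCard  # assumed import path

class LoanAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a text-based loan specialist.")

async def main():
    loan_agent = LoanAgent()
    await loan_agent.register_a2a(AgentCard(
        id="loan_specialist",  # hypothetical values for illustration
        name="Loan Agent",
        domain="loan",
        capabilities=["loan_rates", "eligibility"],
        description="Answers loan-related queries in text form.",
    ))

asyncio.run(main())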

Example A2A Use Case:

When a user asks about loan rates, the Customer Service Agent (with audio capabilities) automatically forwards the query to the Loan Agent (text-based specialist), receives the expert response, and relays it back to the user - all within a single conversation flow.

Client → "I want to know about personal loan rates"
   ↓
Customer Service Agent → Discovers Loan Specialist Agent
   ↓
Customer Service Agent → Forwards loan query to Loan Specialist
   ↓
Loan Specialist → Processes query and responds back (text format)
   ↓
Customer Service Agent → Relays response to client (audio format)

Key A2A Features:

  • Multi-Modal Communication: Audio agents for user interaction, text agents for specialized processing
  • Domain Specialization: Customer service agents coordinate with loan specialists, tech support, financial advisors
  • Intelligent Query Routing: Automatic detection and forwarding of domain-specific queries
  • Real-Time Collaboration: Agents communicate seamlessly without user intervention

For detailed A2A implementation, see the A2A README.

Prerequisites

Before you begin, ensure you have:

  • Python 3.12 or higher
  • A VideoSDK authentication token (generate from app.videosdk.live)
  • A VideoSDK meeting ID (you can generate one using the Create Room API)
  • API key for your chosen LLM provider (OpenAI, Google Gemini LiveAPI, or AWS)
  • Client-side implementation with any VideoSDK SDK

πŸ› οΈ Installation

Quick Setup (Recommended)

For the fastest setup, install all dependencies at once using the provided requirements file:

# 1. Clone this repository
git clone https://github.com/videosdk-live/agents-quickstart

# 2. Navigate to the project directory
cd agents-quickstart

# 3. Create and activate a virtual environment with Python 3.12 or higher
# On macOS/Linux
python3.12 -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate

# 4. Install all dependencies from requirements.txt
pip install -r requirements.txt

Manual Installation

Alternatively, you can install packages individually:

  1. Clone this repository:
git clone https://github.com/videosdk-live/agents-quickstart
  2. Create and activate a virtual environment with Python 3.12 or higher:
# On macOS/Linux
python3.12 -m venv venv
source venv/bin/activate

# On Windows
python -m venv venv
venv\Scripts\activate
  3. Install the base package:
pip install videosdk-agents
  4. Navigate to the example of your choice and follow its README.

🔗 Model Context Protocol (MCP) Integration

All agent examples include Model Context Protocol (MCP) support for connecting to external data sources and tools:

  • Local MCP Servers: Use MCPServerStdio for development and testing
  • Remote MCP Services: Use MCPServerHTTP for production integrations
  • Multiple Servers: Connect to various data sources simultaneously

For detailed MCP integration examples, see the MCP Server README.

Environment Setup

It's recommended to use environment variables for secure storage of API keys and tokens. Create a .env file in your project root:

VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
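
Depending on the example you run, also add the key for your chosen provider. Typical variable names are shown below, but check each example's README for the exact ones it reads:

OPENAI_API_KEY=your_openai_api_key
GOOGLE_API_KEY=your_google_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key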

Generating a VideoSDK Meeting ID

Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:

Using cURL

curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json"

For more details on the Create Room API, refer to the VideoSDK documentation.

Connecting with VideoSDK Client Applications

After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting.

When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.

Playground Mode

All quickstart examples are configured to run in playground mode by default (playground=True). When you run an agent, a direct link to the VideoSDK Playground will be printed in your console. You can open this link in your browser to interact with your agent without needing a separate client application.

Agent started in playground mode
Interact with agent here at:
https://playground.videosdk.live?token=...&meetingId=...

πŸ“ Repository Structure

agents-quickstart/
│
├── A2A/                           # Featured: Complete A2A multi-agent system
│   ├── agents/
│   │   ├── customer_agent.py      # Voice-enabled customer service agent
│   │   ├── loan_agent.py          # Text-based loan specialist agent
│   │   └── README.md              # Detailed A2A implementation guide
│   ├── session_manager.py         # Session and pipeline management
│   ├── main.py                    # A2A system entry point
│   └── README.md                  # A2A overview and setup
│
├── Virtual Avatar/                # Simli virtual avatar integration examples
│   ├── simli_cascading_example.py # Cascading pipeline with Simli avatar
│   ├── simli_realtime_example.py  # Realtime pipeline with Simli avatar
│   └── README.md                  # Virtual avatar setup and configuration
│
├── OpenAI/                        # OpenAI-based agent examples
├── Google Gemini (LiveAPI)/       # Google Gemini LiveAPI examples
├── Cascading Pipeline/            # Example of a modular pipeline
├── AWS Nova Sonic/                # AWS Nova Sonic examples
├── MCP Server/                    # Model Context Protocol examples
├── requirements.txt               # All dependencies
└── README.md                      # This file

Learn More

For more information about VideoSDK AI Agents, see the official VideoSDK documentation.

🤝 Join our Discord community for support and discussions.

Made with ❤️ by the VideoSDK Team
