🧠 Real-Time Conversational AI Assistant

A modern, fully voice-activated conversational AI assistant web application with continuous listening, real-time VAD processing, and auto-fading responses. Built using WebRTC, open-source LLMs, and advanced audio processing.

✨ Key Features

🎤 Live Voice Interface (NEW!)

  • 🔥 Continuous Listening: Auto-starts on page load - no buttons required!
  • 🎯 Voice Activity Detection: Real-time VAD with webrtcvad processing
  • ⚡ Live Audio Streaming: WebSocket-based continuous audio pipeline
  • 🎙️ Smart Voice Processing: Noise suppression, auto-gain, echo cancellation
  • 📱 Button-Free Experience: Pure voice interaction - just speak naturally!

🤖 AI Response System

  • ✨ Auto-Generated Responses: Instant AI responses to voice input
  • 🗣️ Text-to-Speech: Automatic speech synthesis and playback
  • ⏰ 20-Second Auto-Fade: Responses fade away 20 seconds after TTS playback finishes
  • 🔄 Continuous Loop: Always ready for the next voice interaction

🎨 Modern Voice UI

  • 🌟 Live Audio Visualizer: Real-time voice activity display
  • 📊 VAD Status Indicators: Visual feedback for listening state
  • 🎭 Smooth Animations: Framer Motion powered transitions
  • 🌙 Voice-First Design: Minimal, distraction-free interface

🔧 Advanced Audio Processing

  • 🎯 Real-time VAD: webrtcvad with configurable sensitivity
  • 🔊 Audio Normalization: Automatic level adjustment
  • 🎚️ Noise Suppression: Advanced filtering algorithms
  • 📡 WebRTC Streaming: Low-latency audio transmission

🌐 Real-time Communication

  • ⚡ WebSocket Streaming: Continuous audio chunk processing
  • 🎯 VAD-Triggered Transcription: Process speech only when detected
  • 🔄 Auto-Reconnection: Robust connection management
  • 📊 Live Status Monitoring: Real-time connection health

🤖 AI Models & Processing

  • 🦙 Open-Source LLMs: Access to Llama, Mistral, CodeLlama via OpenRouter
  • 🎧 Speech Recognition: OpenAI Whisper integration
  • 🗣️ Text-to-Speech: Coqui TTS with multiple voices
  • 💬 Streaming Responses: Real-time text generation

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    LIVE VOICE INTERFACE                         │
├─────────────────────────────────────────────────────────────────┤
│  🎤 Auto-Start → 🎯 VAD → 📝 ASR → 🤖 LLM → 🗣️ TTS → ⏰ Fade    │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────┐    WebSocket     ┌───────────────────┐
│  React Frontend   │ ◄──────────────► │  FastAPI Backend  │
│                   │   (Live Audio)   │                   │
│ • Live Audio      │                  │ • VAD Service     │
│ • Voice Visualizer│                  │ • ASR Service     │
│ • Auto-Fade UI    │                  │ • LLM Service     │
│ • No Buttons!     │                  │ • TTS Service     │
└───────────────────┘                  │ • Streaming       │
                                       └───────────────────┘
                                                 │
                                                 ▼
                                       ┌───────────────────┐
                                       │   External APIs   │
                                       │                   │
                                       │ • OpenRouter      │
                                       │ • Whisper ASR     │
                                       │ • Coqui TTS       │
                                       │ • WebRTC VAD      │
                                       └───────────────────┘
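
The WebSocket leg of this pipeline can be wired up with FastAPI and webrtcvad roughly as follows. This is a minimal sketch for orientation only: the message framing (raw binary frames here), the buffering strategy, and the downstream service calls are assumptions, not the project's actual backend/app code.

# Illustrative sketch -- not the project's actual handler.
# Assumes each WebSocket message is one 30 ms frame of 16-bit mono PCM at 16 kHz.
import webrtcvad
from fastapi import FastAPI, WebSocket

app = FastAPI()
vad = webrtcvad.Vad(3)      # aggressiveness mode 0-3
SAMPLE_RATE = 16000

@app.websocket("/api/v1/streaming/ws")
async def stream_audio(ws: WebSocket):
    await ws.accept()
    speech_frames = []
    while True:
        frame = await ws.receive_bytes()
        if vad.is_speech(frame, SAMPLE_RATE):
            speech_frames.append(frame)            # buffer while the user is speaking
        elif speech_frames:
            utterance = b"".join(speech_frames)    # silence after speech: flush the buffer
            speech_frames.clear()
            # hand `utterance` to ASR -> LLM -> TTS here (project-specific services)
            await ws.send_json({"type": "utterance_received", "bytes": len(utterance)})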

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • OpenRouter API Key (for LLM and Whisper access)
  • Microphone permissions (for voice input)

1. Clone the Repository

git clone <repository-url>
cd AI-Convo-Webapp

2. Environment Setup

Copy the example environment file and configure your settings:

cp env.example .env

Edit .env with your API keys:

# Required: OpenRouter API Key
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Security
SECRET_KEY=your_secret_key_here_make_it_long_and_random

# Voice Activity Detection
VAD_MODE=3                    # VAD aggressiveness (0-3; higher filters non-speech more aggressively)
SAMPLE_RATE=16000            # Audio sample rate for VAD
CHUNK_DURATION_MS=30         # VAD processing chunk size

# Audio Processing
ENABLE_NOISE_SUPPRESSION=true
ENABLE_AUTO_GAIN=true
ENABLE_ECHO_CANCELLATION=true

# TTS Configuration
TTS_MODEL=tts_models/en/ljspeech/tacotron2-DDC
TTS_DEFAULT_VOICE=ljspeech
RESPONSE_FADE_DELAY=20       # Seconds to fade response after TTS

3. Start with Docker Compose

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

4. Access the Voice Interface

Once the containers are up, open http://localhost:3000 in your browser and grant microphone access; the FastAPI backend listens on http://localhost:8000.

🎤 How to Use the Voice Interface

Getting Started

  1. Open the app → http://localhost:3000
  2. Grant microphone permissions when prompted
  3. Start speaking → The app automatically starts listening!

Voice Interaction Flow

🎤 Speak → 🎯 VAD Detects → 📝 Transcribes → 🤖 AI Responds → 🗣️ TTS Plays → ⏰ Fades (20s)

Visual Indicators

  • 🟢 Green Pulse: Listening and ready
  • 🔴 Red Ripple: Voice activity detected
  • 🟡 Yellow: Processing your speech
  • 🔵 Blue: AI generating response

Tips for Best Experience

  • Speak clearly and at normal volume
  • Wait for the pulse to show it's listening
  • Pause briefly between sentences
  • Stay within about 3 feet (1 m) of the microphone for reliable VAD detection

πŸ› οΈ Development Setup

Backend Development

cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run development server with VAD
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend Development

cd frontend

# Install dependencies
npm install

# Start development server (with live audio)
npm start

# Build for production
npm run build

📚 API Documentation

Live Audio Endpoints

WebSocket Streaming

  • WS /api/v1/streaming/ws - Live audio streaming endpoint
  • Message Types:
    • audio_chunk - Raw audio data for VAD processing
    • transcription_request - Request speech transcription
    • chat_request - Send message to AI
    • tts_request - Generate speech audio
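
As a rough illustration, a client could exercise the streaming endpoint like this. The payload layout (a JSON object with type and base64-encoded data fields) is an assumption about the wire format; the frontend's WebSocket code is the authoritative reference.

# Hypothetical client; the payload layout is an assumption, not the documented schema.
import asyncio, base64, json
import websockets  # pip install websockets

async def send_chunk(pcm_frame: bytes):
    async with websockets.connect("ws://localhost:8000/api/v1/streaming/ws") as ws:
        await ws.send(json.dumps({
            "type": "audio_chunk",                                # message type from the list above
            "data": base64.b64encode(pcm_frame).decode("ascii"),  # raw PCM, base64-encoded
        }))
        print(await ws.recv())                                    # whatever the server replies

asyncio.run(send_chunk(b"\x00\x00" * 480))  # 30 ms of silence at 16 kHz, 16-bit mono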

VAD & Audio Processing

  • POST /api/v1/audio/process - Process audio with VAD
  • POST /api/v1/audio/normalize - Normalize audio levels
  • GET /api/v1/audio/vad-status - Check VAD configuration

Traditional Endpoints

ASR (Speech-to-Text)

  • POST /api/v1/asr/transcribe - Transcribe audio file
  • POST /api/v1/asr/transcribe-streaming - Stream transcription

Chat (LLM)

  • POST /api/v1/chat/generate - Generate AI response
  • POST /api/v1/chat/stream - Streaming response

TTS (Text-to-Speech)

  • POST /api/v1/tts/synthesize - Synthesize speech
  • GET /api/v1/tts/voices - Available voices
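
For a quick smoke test of these endpoints from Python, something along these lines should work; the request and response field names (message, response) are guesses, so verify them against the backend's actual request/response models.

# Hypothetical smoke test; request/response field names are assumptions.
import requests

BASE = "http://localhost:8000/api/v1"

resp = requests.post(f"{BASE}/chat/generate", json={"message": "Hello, assistant!"})
resp.raise_for_status()
print(resp.json())            # e.g. {"response": "..."} -- verify the real schema

voices = requests.get(f"{BASE}/tts/voices")
print(voices.json())          # list of available TTS voices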

πŸŽ›οΈ Configuration

Voice Activity Detection Settings

Variable               Description                        Default
VAD_MODE               VAD aggressiveness (0-3)           3
SAMPLE_RATE            Audio sample rate (Hz)             16000
CHUNK_DURATION_MS      VAD processing chunk size (ms)     30
SILENCE_THRESHOLD_MS   Silence detection threshold (ms)   500
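
These settings are coupled: webrtcvad only accepts 10, 20, or 30 ms frames of 16-bit mono PCM at 8, 16, 32, or 48 kHz, so the byte size of each chunk follows directly from SAMPLE_RATE and CHUNK_DURATION_MS. With the defaults above:

# Bytes per VAD frame = samples per chunk x 2 bytes (16-bit mono)
sample_rate = 16000
chunk_ms = 30
frame_bytes = sample_rate * chunk_ms // 1000 * 2
print(frame_bytes)  # 960 bytes per 30 ms chunk at 16 kHz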

Audio Processing Settings

Variable                     Description                Default
ENABLE_NOISE_SUPPRESSION     Enable noise filtering     true
ENABLE_AUTO_GAIN             Enable auto-gain control   true
ENABLE_ECHO_CANCELLATION     Enable echo cancellation   true
AUDIO_NORMALIZATION_TARGET   Target audio level         5000

Response Behavior

Variable                 Description                 Default
RESPONSE_FADE_DELAY      Seconds to fade after TTS   20
TYPING_ANIMATION_SPEED   Typing effect speed (ms)    50
AUTO_PLAY_TTS            Auto-play TTS responses     true

🔧 Customization

Adjusting VAD Sensitivity

# In backend/app/config.py
VAD_MODE = 3  # 0 = least aggressive, 3 = most aggressive at filtering out non-speech
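
In webrtcvad terms this value is the aggressiveness mode: mode 0 lets the most audio through as speech, mode 3 filters out the most non-speech. A minimal, self-contained check (the frame length and sample rate must be one of webrtcvad's supported combinations):

# Minimal webrtcvad check of a single frame.
import webrtcvad

vad = webrtcvad.Vad(3)              # aggressiveness mode 0-3
frame = b"\x00\x00" * 480           # 30 ms of 16-bit mono PCM at 16 kHz (960 bytes)
print(vad.is_speech(frame, 16000))  # whether this frame is classified as speech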

Changing Response Fade Time

// In frontend/src/components/ResponseDisplay.tsx
const FADE_DELAY = 20000; // 20 seconds (change as needed)

Custom Audio Processing

# In backend/app/services/audio_service.py
class AudioService:
    def process_audio_chunk(self, audio_data: bytes):
        # Add your custom VAD logic here
        pass

🚀 Deployment

Production Configuration

# Set production environment variables
export ENVIRONMENT=production
export VAD_MODE=3
export ENABLE_NOISE_SUPPRESSION=true
export RESPONSE_FADE_DELAY=20

# Deploy with Docker
docker-compose --profile production up -d

Performance Optimization

  • VAD Processing: Adjust CHUNK_DURATION_MS for latency vs accuracy
  • Audio Quality: Set SAMPLE_RATE to 16000 for optimal VAD performance
  • Memory Usage: Configure audio buffer sizes based on usage patterns

🧪 Testing the Voice Interface

Voice Testing Commands

# Test VAD sensitivity
curl -X POST http://localhost:8000/api/v1/audio/test-vad \
  -H "Content-Type: application/json" \
  -d '{"sensitivity": 3}'

# Test audio processing pipeline
curl -X POST http://localhost:8000/api/v1/audio/test-pipeline \
  -F "audio=@test_audio.wav"

Browser Testing

  1. Open Developer Tools → Console
  2. Check microphone permissions
  3. Test WebSocket connection
  4. Verify VAD detection

🤝 Contributing

Voice Interface Development

  1. Frontend: React components for live audio visualization
  2. Backend: VAD and streaming audio processing
  3. WebSocket: Real-time audio chunk handling
  4. Audio Processing: VAD algorithms and noise suppression

Testing Voice Features

  • Test with different microphone setups
  • Verify VAD sensitivity in various noise conditions
  • Test auto-fade timing and TTS integration
  • Validate WebSocket connection stability

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • WebRTC VAD for real-time voice activity detection
  • OpenRouter for providing access to open-source LLMs
  • OpenAI for Whisper speech recognition
  • Coqui AI for open-source Text-to-Speech
  • Material-UI for the beautiful component library
  • FastAPI for the high-performance backend framework

📞 Support

For voice interface issues:

  • Audio Problems: Check microphone permissions and VAD settings
  • Connection Issues: Verify WebSocket connectivity
  • Performance: Adjust VAD sensitivity and audio processing settings
  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Documentation: Wiki


🎤 Voice-First AI Assistant - Just Speak and Listen! ❤️
