A modern, fully voice-activated conversational AI assistant web application with continuous listening, real-time VAD processing, and auto-fading responses. Built using WebRTC, open-source LLMs, and advanced audio processing.
- Continuous Listening: Auto-starts on page load - no buttons required!
- Voice Activity Detection: Real-time VAD with `webrtcvad` processing
- Live Audio Streaming: WebSocket-based continuous audio pipeline
- Smart Voice Processing: Noise suppression, auto-gain, echo cancellation
- Button-Free Experience: Pure voice interaction - just speak naturally!
- Auto-Generated Responses: Instant AI responses to voice input
- Text-to-Speech: Automatic speech synthesis and playback
- 20-Second Auto-Fade: Responses fade away 20 seconds after TTS playback
- Continuous Loop: Always ready for the next voice interaction
- Live Audio Visualizer: Real-time voice activity display
- VAD Status Indicators: Visual feedback for listening state
- Smooth Animations: Framer Motion powered transitions
- Voice-First Design: Minimal, distraction-free interface
- Real-time VAD: `webrtcvad` with configurable sensitivity
- Audio Normalization: Automatic level adjustment
- Noise Suppression: Advanced filtering algorithms
- WebRTC Streaming: Low-latency audio transmission
- WebSocket Streaming: Continuous audio chunk processing
- VAD-Triggered Transcription: Process speech only when detected
- Auto-Reconnection: Robust connection management
- Live Status Monitoring: Real-time connection health
- Open-Source LLMs: Access to Llama, Mistral, CodeLlama via OpenRouter
- Speech Recognition: OpenAI Whisper integration
- Text-to-Speech: Coqui TTS with multiple voices
- Streaming Responses: Real-time text generation
```
┌────────────────────────────────────────────────────┐
│                LIVE VOICE INTERFACE                │
├────────────────────────────────────────────────────┤
│  Auto-Start → VAD → ASR → LLM → TTS → Fade         │
└────────────────────────────────────────────────────┘
                           │
                           ▼
┌───────────────────┐  WebSocket   ┌───────────────────┐
│  React Frontend   │ ◄──────────► │  FastAPI Backend  │
│                   │ (Live Audio) │                   │
│ • Live Audio      │              │ • VAD Service     │
│ • Voice Visualizer│              │ • ASR Service     │
│ • Auto-Fade UI    │              │ • LLM Service     │
│ • No Buttons!     │              │ • TTS Service     │
└───────────────────┘              │ • Streaming       │
                                   └───────────────────┘
                                             │
                                             ▼
                                   ┌───────────────────┐
                                   │   External APIs   │
                                   │                   │
                                   │ • OpenRouter      │
                                   │ • Whisper ASR     │
                                   │ • Coqui TTS       │
                                   │ • WebRTC VAD      │
                                   └───────────────────┘
```
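To make the WebSocket leg concrete, here is a minimal sketch of what the backend's streaming endpoint could look like. It assumes FastAPI and the `WS /api/v1/streaming/ws` route listed in the API section below; the message handling and reply types are illustrative, not the shipped implementation.

```python
# Minimal sketch of a FastAPI WebSocket endpoint for live audio streaming.
# The reply message types ("vad_status", "chat_response") are illustrative.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/api/v1/streaming/ws")
async def streaming_ws(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_json()
            if message.get("type") == "audio_chunk":
                # Hand the raw audio to the VAD/ASR pipeline (placeholder).
                await websocket.send_json({"type": "vad_status", "speech": False})
            elif message.get("type") == "chat_request":
                # Forward the transcribed text to the LLM service (placeholder).
                await websocket.send_json({"type": "chat_response", "text": "..."})
    except WebSocketDisconnect:
        pass
```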
- Docker & Docker Compose
- OpenRouter API Key (for LLM and Whisper access)
- Microphone permissions (for voice input)
```bash
git clone <repository-url>
cd AI-Convo-Webapp
```

Copy the example environment file and configure your settings:

```bash
cp env.example .env
```

Edit `.env` with your API keys:
```env
# Required: OpenRouter API Key
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Security
SECRET_KEY=your_secret_key_here_make_it_long_and_random

# Voice Activity Detection
VAD_MODE=3              # VAD aggressiveness (0-3, higher = more aggressive filtering)
SAMPLE_RATE=16000       # Audio sample rate for VAD (Hz)
CHUNK_DURATION_MS=30    # VAD processing chunk size (ms)

# Audio Processing
ENABLE_NOISE_SUPPRESSION=true
ENABLE_AUTO_GAIN=true
ENABLE_ECHO_CANCELLATION=true

# TTS Configuration
TTS_MODEL=tts_models/en/ljspeech/tacotron2-DDC
TTS_DEFAULT_VOICE=ljspeech
RESPONSE_FADE_DELAY=20  # Seconds to fade response after TTS
```
```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

The app is then available at:

- Voice App: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Open the app at http://localhost:3000
- Grant microphone permissions when prompted
- Start speaking → the app automatically starts listening!

Speak → VAD Detects → Transcribes → AI Responds → TTS Plays → Fades (20s)
- Green Pulse: Listening and ready
- Red Ripple: Voice activity detected
- Yellow: Processing your speech
- Blue: AI generating response
- Speak clearly and at normal volume
- Wait for the pulse to show it's listening
- Pause briefly between sentences
- Stay within 3 feet of the microphone for the best VAD detection
```bash
cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run development server with VAD
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
```bash
cd frontend

# Install dependencies
npm install

# Start development server (with live audio)
npm start

# Build for production
npm run build
```

`WS /api/v1/streaming/ws` - Live audio streaming endpoint

Message Types:
- `audio_chunk` - Raw audio data for VAD processing
- `transcription_request` - Request speech transcription
- `chat_request` - Send message to AI
- `tts_request` - Generate speech audio
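As a rough illustration of the client side, the snippet below streams one audio chunk over the socket. It assumes the `websockets` Python package and a JSON envelope with a `type` field plus a base64-encoded `data` field; the payload layout is an assumption, not the documented schema.

```python
# Illustrative client for WS /api/v1/streaming/ws (not the project's actual client).
# The base64 "data" field is an assumption about the payload format.
import asyncio
import base64
import json
import websockets  # pip install websockets

async def stream_audio(chunks):
    """Send raw PCM chunks to the streaming endpoint and print one reply."""
    uri = "ws://localhost:8000/api/v1/streaming/ws"
    async with websockets.connect(uri) as ws:
        for chunk in chunks:  # each chunk: 30 ms of 16-bit mono PCM at 16 kHz
            await ws.send(json.dumps({
                "type": "audio_chunk",
                "data": base64.b64encode(chunk).decode("ascii"),
            }))
        print(await ws.recv())

if __name__ == "__main__":
    silence = b"\x00" * (16000 * 2 * 30 // 1000)  # one 30 ms silent frame
    asyncio.run(stream_audio([silence]))
```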
- `POST /api/v1/audio/process` - Process audio with VAD
- `POST /api/v1/audio/normalize` - Normalize audio levels
- `GET /api/v1/audio/vad-status` - Check VAD configuration
- `POST /api/v1/asr/transcribe` - Transcribe audio file
- `POST /api/v1/asr/transcribe-streaming` - Stream transcription
- `POST /api/v1/chat/generate` - Generate AI response
- `POST /api/v1/chat/stream` - Streaming response
- `POST /api/v1/tts/synthesize` - Synthesize speech
- `GET /api/v1/tts/voices` - Available voices
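For example, the chat endpoint can be exercised with a short script like the one below. The `message` field name is an assumption; check http://localhost:8000/docs for the actual request schema.

```python
# Illustrative call to the chat endpoint; the "message" field name is an
# assumption - see http://localhost:8000/docs for the real request schema.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/chat/generate",
    json={"message": "What's the weather like on Mars?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```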
Voice activity detection:

| Variable | Description | Default |
|---|---|---|
| `VAD_MODE` | VAD aggressiveness (0-3) | 3 |
| `SAMPLE_RATE` | Audio sample rate (Hz) | 16000 |
| `CHUNK_DURATION_MS` | VAD processing chunk size (ms) | 30 |
| `SILENCE_THRESHOLD_MS` | Silence detection threshold (ms) | 500 |
Audio processing:

| Variable | Description | Default |
|---|---|---|
| `ENABLE_NOISE_SUPPRESSION` | Enable noise filtering | true |
| `ENABLE_AUTO_GAIN` | Enable auto-gain control | true |
| `ENABLE_ECHO_CANCELLATION` | Enable echo cancellation | true |
| `AUDIO_NORMALIZATION_TARGET` | Target audio level | 5000 |
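If you need to reproduce the normalization outside the backend, a minimal sketch is shown below. It assumes `AUDIO_NORMALIZATION_TARGET` is a peak sample amplitude for 16-bit mono PCM, which is an interpretation rather than a documented fact.

```python
# Sketch of peak normalization for 16-bit mono PCM, assuming
# AUDIO_NORMALIZATION_TARGET is a target peak amplitude (an interpretation).
import array

def normalize(pcm: bytes, target: int = 5000) -> bytes:
    samples = array.array("h", pcm)              # signed 16-bit samples
    if not samples:
        return pcm
    peak = max(abs(s) for s in samples) or 1     # avoid dividing by zero
    scale = target / peak
    scaled = (max(-32768, min(32767, int(s * scale))) for s in samples)
    return array.array("h", scaled).tobytes()
```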
Response behavior:

| Variable | Description | Default |
|---|---|---|
| `RESPONSE_FADE_DELAY` | Seconds to fade after TTS | 20 |
| `TYPING_ANIMATION_SPEED` | Typing effect speed (ms) | 50 |
| `AUTO_PLAY_TTS` | Auto-play TTS responses | true |
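These variables are typically read from the environment at startup; a minimal sketch of that pattern is below, using `os.getenv` with the defaults from the tables. The actual `backend/app/config.py` may load them differently.

```python
# Sketch of reading the settings above from the environment; defaults mirror
# the tables, but the real backend/app/config.py may load them differently.
import os

VAD_MODE = int(os.getenv("VAD_MODE", "3"))
SAMPLE_RATE = int(os.getenv("SAMPLE_RATE", "16000"))
CHUNK_DURATION_MS = int(os.getenv("CHUNK_DURATION_MS", "30"))
SILENCE_THRESHOLD_MS = int(os.getenv("SILENCE_THRESHOLD_MS", "500"))
ENABLE_NOISE_SUPPRESSION = os.getenv("ENABLE_NOISE_SUPPRESSION", "true").lower() == "true"
RESPONSE_FADE_DELAY = int(os.getenv("RESPONSE_FADE_DELAY", "20"))
```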
Adjust the VAD mode:

```python
# In backend/app/config.py
VAD_MODE = 3  # 0 = least aggressive, 3 = most aggressive filtering
```

Change the response fade delay:

```typescript
// In frontend/src/components/ResponseDisplay.tsx
const FADE_DELAY = 20000; // 20 seconds (change as needed)
```

Hook in custom audio processing:

```python
# In backend/app/services/audio_service.py
class AudioService:
    def process_audio_chunk(self, audio_data: bytes):
        # Add your custom VAD logic here
        pass
```
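As one possibility, a custom handler could delegate frame classification to `webrtcvad`. The sketch below assumes 16 kHz, 16-bit mono PCM in 30 ms frames (the defaults above) and is illustrative rather than the shipped implementation:

```python
# Illustrative custom handler using webrtcvad (pip install webrtcvad).
# Assumes 16 kHz, 16-bit mono PCM and 30 ms frames, matching the defaults above.
import webrtcvad

class CustomAudioService:
    def __init__(self, mode: int = 3, sample_rate: int = 16000, frame_ms: int = 30):
        self.vad = webrtcvad.Vad(mode)                           # 0 = least, 3 = most aggressive
        self.sample_rate = sample_rate
        self.frame_bytes = sample_rate * frame_ms // 1000 * 2    # bytes per 16-bit frame

    def process_audio_chunk(self, audio_data: bytes) -> bool:
        """Return True if any frame in the chunk contains speech."""
        for start in range(0, len(audio_data) - self.frame_bytes + 1, self.frame_bytes):
            frame = audio_data[start:start + self.frame_bytes]
            if self.vad.is_speech(frame, self.sample_rate):
                return True
        return False
```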
For a production deployment:

```bash
# Set production environment variables
export ENVIRONMENT=production
export VAD_MODE=3
export ENABLE_NOISE_SUPPRESSION=true
export RESPONSE_FADE_DELAY=20

# Deploy with Docker
docker-compose --profile production up -d
```

Performance tuning:

- VAD Processing: Adjust `CHUNK_DURATION_MS` to trade latency against accuracy (see the sketch after this list)
- Audio Quality: Set `SAMPLE_RATE` to 16000 for optimal VAD performance
- Memory Usage: Configure audio buffer sizes based on usage patterns
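As a rough guide to that trade-off, `webrtcvad` only accepts 10, 20, or 30 ms frames, and each frame's byte size follows directly from the sample rate; a quick calculation using the defaults above:

```python
# Frame-size math for the VAD settings above (webrtcvad accepts 10/20/30 ms frames).
SAMPLE_RATE = 16000      # samples per second
CHUNK_DURATION_MS = 30   # per-frame duration handed to the VAD
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM

frame_samples = SAMPLE_RATE * CHUNK_DURATION_MS // 1000   # 480 samples
frame_bytes = frame_samples * BYTES_PER_SAMPLE            # 960 bytes per frame

# Smaller frames (10 ms) cut per-frame latency but give the VAD less context;
# 30 ms is the largest frame webrtcvad supports and tends to be more stable.
print(frame_samples, frame_bytes)  # 480 960
```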
```bash
# Test VAD sensitivity
curl -X POST http://localhost:8000/api/v1/audio/test-vad \
  -H "Content-Type: application/json" \
  -d '{"sensitivity": 3}'

# Test audio processing pipeline
curl -X POST http://localhost:8000/api/v1/audio/test-pipeline \
  -F "audio=@test_audio.wav"
```

In the browser:

- Open Developer Tools → Console
- Check microphone permissions
- Test WebSocket connection
- Verify VAD detection
- Frontend: React components for live audio visualization
- Backend: VAD and streaming audio processing
- WebSocket: Real-time audio chunk handling
- Audio Processing: VAD algorithms and noise suppression
- Test with different microphone setups
- Verify VAD sensitivity in various noise conditions
- Test auto-fade timing and TTS integration
- Validate WebSocket connection stability
This project is licensed under the MIT License - see the LICENSE file for details.
- WebRTC VAD for real-time voice activity detection
- OpenRouter for providing access to open-source LLMs
- OpenAI for Whisper speech recognition
- Coqui AI for open-source Text-to-Speech
- Material-UI for the beautiful component library
- FastAPI for the high-performance backend framework
For voice interface issues:
- Audio Problems: Check microphone permissions and VAD settings
- Connection Issues: Verify WebSocket connectivity
- Performance: Adjust VAD sensitivity and audio processing settings
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki
Voice-First AI Assistant - Just Speak and Listen!