This project implements a fully embedded voice assistant on ESP32 hardware. It uses Deepgram for speech-to-text (STT) and text-to-speech (TTS), and the Gemini API for language model responses. The system is modular and designed for real-time audio processing with minimal latency. The software architecture consists of the following major components:
- AudioRecorder: Captures audio using ADC via ISR
- STT: Streams audio to Deepgram using WebSocket
- GeminiChat: Manages structured queries and conversation history
- TTS: Converts text responses to audio via Deepgram
- Speaker: Mixes and outputs audio from multiple sources
- WiFi Configuration: Captive portal for dynamic WiFi setup
All components are independent and can be used separately or together.
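As a rough illustration of that modularity, the class interfaces might look like the sketch below. All class names, methods, and signatures here are assumptions for illustration only, not the project's actual API.

```cpp
// Illustrative interface sketch only -- names and signatures are assumptions.
#include <Arduino.h>
#include <functional>

class Speaker;                    // forward declaration for TTS::speak

class AudioRecorder {             // ISR-driven ADC capture on GPIO32
public:
  void begin(int adcPin, uint32_t sampleRate);
  size_t read(int16_t *dst, size_t maxSamples);    // drain the capture buffer
};

class STT {                       // Deepgram WebSocket streaming
public:
  void begin(const char *apiKey);
  void sendAudio(const int16_t *samples, size_t count);
  void onTranscript(std::function<void(const String &)> cb);
  void loop();                    // service the WebSocket connection
};

class GeminiChat {                // Gemini API with bounded history
public:
  explicit GeminiChat(size_t maxHistory = 50);
  String ask(const String &userText);
};

class TTS {                       // Deepgram text-to-speech
public:
  void begin(const char *apiKey);
  void speak(const String &text, Speaker &out);
};

class Speaker {                   // DAC output + stream mixing
public:
  void begin(int dacPin, uint32_t sampleRate);
  void play(const int16_t *samples, size_t count, uint32_t srcRate);
};
```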
The system supports continuous streaming. Audio is captured with a timer interrupt (ISR) rather than continuous ADC with DMA, which prevents data loss during WiFi activity.

Data flow: User microphone → AudioRecorder (GPIO32 ADC) → STT (Deepgram WebSocket) → GeminiChat (LLM) → TTS (Deepgram POST) → Speaker (GPIO35 DAC via WiFiClient)
AudioRecorder:
- Captures audio from the ADC (GPIO32) using a timer interrupt (ISR)
- Avoids continuous ADC/DMA mode due to ESP32 hardware limitations:
  - DMA capture stops when WiFi is active
  - Power draw from WiFi causes voltage fluctuations and audio loss
- Buffers audio for real-time streaming
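A minimal sketch of this ISR-driven capture, assuming a 16 kHz sample rate, the Arduino-ESP32 2.x hardware-timer API (`timerBegin`/`timerAlarmWrite`), and a simple ring buffer; the buffer size and function names are illustrative, and whether a raw ADC read is safe inside an ISR can depend on the core version:

```cpp
#include <Arduino.h>
#include <driver/adc.h>

static const uint32_t SAMPLE_RATE = 16000;           // assumed sample rate
static const size_t   RING_SIZE   = 4096;            // power of two for cheap masking

static volatile uint16_t ring[RING_SIZE];
static volatile size_t   head = 0, tail = 0;
static hw_timer_t *timer = nullptr;

// Runs SAMPLE_RATE times per second; it does not depend on continuous
// ADC/DMA mode, so it keeps collecting samples while WiFi is busy.
void IRAM_ATTR onSample() {
  uint16_t s = adc1_get_raw(ADC1_CHANNEL_4);          // GPIO32 = ADC1 channel 4
  size_t next = (head + 1) & (RING_SIZE - 1);
  if (next != tail) {                                 // drop the sample if the buffer is full
    ring[head] = s;
    head = next;
  }
}

void startRecorder() {
  adc1_config_width(ADC_WIDTH_BIT_12);
  adc1_config_channel_atten(ADC1_CHANNEL_4, ADC_ATTEN_DB_11);
  timer = timerBegin(0, 80, true);                    // 80 MHz / 80 = 1 MHz tick
  timerAttachInterrupt(timer, &onSample, true);
  timerAlarmWrite(timer, 1000000 / SAMPLE_RATE, true);
  timerAlarmEnable(timer);
}

// Called from loop(): copy whatever has accumulated since the last call.
size_t readSamples(uint16_t *dst, size_t maxCount) {
  size_t n = 0;
  while (tail != head && n < maxCount) {
    dst[n++] = ring[tail];
    tail = (tail + 1) & (RING_SIZE - 1);
  }
  return n;
}
```

Keeping the ISR short and draining the ring buffer from the main loop is what lets capture continue uninterrupted between WiFi operations.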
STT (Speech-to-Text):
- Streams audio data to Deepgram over a WebSocket
- Chosen over HTTP POST to reduce latency:
  - No need to wait for a complete recording file
  - Real-time transcription
- Processes audio in chunks directly from AudioRecorder
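A sketch of what the WebSocket streaming could look like, assuming the links2004/arduinoWebSockets library (`WebSocketsClient`) and Deepgram's `/v1/listen` live endpoint; the query parameters, header handling, and function names are assumptions, and the JSON transcript parsing is left out:

```cpp
#include <Arduino.h>
#include <WiFi.h>
#include <WebSocketsClient.h>   // links2004/arduinoWebSockets (assumed library)

static WebSocketsClient ws;

// Prints whatever Deepgram sends back; real code would parse the JSON and
// extract channel.alternatives[0].transcript.
void onWsEvent(WStype_t type, uint8_t *payload, size_t length) {
  if (type == WStype_TEXT) {
    Serial.write(payload, length);
    Serial.println();
  }
}

void startSTT(const char *deepgramKey) {
  // Endpoint and query parameters are assumptions based on Deepgram's
  // live-streaming documentation.
  static String auth;
  auth = String("Authorization: Token ") + deepgramKey;
  ws.setExtraHeaders(auth.c_str());
  ws.beginSSL("api.deepgram.com", 443,
              "/v1/listen?encoding=linear16&sample_rate=16000&channels=1");
  ws.onEvent(onWsEvent);
}

// Call from loop(): push the latest chunk from the AudioRecorder and keep
// the socket serviced. No intermediate file is ever written.
void streamChunk(uint8_t *pcm, size_t bytes) {
  if (bytes > 0) {
    ws.sendBIN(pcm, bytes);
  }
  ws.loop();
}
```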
GeminiChat:
- Sends structured requests to the Gemini API
- Stores conversation history; the history size is configurable (default 50 messages)
- Supports multi-turn conversations and structured context management
- Returns AI responses in a structured format suitable for TTS
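A hedged sketch of a Gemini request with bounded history, assuming ArduinoJson 6 and Gemini's public `generateContent` REST endpoint; the model name (`gemini-1.5-flash`), buffer sizes, and helper names are assumptions, not the project's actual GeminiChat code:

```cpp
#include <Arduino.h>
#include <vector>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>

struct Turn { String role; String text; };            // "user" or "model"

static std::vector<Turn> history;
static const size_t MAX_HISTORY = 50;                 // default from this README

// Sends the whole (bounded) conversation to Gemini and returns the reply text.
String askGemini(const String &userText, const char *apiKey) {
  history.push_back({"user", userText});
  while (history.size() > MAX_HISTORY) history.erase(history.begin());

  DynamicJsonDocument req(16384);
  JsonArray contents = req.createNestedArray("contents");
  for (const Turn &t : history) {
    JsonObject msg = contents.createNestedObject();
    msg["role"] = t.role;
    JsonArray parts = msg.createNestedArray("parts");
    parts.createNestedObject()["text"] = t.text;
  }
  String body;
  serializeJson(req, body);

  WiFiClientSecure client;
  client.setInsecure();                               // sketch only: skip cert checks
  HTTPClient http;
  http.begin(client, String("https://generativelanguage.googleapis.com/v1beta/models/"
                            "gemini-1.5-flash:generateContent?key=") + apiKey);
  http.addHeader("Content-Type", "application/json");
  int code = http.POST(body);

  String reply;
  if (code == 200) {
    DynamicJsonDocument res(16384);
    deserializeJson(res, http.getString());
    reply = res["candidates"][0]["content"]["parts"][0]["text"].as<String>();
    history.push_back({"model", reply});              // keep the model turn in history
  }
  http.end();
  return reply;
}
```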
TTS (Text-to-Speech):
- Sends text responses to Deepgram TTS
- Streams audio chunks to Speaker without saving intermediate files
- Works in real time for continuous interaction
- Can operate independently of GeminiChat for static TTS generation
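A sketch of streaming TTS audio without intermediate files, assuming Deepgram's Speak endpoint (`/v1/speak`) and the ESP32 `HTTPClient` stream API; the model name, query parameters, and chunk size are assumptions, certificate validation is skipped, and the text is assumed not to need JSON escaping:

```cpp
#include <Arduino.h>
#include <WiFi.h>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>

// Streams Deepgram TTS audio to the speaker chunk by chunk, never writing a file.
void speakText(const String &text, const char *deepgramKey) {
  WiFiClientSecure client;
  client.setInsecure();                       // sketch only: skip cert checks

  HTTPClient http;
  http.begin(client, "https://api.deepgram.com/v1/speak"
                     "?model=aura-asteria-en&encoding=linear16&sample_rate=16000");
  http.addHeader("Authorization", String("Token ") + deepgramKey);
  http.addHeader("Content-Type", "application/json");

  int code = http.POST(String("{\"text\":\"") + text + "\"}");
  if (code == 200) {
    int remaining = http.getSize();           // -1 if the server uses chunked encoding
    WiFiClient *stream = http.getStreamPtr();
    uint8_t chunk[1024];
    while (http.connected() && remaining != 0) {
      size_t avail = stream->available();
      if (avail) {
        size_t toRead = avail < sizeof(chunk) ? avail : sizeof(chunk);
        int n = stream->read(chunk, toRead);
        if (n > 0) {
          // hand the PCM bytes to the Speaker here, e.g. speaker.play(...)
          if (remaining > 0) remaining -= n;
        }
      }
      delay(1);                               // let WiFi/RTOS tasks run
    }
  }
  http.end();
}
```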
Speaker:
- Outputs audio to the GPIO35 DAC/I2S
- Supports multiple simultaneous audio instances
- Merges audio streams with different sample rates into a single output
- Designed for real-time audio mixing
- Can operate independently for any audio playback
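This README does not show how the mixing is implemented; the sketch below illustrates one common approach, summing 16-bit samples into a wider accumulator and clipping, once all sources have been brought to the same sample rate:

```cpp
#include <stdint.h>
#include <stddef.h>

// Mixes several 16-bit PCM buffers (already at the same sample rate) into one
// output buffer with saturation, so simultaneous sources do not wrap around
// and distort. Resampling of mismatched rates is a separate step.
void mixStreams(const int16_t *const *sources, size_t numSources,
                size_t numSamples, int16_t *out) {
  for (size_t i = 0; i < numSamples; ++i) {
    int32_t acc = 0;
    for (size_t s = 0; s < numSources; ++s) {
      acc += sources[s][i];
    }
    if (acc > INT16_MAX) acc = INT16_MAX;     // clip instead of overflowing
    if (acc < INT16_MIN) acc = INT16_MIN;
    out[i] = static_cast<int16_t>(acc);
  }
}
```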
WiFi Configuration:
- Implements a captive portal using the ESP32 WebServer and a DNS redirect
- On first boot or after a failed WiFi connection:
  - The ESP32 starts an access point
  - The user connects from a phone or computer
  - The WiFi SSID and password are entered and stored
- The ESP32 reconnects automatically on the next boot
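A compact captive-portal sketch using the stock `WebServer`, `DNSServer`, and `Preferences` libraries; the AP name, form markup, and NVS keys are assumptions and may differ from the project's implementation:

```cpp
#include <WiFi.h>
#include <WebServer.h>
#include <DNSServer.h>
#include <Preferences.h>

static WebServer   server(80);
static DNSServer   dns;
static Preferences prefs;

// Serve a tiny form; on submit, persist credentials to NVS and reboot.
void startPortal() {
  WiFi.softAP("ESP32-VoiceAssistant");        // AP name is an assumption
  dns.start(53, "*", WiFi.softAPIP());        // answer every DNS query with our IP

  server.on("/", []() {
    server.send(200, "text/html",
      "<form action='/save' method='POST'>"
      "SSID: <input name='ssid'><br>"
      "Password: <input name='pass' type='password'><br>"
      "<input type='submit' value='Save'></form>");
  });
  server.on("/save", HTTP_POST, []() {
    prefs.begin("wifi", false);
    prefs.putString("ssid", server.arg("ssid"));
    prefs.putString("pass", server.arg("pass"));
    prefs.end();
    server.send(200, "text/plain", "Saved. Rebooting...");
    delay(500);
    ESP.restart();                            // reconnect with the stored credentials
  });
  server.onNotFound([]() {                    // captive-portal catch-all redirect
    server.sendHeader("Location", String("http://") + WiFi.softAPIP().toString());
    server.send(302, "text/plain", "");
  });
  server.begin();
}

void servicePortal() {                        // call from loop()
  dns.processNextRequest();
  server.handleClient();
}
```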
Hardware Requirements:
- ESP32 microcontroller
- Microphone connected to GPIO32 (ADC input)
- Speaker or DAC output connected to GPIO35
- Stable 5 V power supply recommended
- No additional microcontrollers required
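For reference, a minimal pin and ADC setup sketch matching this list; the resolution and attenuation values are typical assumptions rather than the project's documented settings:

```cpp
#include <Arduino.h>

// Pin assignments from the hardware list above.
static const int MIC_PIN     = 32;   // microphone -> ADC input
static const int SPEAKER_PIN = 35;   // speaker / DAC output per this README

void setupAnalogInput() {
  analogReadResolution(12);                     // 12-bit readings (0..4095)
  analogSetPinAttenuation(MIC_PIN, ADC_11db);   // accept roughly the full 3.3 V range
}
```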
Design Decisions:
- ISR for Audio Recording
  - Continuous ADC with DMA fails when WiFi is active
  - The ISR ensures no audio is lost even under high WiFi load
- WebSocket for STT
  - Continuous streaming instead of file-based HTTP POST
  - Reduces latency and increases reliability for live audio
- Modular Classes
  - AudioRecorder, STT, GeminiChat, TTS, and Speaker can operate independently
  - Allows flexible integration into other projects
- Speaker Audio Merging
  - Handles multiple streams with different sample rates (see the resampling sketch after this list)
  - Provides a single coherent audio output
- Configurable LLM History
  - GeminiChat allows a custom number of conversation turns
  - Default is 50 messages
- Captive Portal WiFi Configuration
  - Eliminates hardcoded credentials
  - Works with any WiFi network
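As referenced in the list above, a simple sketch of bringing a stream to the output sample rate before mixing; this nearest-sample approach is only illustrative (a real mixer would likely interpolate or filter):

```cpp
#include <stdint.h>
#include <stddef.h>

// Nearest-sample conversion from srcRate to dstRate (e.g. 24000 -> 16000).
// Returns the number of output samples written.
size_t resample(const int16_t *in, size_t inCount, uint32_t srcRate,
                int16_t *out, size_t outCapacity, uint32_t dstRate) {
  size_t outCount = (size_t)((uint64_t)inCount * dstRate / srcRate);
  if (outCount > outCapacity) outCount = outCapacity;
  for (size_t i = 0; i < outCount; ++i) {
    size_t srcIndex = (size_t)((uint64_t)i * srcRate / dstRate);
    out[i] = in[srcIndex];                    // pick the nearest earlier source sample
  }
  return outCount;
}
```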
Usage:
- Power the ESP32
- Connect to the ESP32 WiFi access point for configuration (if not previously configured)
- Enter the WiFi SSID and password via the captive portal
- Once connected, AudioRecorder captures audio continuously
- STT streams the audio to Deepgram in real time
- GeminiChat manages conversation context and returns responses
- TTS streams each response back to Speaker
- Speaker plays the mixed audio output
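Tying these steps together, a top-level loop might look like the sketch below, reusing the hypothetical class names from the interface sketch near the top of this README; none of this is the project's actual code.

```cpp
#include <Arduino.h>
// Relies on the hypothetical class interfaces sketched earlier in this README.

AudioRecorder recorder;
STT           stt;
GeminiChat    chat(50);            // keep up to 50 messages of history
TTS           tts;
Speaker       speaker;

void setup() {
  Serial.begin(115200);
  // WiFi credentials come from the captive portal / NVS (not shown here).
  recorder.begin(/*adcPin=*/32, /*sampleRate=*/16000);
  speaker.begin(/*dacPin=*/35, /*sampleRate=*/16000);
  stt.begin("DEEPGRAM_API_KEY");
  tts.begin("DEEPGRAM_API_KEY");
  stt.onTranscript([](const String &text) {
    String reply = chat.ask(text);   // transcript -> Gemini
    tts.speak(reply, speaker);       // reply -> Deepgram TTS -> speaker
  });
}

void loop() {
  int16_t buf[512];
  size_t n = recorder.read(buf, 512);          // drain the ISR ring buffer
  if (n > 0) stt.sendAudio(buf, n);            // stream the chunk to Deepgram
  stt.loop();                                  // keep the WebSocket serviced
}
```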
Notes:
- All classes are designed to function independently
- AudioRecorder and Speaker support concurrent instances
- WebSocket STT is faster than HTTP POST and suitable for low-latency applications
- ISR-based audio capture is critical for avoiding dropped data under WiFi load
- Speaker merges multiple streams for unified output regardless of sample rate differences
Watch Demo Video
LinkedIn Demo Video
- Free to use, modify, and integrate any part of this code
- Star the repository if you find it useful