This project implements a fully embedded voice assistant on ESP32 hardware. It uses Deepgram for speech-to-text (STT) and text-to-speech (TTS), and the Gemini API for language model responses. The system is modular and designed for real-time audio processing with minimal latency. The software architecture consists of the following major components:
- AudioRecorder: Captures audio using ADC via ISR
- STT: Streams audio to Deepgram using WebSocket
- GeminiChat: Manages structured queries and conversation history
- TTS: Converts text responses to audio via Deepgram
- Speaker: Mixes and outputs audio from multiple sources
- WiFi Configuration: Captive portal for dynamic WiFi setup
All components are independent and can be used separately or together.
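As a rough illustration of that modularity, the class interfaces might look like the sketch below. All class names, methods, and signatures here are assumptions for illustration only, not the project's actual API.

```cpp
// Illustrative interface sketch only -- names and signatures are assumptions.
#include <Arduino.h>
#include <functional>

class Speaker;                    // forward declaration for TTS::speak

class AudioRecorder {             // ISR-driven ADC capture on GPIO32
public:
  void begin(int adcPin, uint32_t sampleRate);
  size_t read(int16_t *dst, size_t maxSamples);    // drain the capture buffer
};

class STT {                       // Deepgram WebSocket streaming
public:
  void begin(const char *apiKey);
  void sendAudio(const int16_t *samples, size_t count);
  void onTranscript(std::function<void(const String &)> cb);
  void loop();                    // service the WebSocket connection
};

class GeminiChat {                // Gemini API with bounded history
public:
  explicit GeminiChat(size_t maxHistory = 50);
  String ask(const String &userText);
};

class TTS {                       // Deepgram text-to-speech
public:
  void begin(const char *apiKey);
  void speak(const String &text, Speaker &out);
};

class Speaker {                   // DAC output + stream mixing
public:
  void begin(int dacPin, uint32_t sampleRate);
  void play(const int16_t *samples, size_t count, uint32_t srcRate);
};
```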
The system supports continuous streaming. Audio is captured with a timer interrupt (ISR) rather than continuous ADC with DMA, which prevents data loss during WiFi activity.

Data flow: User microphone → AudioRecorder (GPIO32 ADC) → STT (Deepgram WebSocket) → GeminiChat (LLM) → TTS (Deepgram POST) → Speaker (GPIO35 DAC via WiFiClient)
AudioRecorder:
- Captures audio from the ADC (GPIO32) using a timer interrupt (ISR)
- Avoids continuous ADC/DMA mode due to ESP32 hardware limitations:
  - DMA capture stops when WiFi is active
  - Power draw from WiFi causes voltage fluctuations and audio loss
- Buffers audio for real-time streaming
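A minimal sketch of this ISR-driven capture, assuming a 16 kHz sample rate, the Arduino-ESP32 2.x hardware-timer API (`timerBegin`/`timerAlarmWrite`), and a simple ring buffer; the buffer size and function names are illustrative, and whether a raw ADC read is safe inside an ISR can depend on the core version:

```cpp
#include <Arduino.h>
#include <driver/adc.h>

static const uint32_t SAMPLE_RATE = 16000;           // assumed sample rate
static const size_t   RING_SIZE   = 4096;            // power of two for cheap masking

static volatile uint16_t ring[RING_SIZE];
static volatile size_t   head = 0, tail = 0;
static hw_timer_t *timer = nullptr;

// Runs SAMPLE_RATE times per second; it does not depend on continuous
// ADC/DMA mode, so it keeps collecting samples while WiFi is busy.
void IRAM_ATTR onSample() {
  uint16_t s = adc1_get_raw(ADC1_CHANNEL_4);          // GPIO32 = ADC1 channel 4
  size_t next = (head + 1) & (RING_SIZE - 1);
  if (next != tail) {                                 // drop the sample if the buffer is full
    ring[head] = s;
    head = next;
  }
}

void startRecorder() {
  adc1_config_width(ADC_WIDTH_BIT_12);
  adc1_config_channel_atten(ADC1_CHANNEL_4, ADC_ATTEN_DB_11);
  timer = timerBegin(0, 80, true);                    // 80 MHz / 80 = 1 MHz tick
  timerAttachInterrupt(timer, &onSample, true);
  timerAlarmWrite(timer, 1000000 / SAMPLE_RATE, true);
  timerAlarmEnable(timer);
}

// Called from loop(): copy whatever has accumulated since the last call.
size_t readSamples(uint16_t *dst, size_t maxCount) {
  size_t n = 0;
  while (tail != head && n < maxCount) {
    dst[n++] = ring[tail];
    tail = (tail + 1) & (RING_SIZE - 1);
  }
  return n;
}
```

Keeping the ISR short and draining the ring buffer from the main loop is what lets capture continue uninterrupted between WiFi operations.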
STT (Speech-to-Text):
- Streams audio data to Deepgram over a WebSocket
- Chosen over HTTP POST to reduce latency:
  - No need to wait for a complete recording file
  - Real-time transcription
- Processes audio in chunks directly from AudioRecorder
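A sketch of what the WebSocket streaming could look like, assuming the links2004/arduinoWebSockets library (`WebSocketsClient`) and Deepgram's `/v1/listen` live endpoint; the query parameters, header handling, and function names are assumptions, and the JSON transcript parsing is left out:

```cpp
#include <Arduino.h>
#include <WiFi.h>
#include <WebSocketsClient.h>   // links2004/arduinoWebSockets (assumed library)

static WebSocketsClient ws;

// Prints whatever Deepgram sends back; real code would parse the JSON and
// extract channel.alternatives[0].transcript.
void onWsEvent(WStype_t type, uint8_t *payload, size_t length) {
  if (type == WStype_TEXT) {
    Serial.write(payload, length);
    Serial.println();
  }
}

void startSTT(const char *deepgramKey) {
  // Endpoint and query parameters are assumptions based on Deepgram's
  // live-streaming documentation.
  static String auth;
  auth = String("Authorization: Token ") + deepgramKey;
  ws.setExtraHeaders(auth.c_str());
  ws.beginSSL("api.deepgram.com", 443,
              "/v1/listen?encoding=linear16&sample_rate=16000&channels=1");
  ws.onEvent(onWsEvent);
}

// Call from loop(): push the latest chunk from the AudioRecorder and keep
// the socket serviced. No intermediate file is ever written.
void streamChunk(uint8_t *pcm, size_t bytes) {
  if (bytes > 0) {
    ws.sendBIN(pcm, bytes);
  }
  ws.loop();
}
```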
GeminiChat:
- Sends structured requests to the Gemini API
- Stores conversation history; the history size is configurable (default 50 messages)
- Supports multi-turn conversations and structured context management
- Returns AI responses in a structured format suitable for TTS
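A hedged sketch of a Gemini request with bounded history, assuming ArduinoJson 6 and Gemini's public `generateContent` REST endpoint; the model name (`gemini-1.5-flash`), buffer sizes, and helper names are assumptions, not the project's actual GeminiChat code:

```cpp
#include <Arduino.h>
#include <vector>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>
#include <ArduinoJson.h>

struct Turn { String role; String text; };            // "user" or "model"

static std::vector<Turn> history;
static const size_t MAX_HISTORY = 50;                 // default from this README

// Sends the whole (bounded) conversation to Gemini and returns the reply text.
String askGemini(const String &userText, const char *apiKey) {
  history.push_back({"user", userText});
  while (history.size() > MAX_HISTORY) history.erase(history.begin());

  DynamicJsonDocument req(16384);
  JsonArray contents = req.createNestedArray("contents");
  for (const Turn &t : history) {
    JsonObject msg = contents.createNestedObject();
    msg["role"] = t.role;
    JsonArray parts = msg.createNestedArray("parts");
    parts.createNestedObject()["text"] = t.text;
  }
  String body;
  serializeJson(req, body);

  WiFiClientSecure client;
  client.setInsecure();                               // sketch only: skip cert checks
  HTTPClient http;
  http.begin(client, String("https://generativelanguage.googleapis.com/v1beta/models/"
                            "gemini-1.5-flash:generateContent?key=") + apiKey);
  http.addHeader("Content-Type", "application/json");
  int code = http.POST(body);

  String reply;
  if (code == 200) {
    DynamicJsonDocument res(16384);
    deserializeJson(res, http.getString());
    reply = res["candidates"][0]["content"]["parts"][0]["text"].as<String>();
    history.push_back({"model", reply});              // keep the model turn in history
  }
  http.end();
  return reply;
}
```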
TTS (Text-to-Speech):
- Sends text responses to Deepgram TTS
- Streams audio chunks to Speaker without saving intermediate files
- Works in real time for continuous interaction
- Can operate independently of GeminiChat for static TTS generation
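A sketch of streaming TTS audio without intermediate files, assuming Deepgram's Speak endpoint (`/v1/speak`) and the ESP32 `HTTPClient` stream API; the model name, query parameters, and chunk size are assumptions, certificate validation is skipped, and the text is assumed not to need JSON escaping:

```cpp
#include <Arduino.h>
#include <WiFi.h>
#include <WiFiClientSecure.h>
#include <HTTPClient.h>

// Streams Deepgram TTS audio to the speaker chunk by chunk, never writing a file.
void speakText(const String &text, const char *deepgramKey) {
  WiFiClientSecure client;
  client.setInsecure();                       // sketch only: skip cert checks

  HTTPClient http;
  http.begin(client, "https://api.deepgram.com/v1/speak"
                     "?model=aura-asteria-en&encoding=linear16&sample_rate=16000");
  http.addHeader("Authorization", String("Token ") + deepgramKey);
  http.addHeader("Content-Type", "application/json");

  int code = http.POST(String("{\"text\":\"") + text + "\"}");
  if (code == 200) {
    int remaining = http.getSize();           // -1 if the server uses chunked encoding
    WiFiClient *stream = http.getStreamPtr();
    uint8_t chunk[1024];
    while (http.connected() && remaining != 0) {
      size_t avail = stream->available();
      if (avail) {
        size_t toRead = avail < sizeof(chunk) ? avail : sizeof(chunk);
        int n = stream->read(chunk, toRead);
        if (n > 0) {
          // hand the PCM bytes to the Speaker here, e.g. speaker.play(...)
          if (remaining > 0) remaining -= n;
        }
      }
      delay(1);                               // let WiFi/RTOS tasks run
    }
  }
  http.end();
}
```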
Speaker:
- Outputs audio to the GPIO35 DAC/I2S
- Supports multiple simultaneous audio instances
- Merges audio streams with different sample rates into a single output
- Designed for real-time audio mixing
- Can operate independently for any audio playback
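This README does not show how the mixing is implemented; the sketch below illustrates one common approach, summing 16-bit samples into a wider accumulator and clipping, once all sources have been brought to the same sample rate:

```cpp
#include <stdint.h>
#include <stddef.h>

// Mixes several 16-bit PCM buffers (already at the same sample rate) into one
// output buffer with saturation, so simultaneous sources do not wrap around
// and distort. Resampling of mismatched rates is a separate step.
void mixStreams(const int16_t *const *sources, size_t numSources,
                size_t numSamples, int16_t *out) {
  for (size_t i = 0; i < numSamples; ++i) {
    int32_t acc = 0;
    for (size_t s = 0; s < numSources; ++s) {
      acc += sources[s][i];
    }
    if (acc > INT16_MAX) acc = INT16_MAX;     // clip instead of overflowing
    if (acc < INT16_MIN) acc = INT16_MIN;
    out[i] = static_cast<int16_t>(acc);
  }
}
```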
WiFi Configuration:
- Implements a captive portal using the ESP32 WebServer and a DNS redirect
- On first boot or after a failed WiFi connection:
  - The ESP32 starts an access point
  - The user connects from a phone or computer
  - The WiFi SSID and password are entered and stored
- The ESP32 reconnects automatically on the next boot
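A compact captive-portal sketch using the stock `WebServer`, `DNSServer`, and `Preferences` libraries; the AP name, form markup, and NVS keys are assumptions and may differ from the project's implementation:

```cpp
#include <WiFi.h>
#include <WebServer.h>
#include <DNSServer.h>
#include <Preferences.h>

static WebServer   server(80);
static DNSServer   dns;
static Preferences prefs;

// Serve a tiny form; on submit, persist credentials to NVS and reboot.
void startPortal() {
  WiFi.softAP("ESP32-VoiceAssistant");        // AP name is an assumption
  dns.start(53, "*", WiFi.softAPIP());        // answer every DNS query with our IP

  server.on("/", []() {
    server.send(200, "text/html",
      "<form action='/save' method='POST'>"
      "SSID: <input name='ssid'><br>"
      "Password: <input name='pass' type='password'><br>"
      "<input type='submit' value='Save'></form>");
  });
  server.on("/save", HTTP_POST, []() {
    prefs.begin("wifi", false);
    prefs.putString("ssid", server.arg("ssid"));
    prefs.putString("pass", server.arg("pass"));
    prefs.end();
    server.send(200, "text/plain", "Saved. Rebooting...");
    delay(500);
    ESP.restart();                            // reconnect with the stored credentials
  });
  server.onNotFound([]() {                    // captive-portal catch-all redirect
    server.sendHeader("Location", String("http://") + WiFi.softAPIP().toString());
    server.send(302, "text/plain", "");
  });
  server.begin();
}

void servicePortal() {                        // call from loop()
  dns.processNextRequest();
  server.handleClient();
}
```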
Hardware Requirements:
- ESP32 microcontroller
- Microphone connected to GPIO32 (ADC input)
- Speaker or DAC output connected to GPIO35
- Stable 5 V power supply recommended
- No additional microcontrollers required
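For reference, a minimal pin and ADC setup sketch matching this list; the resolution and attenuation values are typical assumptions rather than the project's documented settings:

```cpp
#include <Arduino.h>

// Pin assignments from the hardware list above.
static const int MIC_PIN     = 32;   // microphone -> ADC input
static const int SPEAKER_PIN = 35;   // speaker / DAC output per this README

void setupAnalogInput() {
  analogReadResolution(12);                     // 12-bit readings (0..4095)
  analogSetPinAttenuation(MIC_PIN, ADC_11db);   // accept roughly the full 3.3 V range
}
```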
Design Decisions:
- ISR for Audio Recording
  - Continuous ADC with DMA fails when WiFi is active
  - The ISR ensures no audio is lost even under high WiFi load
- WebSocket for STT
  - Continuous streaming instead of file-based HTTP POST
  - Reduces latency and increases reliability for live audio
- Modular Classes
  - AudioRecorder, STT, GeminiChat, TTS, and Speaker can operate independently
  - Allows flexible integration into other projects
- Speaker Audio Merging
  - Handles multiple streams with different sample rates (see the resampling sketch after this list)
  - Provides a single coherent audio output
- Configurable LLM History
  - GeminiChat allows a custom number of conversation turns
  - Default is 50 messages
- Captive Portal WiFi Configuration
  - Eliminates hardcoded credentials
  - Works with any WiFi network
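As referenced in the list above, a simple sketch of bringing a stream to the output sample rate before mixing; this nearest-sample approach is only illustrative (a real mixer would likely interpolate or filter):

```cpp
#include <stdint.h>
#include <stddef.h>

// Nearest-sample conversion from srcRate to dstRate (e.g. 24000 -> 16000).
// Returns the number of output samples written.
size_t resample(const int16_t *in, size_t inCount, uint32_t srcRate,
                int16_t *out, size_t outCapacity, uint32_t dstRate) {
  size_t outCount = (size_t)((uint64_t)inCount * dstRate / srcRate);
  if (outCount > outCapacity) outCount = outCapacity;
  for (size_t i = 0; i < outCount; ++i) {
    size_t srcIndex = (size_t)((uint64_t)i * srcRate / dstRate);
    out[i] = in[srcIndex];                    // pick the nearest earlier source sample
  }
  return outCount;
}
```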
Usage:
- Power the ESP32
- Connect to the ESP32 WiFi access point for configuration (if not previously configured)
- Enter the WiFi SSID and password via the captive portal
- Once connected, AudioRecorder captures audio continuously
- STT streams the audio to Deepgram in real time
- GeminiChat manages conversation context and returns responses
- TTS streams each response back to Speaker
- Speaker plays the mixed audio output
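Tying these steps together, a top-level loop might look like the sketch below, reusing the hypothetical class names from the interface sketch near the top of this README; none of this is the project's actual code.

```cpp
#include <Arduino.h>
// Relies on the hypothetical class interfaces sketched earlier in this README.

AudioRecorder recorder;
STT           stt;
GeminiChat    chat(50);            // keep up to 50 messages of history
TTS           tts;
Speaker       speaker;

void setup() {
  Serial.begin(115200);
  // WiFi credentials come from the captive portal / NVS (not shown here).
  recorder.begin(/*adcPin=*/32, /*sampleRate=*/16000);
  speaker.begin(/*dacPin=*/35, /*sampleRate=*/16000);
  stt.begin("DEEPGRAM_API_KEY");
  tts.begin("DEEPGRAM_API_KEY");
  stt.onTranscript([](const String &text) {
    String reply = chat.ask(text);   // transcript -> Gemini
    tts.speak(reply, speaker);       // reply -> Deepgram TTS -> speaker
  });
}

void loop() {
  int16_t buf[512];
  size_t n = recorder.read(buf, 512);          // drain the ISR ring buffer
  if (n > 0) stt.sendAudio(buf, n);            // stream the chunk to Deepgram
  stt.loop();                                  // keep the WebSocket serviced
}
```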
Notes:
- All classes are designed to function independently
- AudioRecorder and Speaker support concurrent instances
- WebSocket STT is faster than HTTP POST and suitable for low-latency applications
- ISR-based audio capture is critical for avoiding dropped data under WiFi load
- Speaker merges multiple streams for unified output regardless of sample rate differences
Watch Demo Video
LinkedIn Demo Video
- Free to use, modify, and integrate any part of this code
- Star the repository if you find it useful