
EmbeddedSciUM ChatBot is a real-time voice assistant for ESP32 using Deepgram for STT/TTS and Gemini API for AI responses. It captures audio via ISR, streams to the cloud, processes conversation history, and plays merged audio through a custom speaker system — all designed for low-latency embedded applications


SahilKumarSingh01/EmbeddedsciumChatBot


EmbeddedSciUM ChatBot – ESP32 Voice Assistant

This project implements a fully embedded voice assistant on ESP32 hardware. It uses Deepgram for speech-to-text (STT) and text-to-speech (TTS), and the Gemini API for language model responses. The system is modular and designed for real-time audio processing with minimal latency. The software architecture consists of the following major components:

  • AudioRecorder: Captures audio using ADC via ISR
  • STT: Streams audio to Deepgram using WebSocket
  • GeminiChat: Manages structured queries and conversation history
  • TTS: Converts text responses to audio via Deepgram
  • Speaker: Mixes and outputs audio from multiple sources
  • WiFi Configuration: Captive portal for dynamic WiFi setup

All components are independent and can be used separately or together.


System Overview

The system supports continuous streaming. Audio is captured using interrupts (ISR) instead of continuous ADC with DMA to prevent data loss during WiFi activity.

User Microphone → AudioRecorder (GPIO32 ADC) → STT (Deepgram WebSocket) → GeminiChat (LLM) → TTS (Deepgram POST) → Speaker (GPIO35 DAC via WiFiClient)


Components

AudioRecorder

  • Captures audio from ADC (GPIO32) using interrupts (ISR)
  • Avoids DMA continuous mode due to ESP32 hardware limitations:
    • DMA stops when WiFi is active
    • Power draw from WiFi causes voltage fluctuation and audio loss
  • Buffers audio for real-time streaming
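
The hand-off between the sampling ISR and the streaming code can be sketched as a single-producer, single-consumer ring buffer. This is a minimal, portable illustration and not the repository's actual AudioRecorder code: the class name `SampleRing`, its methods, and the capacity are assumptions. On real hardware, `push()` would be called from a timer ISR after reading the ADC, and `pop()` from the task that feeds STT.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical lock-free ring buffer: one writer (the ISR), one reader (the main loop).
template <std::size_t N>
class SampleRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Called by the producer. Returns false and drops the sample if the buffer is full.
    bool push(uint16_t sample) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        data_[head & (N - 1)] = sample;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Called by the consumer. Returns false if there is nothing to read.
    bool pop(uint16_t& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = data_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Number of samples currently waiting to be read.
    std::size_t size() const {
        return head_.load(std::memory_order_acquire) -
               tail_.load(std::memory_order_acquire);
    }

private:
    std::array<uint16_t, N> data_{};
    std::atomic<std::size_t> head_{0};  // total samples written
    std::atomic<std::size_t> tail_{0};  // total samples read
};
```

Because the ISR only writes `head_` and the reader only writes `tail_`, neither side needs to disable interrupts or take a lock. A full buffer drops the newest sample rather than blocking inside the ISR.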

STT

  • Streams audio data to Deepgram using WebSocket
  • Chosen over HTTP POST to reduce latency
    • No need to wait for recording file to complete
    • Real-time transcription
  • Processes audio in chunks directly from AudioRecorder
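
Before a chunk is sent over the WebSocket, the raw ADC readings have to be packed into an audio encoding the service understands. The sketch below assumes 12-bit unsigned ADC samples and a signed 16-bit little-endian PCM ("linear16") stream. The README does not state which encoding the project uses, and `adcToLinear16` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert unsigned 12-bit ADC readings (0..4095, mid-scale 2048) into
// signed 16-bit little-endian PCM bytes, two bytes per sample.
std::vector<uint8_t> adcToLinear16(const uint16_t* samples, std::size_t count) {
    std::vector<uint8_t> out;
    out.reserve(count * 2);
    for (std::size_t i = 0; i < count; ++i) {
        // Remove the DC offset: 0..4095 becomes -2048..2047.
        const int32_t centered = static_cast<int32_t>(samples[i] & 0x0FFF) - 2048;
        // Scale 12-bit range up to 16-bit range: -32768..32752.
        const int16_t pcm = static_cast<int16_t>(centered * 16);
        const uint16_t bits = static_cast<uint16_t>(pcm);
        out.push_back(static_cast<uint8_t>(bits & 0xFF));  // low byte first
        out.push_back(static_cast<uint8_t>(bits >> 8));    // then high byte
    }
    return out;
}
```

Each returned byte vector could then be sent as one binary WebSocket frame, so transcription can start while the user is still speaking.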

GeminiChat

  • Sends structured requests to Gemini API
  • Stores conversation history; history size is configurable (default 50 messages)
  • Supports multiple turns and structured context management
  • Returns AI responses in a structured format suitable for TTS
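
A configurable history limit can be sketched as a bounded queue that discards the oldest message once the limit is exceeded. The names `ChatMessage` and `ChatHistory` are hypothetical and are not taken from the repository. The default of 50 matches the README, and the role strings "user" and "model" follow the Gemini API's convention.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

struct ChatMessage {
    std::string role;  // "user" or "model"
    std::string text;
};

// Keeps at most maxMessages entries; the oldest entry is evicted first.
class ChatHistory {
public:
    explicit ChatHistory(std::size_t maxMessages = 50) : max_(maxMessages) {}

    void add(std::string role, std::string text) {
        if (max_ == 0) return;  // history disabled
        messages_.push_back({std::move(role), std::move(text)});
        while (messages_.size() > max_) messages_.pop_front();
    }

    const std::deque<ChatMessage>& messages() const { return messages_; }

private:
    std::size_t max_;
    std::deque<ChatMessage> messages_;
};
```

Capping the history bounds both RAM use on the ESP32 and the size of each request body sent to the API.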

TTS

  • Sends text responses to Deepgram TTS
  • Streams audio chunks to Speaker without saving to intermediate files
  • Works in real-time for continuous interaction
  • Can operate independently of GeminiChat for static TTS generation
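
When audio arrives in network chunks rather than as a complete file, a chunk boundary can fall in the middle of a sample. The sketch below shows one way to handle that, assuming the TTS response is raw signed 16-bit little-endian PCM. The README does not confirm the format, and `PcmChunkDecoder` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reassembles 16-bit little-endian samples from arbitrarily sized byte chunks.
class PcmChunkDecoder {
public:
    // Appends every complete sample found in this chunk to `out`.
    // A trailing odd byte is kept and joined with the next chunk.
    void feed(const uint8_t* data, std::size_t len, std::vector<int16_t>& out) {
        std::size_t i = 0;
        if (hasCarry_ && len > 0) {
            out.push_back(toSample(carry_, data[0]));
            hasCarry_ = false;
            i = 1;
        }
        for (; i + 1 < len; i += 2) {
            out.push_back(toSample(data[i], data[i + 1]));
        }
        if (i < len) {  // one byte left over: half of the next sample
            carry_ = data[i];
            hasCarry_ = true;
        }
    }

private:
    static int16_t toSample(uint8_t lo, uint8_t hi) {
        return static_cast<int16_t>(static_cast<uint16_t>(lo | (hi << 8)));
    }

    uint8_t carry_ = 0;
    bool hasCarry_ = false;
};
```

Each call produces samples that can be handed to the Speaker immediately, so playback does not wait for the full response.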

Speaker

  • Outputs audio to GPIO35 DAC/I2S
  • Supports multiple audio instances simultaneously
  • Merges audio streams with different sample rates into a single output
  • Designed to handle real-time audio mixing
  • Can operate independently for any audio playback
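
Merging streams with different sample rates comes down to two steps: bring every stream to a common output rate, then sum the samples with clipping. The sketch below uses linear interpolation for the first step. It illustrates the idea rather than the repository's Speaker implementation, and the function names `resampleLinear` and `mix` are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Resample `in` from srcRate to dstRate using linear interpolation.
std::vector<int16_t> resampleLinear(const std::vector<int16_t>& in,
                                    uint32_t srcRate, uint32_t dstRate) {
    std::vector<int16_t> out;
    if (in.empty() || srcRate == 0 || dstRate == 0) return out;
    const std::size_t outLen = static_cast<std::size_t>(
        static_cast<uint64_t>(in.size()) * dstRate / srcRate);
    out.reserve(outLen);
    for (std::size_t i = 0; i < outLen; ++i) {
        // Position of output sample i in the input, as idx + frac/dstRate.
        const uint64_t num = static_cast<uint64_t>(i) * srcRate;
        const std::size_t idx = static_cast<std::size_t>(num / dstRate);
        const int64_t frac = static_cast<int64_t>(num % dstRate);
        const int64_t a = in[idx];
        const int64_t b = in[std::min(idx + 1, in.size() - 1)];
        out.push_back(static_cast<int16_t>(
            a + (b - a) * frac / static_cast<int64_t>(dstRate)));
    }
    return out;
}

// Sum two streams that already share a sample rate, clipping to the int16 range.
std::vector<int16_t> mix(const std::vector<int16_t>& x,
                         const std::vector<int16_t>& y) {
    std::vector<int16_t> out(std::max(x.size(), y.size()));
    for (std::size_t i = 0; i < out.size(); ++i) {
        const int32_t sum = (i < x.size() ? x[i] : 0) + (i < y.size() ? y[i] : 0);
        out[i] = static_cast<int16_t>(
            std::max<int32_t>(-32768, std::min<int32_t>(32767, sum)));
    }
    return out;
}
```

For example, an 8 kHz stream `{0, 100}` resampled to 16 kHz becomes `{0, 50, 100, 100}`, after which it can be mixed sample by sample with a native 16 kHz stream.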

WiFi Configuration

  • Implements a captive portal using ESP32 WebServer and DNS redirect
  • On first boot or failed WiFi connection:
    • ESP32 starts an access point
    • User connects via mobile or computer
    • WiFi SSID and password can be entered and stored
    • ESP32 reconnects automatically on next boot
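
The credentials submitted through the portal arrive as a URL-encoded form body. On the ESP32, the WebServer class normally decodes this for you (for example through `server.arg("ssid")`), so the portable sketch below only illustrates what that step does. The field names `ssid` and `pass` and the helper names are assumptions, not the repository's actual form.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Decode one application/x-www-form-urlencoded component.
std::string urlDecode(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '+') {
            out += ' ';
        } else if (s[i] == '%' && i + 2 < s.size() &&
                   std::isxdigit(static_cast<unsigned char>(s[i + 1])) &&
                   std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
            out += static_cast<char>(std::stoi(s.substr(i + 1, 2), nullptr, 16));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += s[i];
        }
    }
    return out;
}

// Split "key=value&key=value" into a map of decoded fields.
std::map<std::string, std::string> parseForm(const std::string& body) {
    std::map<std::string, std::string> fields;
    std::size_t start = 0;
    while (start <= body.size()) {
        std::size_t end = body.find('&', start);
        if (end == std::string::npos) end = body.size();
        const std::string pair = body.substr(start, end - start);
        const std::size_t eq = pair.find('=');
        if (eq != std::string::npos) {
            fields[urlDecode(pair.substr(0, eq))] = urlDecode(pair.substr(eq + 1));
        }
        start = end + 1;
    }
    return fields;
}
```

The decoded SSID and password would then be written to non-volatile storage so the device can reconnect on the next boot.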

Hardware

  • ESP32 microcontroller
  • Microphone connected to GPIO32 (ADC input)
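  • Speaker or DAC output connected to GPIO35 (note that on the original ESP32, GPIO34 to GPIO39 are input-only and the built-in DAC channels are on GPIO25 and GPIO26, so confirm the output pin in the source before wiring)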
  • Stable 5V power supply recommended
  • No additional microcontrollers required

Software Design Decisions

  1. ISR for Audio Recording
    • Continuous ADC with DMA fails when WiFi is active
    • ISR ensures no audio loss even under high WiFi load
  2. WebSocket for STT
    • Continuous streaming instead of file-based HTTP POST
    • Reduces latency and increases reliability for live audio
  3. Modular Classes
    • AudioRecorder, STT, GeminiChat, TTS, Speaker can operate independently
    • Allows flexible integration in other projects
  4. Speaker Audio Merging
    • Handles multiple streams with different sample rates
    • Provides single coherent audio output
  5. Configurable LLM History
    • GeminiChat allows custom number of conversation turns
    • Default 50 messages
  6. Captive Portal WiFi Configuration
    • Eliminates hardcoding credentials
    • Works with any WiFi network

File Structure


Usage

  1. Power the ESP32
  2. Connect to the ESP32 WiFi AP for configuration (if not previously configured)
  3. Enter WiFi SSID and password via captive portal
  4. Once connected, AudioRecorder captures audio continuously
  5. STT streams audio to Deepgram in real-time
  6. GeminiChat sends the transcript to Gemini along with the conversation context and returns the response
  7. TTS streams the response back to Speaker
  8. Speaker plays mixed audio output

Notes

  • All classes are designed to function independently
  • AudioRecorder and Speaker support concurrent instances
  • WebSocket STT is faster than HTTP POST and suitable for low-latency applications
  • ISR-based audio capture is critical for avoiding dropped data under WiFi load
  • Speaker merges multiple streams for unified output regardless of sample rate differences

Demo

Watch Demo Video
LinkedIn Demo Video


License

  • Free to use, modify, and integrate any part of this code
  • Star the repository if it is useful
