Welcome to the Voice Assistant project! 🎙️ Our goal is to create a modular voice assistant application that allows you to experiment with state-of-the-art (SOTA) models for various components. The modular structure provides flexibility, enabling you to pick and choose between different SOTA models for transcription, response generation, and text-to-speech (TTS). This approach facilitates easy testing and comparison of different models, making it an ideal platform for research and development in voice assistant technologies. Whether you're a developer, researcher, or enthusiast, this project is for you!
- Modular Design: Easily switch between different models for transcription, response generation, and TTS.
- Support for Multiple APIs: Integrates with OpenAI, Groq, and Deepgram APIs, along with placeholders for local models.
- Audio Recording and Playback: Record audio from the microphone and play generated speech.
- Configuration Management: Centralized configuration in
config.py
for easy setup and management.
voice_assistant/
├── voice_assistant/
│ ├── __init__.py
│ ├── audio.py
│ ├── api_key_manager.py
│ ├── config.py
│ ├── transcription.py
│ ├── response_generation.py
│ ├── text_to_speech.py
│ ├── utils.py
├── .env
├── run_voice_assistant.py
├── setup.py
├── requirements.txt
└── README.md
- Python 3.10 or higher
- Virtual environment (recommended)
- 📥 Clone the repository
git clone https://github.com/yourusername/voice_assistant.git
cd voice_assistant
- 🐍 Set up a virtual environment
Using venv
:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
Using conda
:
conda create --name jarvis-voice-assistant python=3.9
conda activate jarvis-voice-assistant
- 📦 Install the required packages
pip install -r requirements.txt
- 🛠️ Set up the environment variables
Create a .env
file in the root directory and add your API keys:
OPENAI_API_KEY=your_openai_api_key
GROQ_API_KEY=your_groq_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
LOCAL_MODEL_PATH=path/to/local/model
- 🧩 Configure the models
Edit config.py to select the models you want to use:
class Config:
# Model selection
TRANSCRIPTION_MODEL = 'groq' # Options: 'openai', 'groq', 'deepgram', 'local'
RESPONSE_MODEL = 'groq' # Options: 'openai', 'groq', 'local'
TTS_MODEL = 'deepgram' # Options: 'openai', 'deepgram', 'local'
# API keys and paths
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
DEEPGRAM_API_KEY = os.getenv("DEEPGRAM_API_KEY")
LOCAL_MODEL_PATH = os.getenv("LOCAL_MODEL_PATH")
- 🏃 Run the voice assistant
python run_voice_assistant.py
- OpenAI: Uses OpenAI's Whisper model.
- Groq: Uses Groq's Whisper-large-v3 model.
- Deepgram: Placeholder for Deepgram's transcription model.
- Local: Placeholder for a local speech-to-text (STT) model.
- OpenAI: Uses OpenAI's GPT-4 model.
- Groq: Uses Groq's LLaMA model.
- Local: Placeholder for a local language model.
- OpenAI: Uses OpenAI's TTS model with the "fable" voice.
- Deepgram: Uses Deepgram's TTS model with the "aura-angus-en" voice.
- Local: Placeholder for a local TTS model.
run_voice_assistant.py
: Main script to run the voice assistant.voice_assistant/config.py
: Manages configuration settings and API keys.voice_assistant/api_key_manager.py
: Handles retrieval of API keys based on configured models.voice_assistant/audio.py
: Functions for recording and playing audio.voice_assistant/transcription.py
: Manages audio transcription using various APIs.voice_assistant/response_generation.py
: Handles generating responses using various language models.voice_assistant/text_to_speech.py
: Manages converting text responses into speech.voice_assistant/utils.py
: Contains utility functions like deleting files.voice_assistant/__init__.py
: Initializes thevoice_assistant
package.
Here's what's next for the Voice Assistant project:
- Add Support for Streaming: Enable real-time streaming of audio input and output.
- Add Support for ElevenLabs and Enhanced Deepgram for TTS: Integrate additional TTS options for higher quality and variety.
- Add Filler Audios: Include background or filler audios while waiting for model responses to enhance user experience.
- Add Support for Local Models Across the Board: Expand support for local models in transcription, response generation, and TTS.
We welcome contributions from the community! If you'd like to help improve this project, please follow these steps:
- Fork the repository.
- Create a new branch (
git checkout -b feature-branch
). - Make your changes and commit them (
git commit -m 'Add new feature'
). - Push to the branch (
git push origin feature-branch
). - Open a pull request detailing your changes.