A web-based AI voice assistant that integrates speech recognition, AI conversation, and speech synthesis. Users can interact with AI through real-time voice conversations.
- Real-time speech recognition (based on Baidu Speech Service)
- AI conversation (powered by Gemini API)
- Text-to-Speech (TTS)
- Real-time audio visualization
- Automatic voice input detection
- Support for interruption and continuous dialogue
- Node.js 16.0 or above
- Modern browser (with WebAudio API support)
- Internet connection (for API calls)
- Python 3.7 or above
- edge-tts library (for speech synthesis)
```bash
# Install pnpm globally (if not installed)
npm install -g pnpm

# Install project dependencies
pnpm install

# Install backend dependencies
cd Backend
pip install -r requirements.txt
```
```bash
# Navigate to the backend directory
cd Backend
# Start the backend service
python edge-tts.py
```
Note: The backend service must run on port 8000 for the speech synthesis feature to work properly.
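For reference, here is a minimal sketch of what such a service can look like with Flask and edge-tts. The `/tts` route, the JSON request fields, and the default voice are assumptions for illustration, not necessarily what `edge-tts.py` actually implements:

```python
# tts_sketch.py - minimal Flask + edge-tts service (illustrative only).
# The /tts route, JSON fields, and default voice are assumptions.
import asyncio
import io

import edge_tts
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/tts", methods=["POST"])
def tts():
    payload = request.get_json()
    text = payload.get("text", "")
    voice = payload.get("voice", "zh-CN-XiaoxiaoNeural")  # assumed default voice

    async def synthesize() -> bytes:
        # edge_tts.Communicate streams MP3 audio chunks for the given text.
        buf = io.BytesIO()
        async for chunk in edge_tts.Communicate(text, voice).stream():
            if chunk["type"] == "audio":
                buf.write(chunk["data"])
        return buf.getvalue()

    audio = asyncio.run(synthesize())
    return send_file(io.BytesIO(audio), mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(port=8000)  # the frontend expects the service on this port
```

Running synthesis per request with `asyncio.run` keeps the sketch simple; a production service would reuse an event loop and stream the audio instead of buffering it.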
- Copy the `.env.example` file and rename it to `.env`:

```bash
cp .env.example .env
```
- Fill in your API keys in the `.env` file:
```env
# Baidu API Configuration
BAIDU_API_KEY=your_baidu_api_key_here
BAIDU_SECRET_KEY=your_baidu_secret_key_here
BAIDU_TOKEN_URL=https://aip.baidubce.com/oauth/2.0/token
BAIDU_ASR_URL=https://vop.baidu.com/server_api

# Gemini API Configuration
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_API_URL=https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent
```
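Once the `.env` file is filled in, both keys can be sanity-checked from the command line. The sketch below (the file name and the test prompt are illustrative) exchanges the Baidu credentials for an OAuth access token and sends a one-line prompt to Gemini:

```python
# check_keys.py - quick sanity check for the .env credentials (illustrative).
import os

import requests
from dotenv import load_dotenv  # pip install python-dotenv requests

load_dotenv()

# Baidu: exchange the API key and secret key for an OAuth access token.
token = requests.post(
    os.environ["BAIDU_TOKEN_URL"],
    params={
        "grant_type": "client_credentials",
        "client_id": os.environ["BAIDU_API_KEY"],
        "client_secret": os.environ["BAIDU_SECRET_KEY"],
    },
).json()
print("Baidu access_token:", token.get("access_token", token))

# Gemini: send a trivial prompt to the generateContent endpoint.
reply = requests.post(
    os.environ["GEMINI_API_URL"],
    params={"key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"parts": [{"text": "Say hello"}]}]},
).json()
print("Gemini reply:", reply["candidates"][0]["content"]["parts"][0]["text"])
```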
```bash
# Return to the project root
cd ..

# Install frontend dependencies
pnpm install

# Start the development server
pnpm dev
```
Visit http://localhost:5173 to use the voice assistant.
Tip: We recommend using pnpm as the package manager for faster installation and better disk space utilization.
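With both servers running, the backend can also be exercised directly. Assuming the `/tts` route from the sketch earlier (the actual script may expose a different interface):

```python
# Exercise the (assumed) /tts route of the backend directly.
import requests

resp = requests.post("http://localhost:8000/tts", json={"text": "你好，世界"})
with open("hello.mp3", "wb") as f:
    f.write(resp.content)  # playable MP3 audio if the route matches the sketch
```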
The project is divided into a frontend and a backend:

- Frontend: React + Vite web application
  - Speech recognition (Baidu Speech Service; see the request sketch after this list)
  - AI conversation (Gemini API)
  - Audio visualization
- Backend: Python Flask application
  - Speech synthesis service (based on edge-tts)
  - Runs on port 8000
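The speech recognition call is an ordinary HTTP request to Baidu's short-speech API. It is sketched below in Python purely to illustrate the protocol (the frontend does the equivalent with `fetch`); the audio format and `cuid` value are assumptions:

```python
# asr_sketch.py - shape of a Baidu short-speech recognition request.
# The app issues this from the browser; Python is used here only to
# illustrate the protocol. Audio format and cuid are assumptions.
import base64

import requests

token = "..."  # access token obtained via BAIDU_TOKEN_URL (see the check above)

with open("sample.wav", "rb") as f:  # assumed 16 kHz, 16-bit, mono PCM WAV
    audio = f.read()

resp = requests.post(
    "https://vop.baidu.com/server_api",
    json={
        "format": "wav",
        "rate": 16000,
        "channel": 1,
        "cuid": "voice-assistant-demo",  # any stable device identifier
        "token": token,
        "speech": base64.b64encode(audio).decode(),
        "len": len(audio),  # length of the raw audio in bytes, pre-base64
    },
)
print(resp.json().get("result"))  # e.g. ["你好"] on success
```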
To obtain the Baidu Speech Service keys:

- Visit Baidu Cloud
- Register/Login
- Create a speech application
- Get the API Key and Secret Key

To obtain the Gemini API key:

- Visit Google AI Studio
- Log in with your Google account
- Create an API key
Security notes:

- Never commit a `.env` file containing actual API keys to version control
- Ensure the `.env` file is added to `.gitignore`
- Rotate API keys regularly
- Implement additional security measures for production environments
To use the voice assistant:

- Open the webpage and allow browser microphone access
- Wait for the "Ready" prompt
- Start speaking; the system will automatically detect voice input
- The AI assistant will respond by voice
- You can interrupt the AI's response at any time to ask a follow-up question (see the multi-turn sketch after this list)
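Interruption and continuous dialogue work because each Gemini request can resend the prior turns as context; a minimal sketch of that request shape (the conversation history is illustrative):

```python
# Multi-turn Gemini request: earlier turns are resent as alternating
# user/model messages so the reply stays in context (illustrative history).
import os

import requests

history = [
    {"role": "user", "parts": [{"text": "What is the tallest mountain?"}]},
    {"role": "model", "parts": [{"text": "Mount Everest."}]},
    {"role": "user", "parts": [{"text": "How tall is it?"}]},
]

reply = requests.post(
    os.environ["GEMINI_API_URL"],
    params={"key": os.environ["GEMINI_API_KEY"]},
    json={"contents": history},
).json()
print(reply["candidates"][0]["content"]["parts"][0]["text"])
```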
- If speech recognition is inaccurate, ensure that:
  - The microphone is working properly
  - Environment noise is minimal
  - You are speaking at a moderate pace
- If API calls fail, check that:
  - The API keys are correct
  - The network connection is stable
  - API usage quotas are not exceeded
For issues or suggestions, please submit an Issue or Pull Request.
MIT License
This project references the following excellent open-source projects:
- transformers.js-examples/moonshine-web - HuggingFace's Transformers.js example project
- edge-tts - Microsoft Edge TTS-based online speech synthesis service