Real-time multimodal AI agent powered by Google Gemini 2.5 Flash Native Audio that sees your workspace through the camera and responds with live voice.
Hephaestus connects to Gemini's Live API over a persistent WebSocket, streams camera frames in real time, and plays back the AI's spoken responses directly in the browser — no plugins, no external services.
| Feature | Status |
|---|---|
| Persistent WebSocket to Gemini Live API | ✅ Stable |
| Camera feed streaming (JPEG frames every 3 s) | ✅ Live |
| Text input → AI response | ✅ Working |
| AI voice response (PCM audio playback) | ✅ Working |
| Thought-part filtering (internal reasoning hidden) | ✅ Fixed |
| WinError 10054 / keepalive ping suppression | ✅ Fixed |
| Mute / unmute AI voice | ✅ Working |
| SPEAKING badge while AI talks | ✅ Working |
| Auto-reconnect on disconnect | ✅ Working |
| SDK warning suppression | ✅ Fixed |
- Python 3.11+
- Node.js 18+
- Webcam
- Google Gemini API Key — get one free from Google AI Studio
  - Must have access to `gemini-2.5-flash-native-audio-preview` (Live API)
```bash
git clone https://github.com/SamoTech/hephaestus-live-agent.git
cd hephaestus-live-agent
```

```bash
cd backend
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS / Linux
pip install -r requirements.txt
cp .env.example .env
# Open .env and set GEMINI_API_KEY=your_key_here
python main.py
# → Uvicorn running on http://0.0.0.0:8000
```

```bash
cd frontend
npm install
npm run dev
# → http://localhost:5173
```

- Open http://localhost:5173
- Click the orange ▶ button to start the camera
- Grant camera permission
- Type a message — press Enter or click Send
- You will hear Gemini speak the response through your speakers
- The SPEAKING badge pulses on the camera panel while audio plays
- Use the 🔊 button in the header to mute/unmute the AI voice
```
Browser (React + Vite)
  │
  │  WebSocket  ws://localhost:8000/ws/live
  │  ├─ → { type: "text",  text: "..." }            User message
  │  ├─ → { type: "image", data: "<base64 jpeg>" }  Camera frame
  │  ├─ ← { type: "model_audio", data: "<pcm>" }    AI voice (streamed)
  │  ├─ ← { type: "model_text",  text: "..." }      AI transcript
  │  ├─ ← { type: "audio_start" }                   Speaking indicator
  │  └─ ← { type: "system" / "error" }              Status / errors
  │
FastAPI (Uvicorn)
  │
  │  google-genai Live SDK
  └─ Gemini 2.5 Flash Native Audio (Live API)
       ├─ response.data         → raw 16-bit PCM @ 24 kHz
       └─ server_content.parts  → text (thoughts filtered)
```
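The server-to-client messages in the diagram can be routed by their `type` field. Below is a minimal Python sketch of such a dispatcher — the handler return values are illustrative placeholders (a real client plays the audio and updates the UI), not code from this repo:

```python
import base64
import json

def handle_message(raw: str) -> str:
    """Route a server->client WebSocket message by its "type" field.

    Returns a short description of the action a client would take
    (illustrative only -- a real client plays audio, updates the UI, etc.).
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "model_audio":
        pcm = base64.b64decode(msg["data"])  # raw 16-bit PCM @ 24 kHz
        return f"queue {len(pcm)} bytes of audio"
    if kind == "model_text":
        return f"show transcript: {msg['text']}"
    if kind == "audio_start":
        return "show SPEAKING badge"
    if kind in ("system", "error"):
        return f"log [{kind}] {msg.get('text', '')}"
    return f"ignore unknown type {kind!r}"

# Example: a model_audio chunk carrying four 16-bit samples (8 bytes)
chunk = {"type": "model_audio", "data": base64.b64encode(b"\x00\x01" * 4).decode()}
print(handle_message(json.dumps(chunk)))  # → queue 8 bytes of audio
```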
Audio pipeline:
Gemini returns raw 16-bit little-endian PCM at 24 kHz mono. The backend base64-encodes each chunk and sends it over WebSocket. The frontend decodes it into Float32Array and schedules it on a Web Audio API AudioContext with gapless back-to-back buffering.
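The same math the pipeline performs can be sketched numerically. This Python example (illustrative, not from the repo) packs signed 16-bit samples as little-endian PCM and base64-encodes them for transport, then decodes them back into the normalized [-1.0, 1.0) floats a `Float32Array` would hold:

```python
import base64
import struct

def encode_chunk(samples: list[int]) -> str:
    """Pack signed 16-bit samples as little-endian PCM, then base64 (backend side)."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(pcm).decode("ascii")

def decode_chunk(b64: str) -> list[float]:
    """Decode base64 PCM and normalize int16 to floats (frontend side)."""
    pcm = base64.b64decode(b64)
    ints = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return [s / 32768.0 for s in ints]

chunk = encode_chunk([0, 16384, -32768, 32767])
print(decode_chunk(chunk))  # → [0.0, 0.5, -1.0, 0.999969482421875]
```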
```
hephaestus-live-agent/
├── backend/
│   ├── main.py              # FastAPI app + WebSocket handler + Gemini bridge
│   ├── requirements.txt
│   ├── .env.example
│   ├── Dockerfile
│   └── config/
│       ├── settings.py      # GEMINI_MODEL, HOST, PORT, CORS, etc.
│       └── prompts.py       # System prompts (default, engineering, dev, education, creative)
│
├── frontend/
│   ├── src/
│   │   ├── App.jsx          # Main UI — camera, logs, audio player, WebSocket
│   │   ├── main.jsx
│   │   └── index.css
│   ├── tailwind.config.js   # Custom colors: hephaestus-dark, hephaestus-panel, hephaestus-orange
│   ├── vite.config.js
│   └── package.json
│
├── docs/
│   ├── API.md
│   ├── ARCHITECTURE.md
│   ├── DEPLOYMENT.md
│   └── i18n/                # Translations
│       ├── README.ar.md     # Arabic
│       ├── README.zh.md     # Chinese
│       ├── README.es.md     # Spanish
│       ├── README.fr.md     # French
│       ├── README.de.md     # German
│       ├── README.ja.md     # Japanese
│       └── README.ru.md     # Russian
│
├── docker-compose.yml
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md
```
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Service info, version, active model |
| `/health` | GET | Health check + Gemini readiness |
| `/models` | GET | List all available Gemini models |
| `/models/live` | GET | List only Live-API-capable models |
| `/ws/live` | WebSocket | Bidirectional AI streaming |
Client → Server

```json
{ "type": "text", "text": "what do you see?" }
{ "type": "image", "data": "<base64>", "mime_type": "image/jpeg" }
```

Server → Client

```json
{ "type": "system", "text": "Connected to Hephaestus AI. Camera feed active." }
{ "type": "model_text", "text": "I can see a circuit board..." }
{ "type": "model_audio", "data": "<base64 PCM>", "mime_type": "audio/pcm;rate=24000" }
{ "type": "audio_start" }
{ "type": "error", "text": "description" }
```

All configuration lives in `backend/.env`. Copy from `.env.example`:
```env
# Required
GEMINI_API_KEY=your_api_key_here

# Optional — defaults shown
GEMINI_MODEL=models/gemini-2.5-flash-native-audio-preview-12-2025
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO
CORS_ORIGINS=["http://localhost:5173","http://localhost:3000"]
```

System prompts can be swapped by editing `backend/config/prompts.py`. Available modes: `default`, `engineering`, `developer`, `education`, `creative`.
```bash
cp backend/.env.example backend/.env
# Set GEMINI_API_KEY in backend/.env
docker compose up --build
# Backend  → http://localhost:8000
# Frontend → http://localhost:5173
```

v1.0 — Core (✅ Complete)
- Stable WebSocket ↔ Gemini Live API bridge
- Camera frame streaming
- Text input / output
- Live PCM audio playback (Web Audio API)
- Thought-part filtering
- Mute toggle + SPEAKING indicator
- Auto-reconnect with Windows WinError 10054 fix
v1.1 — Microphone Input (🚧 Next)
- Browser microphone → 16-bit PCM → Gemini Live audio input
- Push-to-talk and voice activity detection
- Full duplex voice conversation
v1.2 — Agentic Tools (📋 Planned)
- Web search integration
- Code generation + file save
- Session export (transcript + audio)
- Conversation history and memory
v2.0 — Production (🔮 Future)
- Docker + cloud deployment
- User authentication
- Multi-user sessions
- Mobile app
العربية · 中文 · Español · Français · Deutsch · 日本語 · Русский
PRs are welcome! See CONTRIBUTING.md for guidelines.
```bash
# Run backend in dev mode with auto-reload
uvicorn main:app --reload --port 8000

# Run frontend in dev mode
npm run dev
```

- Issues / Bugs: GitHub Issues
- Discussions: GitHub Discussions
- Email: samo.hossam@gmail.com
- X / Twitter: @OssamaHashim
- LinkedIn: Ossama Hashim
MIT — see LICENSE for details.
Built by Ossama Hashim (SamoTech) · Hephaestus — the AI that sees your world.