Upload a photo · Clone a voice · Talk to any face in real time
Quick Start · Features · Architecture · Voice Cloning · API · Deploy · Roadmap · Contribute
The most complete open-source AI talking avatar system. Real-time lip-sync · Zero-shot voice cloning · Multi-LLM · Runs 100% locally.
AvatarAI is an open-source, production-ready platform for building photorealistic AI avatar conversations. Upload any face photo, clone a voice from a 5-second audio clip, and have a real-time conversation, with lip-sync video generated for every response.

[mic input] → Whisper STT → Claude / GPT-4 / Llama → XTTS v2 TTS → MuseTalk lip-sync → [video]

< 200 ms end-to-end on GPU
What makes AvatarAI different:
- Zero-shot voice cloning – 5 seconds of audio is all you need (XTTS v2)
- Any face, any language – upload a JPEG, pick from 18 languages, start talking
- True real-time WebSocket pipeline – no polling, no page reloads
- 100% local mode – nothing leaves your machine
- 3 LLM backends – Claude, GPT-4, or Llama 3 via Ollama (free, local)
- Production-ready – JWT auth, rate limiting, Prometheus, Sentry, CI/CD
| Category | Details |
|---|---|
| LLM Backends | Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (Ollama, local) |
| Voice Cloning | Record 5–30 s → XTTS v2 zero-shot speaker embedding, applied on every TTS call |
| Speech-to-Text | OpenAI Whisper (faster-whisper, CUDA-accelerated), 18+ languages |
| Lip-Sync Video | MuseTalk · simple fallback – photorealistic video for every response |
| Real-Time Pipeline | WebSocket: STT → LLM → TTS → animator → video in one continuous pass |
| Emotion Detection | Live emotion badges per message (happy · angry · sad · excited · curious) |
| 18+ Languages | Whisper multilingual STT + XTTS v2 multilingual TTS |
| Local-First Storage | USE_LOCAL_STORAGE=true – files served from uploads/, no AWS needed |
| Auth & Sessions | JWT authentication, conversation history, persistent sessions |
| Observability | Prometheus metrics · Celery Flower · Sentry error tracking · request logging |
| Tested | Full pytest suite – users, avatars, sessions, health checks |
| CI/CD Ready | GitHub Actions: lint + test + Docker build + deploy |
┌──────────────────────────────────────────────────────────────────────┐
│                           Browser / Client                           │
│  ┌───────────────┐   ┌────────────────┐   ┌────────────────────────┐ │
│  │ Avatar Studio │   │  Voice Studio  │   │     Chat Interface     │ │
│  │   (upload)    │   │   (cloning)    │   │   WebSocket + video    │ │
│  └───────┬───────┘   └───────┬────────┘   └───────────┬────────────┘ │
└──────────┼───────────────────┼────────────────────────┼──────────────┘
           │ REST              │ REST                   │ WebSocket
           ▼                   ▼                        ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           FastAPI Backend                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────────┐    │
│  │ /avatars │  │ /voices  │  │ /sessions│  │     /messages      │    │
│  └──────────┘  └──────────┘  └──────────┘  └────────────────────┘    │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                       WebSocket Manager                        │  │
│  └────────────────────────────────────────────────────────────────┘  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────────┐    │
│  │ Whisper  │  │ Claude / │  │ XTTS v2  │  │     MuseTalk /     │    │
│  │   STT    │  │ GPT/Llama│  │   TTS    │  │       Simple       │    │
│  └──────────┘  └──────────┘  └──────────┘  └────────────────────┘    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────────────┐    │
│  │PostgreSQL│  │  Redis   │  │  Celery  │  │   Local FS / S3    │    │
│  └──────────┘  └──────────┘  └──────────┘  └────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘
[Mic / Text Input]
        │
        ▼
Whisper STT ───────────────► transcript text
        │
        ▼
Claude / GPT-4 / Llama ────► assistant response
        │
        ▼
XTTS v2 TTS ───────────────► audio WAV
  (+ cloned speaker_wav)
        │
        ▼
MuseTalk ──────────────────► lip-sync MP4
        │
        ▼
WebSocket push ────────────► browser <video> element
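The flow above can be sketched as one async pass. This is an illustrative stub only: the `stt`/`llm`/`tts`/`lipsync` names and return shapes are hypothetical placeholders for the real Whisper, LLM, XTTS v2, and MuseTalk calls, not AvatarAI's actual internals.

```python
import asyncio

# Stub stages; each stands in for a real model call and just shows
# how one stage's output feeds the next.
async def stt(audio: bytes) -> str:
    return "hello"

async def llm(prompt: str) -> str:
    return f"You said: {prompt}"

async def tts(text: str) -> bytes:
    return text.encode()

async def lipsync(audio: bytes) -> str:
    return "/uploads/video.mp4"

async def pipeline(audio: bytes) -> dict:
    """One continuous pass: audio in, transcript + reply + video URL out."""
    transcript = await stt(audio)
    reply = await llm(transcript)
    wav = await tts(reply)
    video_url = await lipsync(wav)
    return {"transcription": transcript, "message": reply, "video_url": video_url}

result = asyncio.run(pipeline(b"<webm bytes>"))
print(result["message"])  # You said: hello
```

The single-pass shape is what keeps latency low: nothing is polled, and each stage starts as soon as the previous one returns.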
- Docker & Docker Compose (recommended; one-command setup)
- OR: Python 3.11+, Node.js 18+, FFmpeg, PostgreSQL, Redis
git clone https://github.com/PunithVT/ai-avatar-system.git
cd ai-avatar-system
cp .env.example .env # set your ANTHROPIC_API_KEY (or OPENAI_API_KEY)
docker compose up -d

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| Swagger Docs | http://localhost:8000/docs |
| Celery Flower | http://localhost:5555 |
| Prometheus | http://localhost:9090 |
# Backend
cd backend
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # fill in your API key
alembic upgrade head # run DB migrations
uvicorn main:app --reload --port 8000
# Frontend (new terminal)
cd frontend
npm install
npm run dev

No AWS required. All uploads and generated videos are saved to backend/uploads/ and served at http://localhost:8000/uploads/ by default.
AvatarAI ships a full Voice Studio powered by XTTS v2, a state-of-the-art zero-shot voice cloning model.
Clone a voice in 3 steps:
- Navigate to the Voice tab → click Clone Voice
- Record 5–30 seconds of clear speech (or upload a WAV/MP3)
- Name it → click Clone This Voice → select it for your session
Once selected, every TTS response uses your cloned voice. Voice selection is sent to the backend via the WebSocket set_voice message.
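As a sketch of that exchange, a client frame is just a JSON object with a `type` field. The `frame` helper below is hypothetical and stdlib-only; a real client would send these strings over the `/ws/session/{session_id}` socket described in the API section.

```python
import json

def frame(msg_type: str, **fields) -> str:
    """Serialize one client -> server WebSocket frame as JSON text."""
    return json.dumps({"type": msg_type, **fields})

# Switch the session to a cloned voice, then send a text turn.
set_voice = frame("set_voice", voice_wav_path="/path/to/speaker.wav")
text_turn = frame("text", text="Hello!")

print(set_voice)  # {"type": "set_voice", "voice_wav_path": "/path/to/speaker.wav"}
print(text_turn)  # {"type": "text", "text": "Hello!"}
```

Because voice selection is per-session state, sending `set_voice` once is enough; every later TTS call in the session reuses the cloned speaker.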
# Clone a voice from audio
curl -X POST http://localhost:8000/api/v1/voices/clone \
-F "audio=@my_voice.wav" -F "name=My Voice" -F "language=en"
# List all voice profiles
curl http://localhost:8000/api/v1/voices/
# Delete a voice profile
curl -X DELETE http://localhost:8000/api/v1/voices/{voice_id}

# Register a new user
POST /api/v1/users/register
{ "email": "user@example.com", "username": "alice", "password": "secret" }
# Login (returns JWT access token)
POST /api/v1/users/login (form: username=... password=...)
# Use token on all protected routes
Authorization: Bearer <access_token>

POST   /api/v1/avatars/upload        Upload avatar image (multipart: file + name)
GET /api/v1/avatars/ List all avatars
GET /api/v1/avatars/{id} Get avatar details
DELETE /api/v1/avatars/{id} Delete avatar
POST /api/v1/voices/clone Clone voice from audio sample
GET /api/v1/voices/ List all voice profiles
GET /api/v1/voices/{id} Get voice profile
DELETE /api/v1/voices/{id} Delete voice profile
POST /api/v1/sessions/create Create session { "avatar_id": "..." }
POST /api/v1/sessions/{id}/end End a session
GET /api/v1/messages/session/{id} Get conversation history
POST /api/v1/messages/send Send a message (REST fallback)
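For instance, the register-then-authenticate flow can be sketched with the standard library alone. This is a sketch, not a client library: it only builds the requests (nothing is sent), assuming the JSON body and Bearer-token header shown above.

```python
import json
import urllib.request

BASE = "http://localhost:8000/api/v1"

def register_request(email: str, username: str, password: str) -> urllib.request.Request:
    """Build POST /users/register with a JSON body."""
    body = json.dumps({"email": email, "username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE}/users/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def auth_header(token: str) -> dict:
    """Every protected route expects the JWT as a Bearer token."""
    return {"Authorization": f"Bearer {token}"}

req = register_request("user@example.com", "alice", "secret")
print(req.full_url, req.get_method())
```

A real client would `urllib.request.urlopen(req)` (or use `requests`/`httpx`), read the access token from the login response, and pass `auth_header(token)` on every protected call.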
WS /ws/session/{session_id}
Client → Server:
{ "type": "text", "text": "Hello!" }
{ "type": "audio", "audio": "<base64-encoded-webm>" }
{ "type": "set_voice", "voice_wav_path": "/path/to/speaker.wav" }
{ "type": "ping" }

Server → Client:
{ "type": "transcription", "text": "Hello!" }
{ "type": "message", "content": "Hi there!", "role": "assistant" }
{ "type": "video", "video_url": "http://localhost:8000/uploads/video.mp4" }
{ "type": "status", "message": "Generating response…", "stage": "llm" }
{ "type": "error", "message": "Something went wrong" }
{ "type": "pong" }

All settings are in .env. Key variables:
# ── LLM ─────────────────────────────────────────────────────────────
LLM_PROVIDER=anthropic # anthropic | openai | ollama
LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
# ── Avatar Animation Engine ─────────────────────────────────────────
AVATAR_ENGINE=musetalk # musetalk | simple
# ── TTS ─────────────────────────────────────────────────────────────
TTS_PROVIDER=coqui # coqui (XTTS v2) | elevenlabs | bark
ELEVENLABS_API_KEY=...
# ── STT ─────────────────────────────────────────────────────────────
WHISPER_MODEL=base # tiny | base | small | medium | large-v3
# ── Storage ─────────────────────────────────────────────────────────
USE_LOCAL_STORAGE=true # false = AWS S3
LOCAL_STORAGE_PATH=uploads
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
S3_BUCKET_NAME=...
# ── Auth ────────────────────────────────────────────────────────────
SECRET_KEY=change-me-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=1440
# ── Observability ───────────────────────────────────────────────────
SENTRY_DSN=https://...@sentry.io/...
PROMETHEUS_ENABLED=true
LOG_LEVEL=INFO

| Library | Purpose |
|---|---|
| Next.js 14 + React 18 | App framework |
| TypeScript 5 | Type safety |
| Tailwind CSS | Styling |
| TanStack Query | Server state |
| Zustand | Global client state |
| Canvas API | Real-time waveform visualizer |
| Library | Purpose |
|---|---|
| FastAPI | Async REST API + WebSocket |
| SQLAlchemy 2 (async) | ORM with asyncpg |
| PostgreSQL 15 | Primary database |
| Alembic | Database migrations |
| Redis 7 | Cache + task queue |
| Celery | Background tasks |
| Prometheus + Sentry | Metrics + error tracking |
| Model | Purpose |
|---|---|
| Claude / GPT-4o / Llama 3 | LLM conversation |
| Whisper (faster-whisper) | Speech-to-text |
| XTTS v2 (Coqui TTS) | Text-to-speech + zero-shot voice cloning |
| MuseTalk V1.5 | Photorealistic lip-sync video generation |
| Simple animator | Fallback lip-sync animator |
docker compose -f docker-compose.prod.yml up -d

cd infrastructure
terraform init
terraform apply -var="environment=production"
./deploy.sh production

Set USE_LOCAL_STORAGE=false and add S3 credentials for cloud storage in production.
cd backend
pip install -r requirements.txt
pytest -v # run all tests
pytest tests/test_health.py # specific module
pytest --cov=app --cov-report=html  # with HTML coverage report

- Embeddable avatar widget – drop a talking avatar into any website with 3 lines of JS
- Streaming LLM – start TTS before the LLM finishes (token-by-token)
- Emotion-driven animation – detected emotion changes the avatar's facial expression
- Multi-avatar conversations – two avatars talking to each other
- Long-term memory – RAG + vector DB for persistent conversation context
- Mobile app – React Native iOS/Android client
- Avatar marketplace – share and download community-created avatars
- Video call integration – replace your face in a live Zoom/Meet call
Q: Do I need a GPU? A: No; everything runs on CPU (slower). An NVIDIA GPU with 8 GB+ VRAM is recommended for real-time performance (~200 ms latency).
Q: Can I use it with no API key at all?
A: Yes; set LLM_PROVIDER=ollama and run Ollama locally. Fully offline, fully free.
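Assuming the variables from the configuration section, a fully-local `.env` might look like this (a sketch; the Ollama model name depends on what you have pulled locally):

```env
LLM_PROVIDER=ollama
LLM_MODEL=llama3
TTS_PROVIDER=coqui       # XTTS v2, runs locally
WHISPER_MODEL=base
USE_LOCAL_STORAGE=true
```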
Q: How long does voice cloning take? A: XTTS v2 creates the speaker embedding in ~2 seconds from a 10-second sample. Each TTS call is then ~500ms on GPU.
Q: What file formats are supported for avatar photos? A: JPEG, PNG, WebP. A clear frontal face photo gives the best lip-sync results.
Q: Is this production-ready? A: The platform includes JWT auth, rate limiting, security headers, Sentry error tracking, Prometheus metrics, Alembic migrations, and a full test suite. Ready for private/internal production deployment.
Contributions are welcome! Please read CONTRIBUTING.md before opening a PR. This project follows Conventional Commits.
# Fork & clone
git clone https://github.com/YOUR_USERNAME/ai-avatar-system.git
cd ai-avatar-system
# Create a feature branch
git checkout -b feat/my-feature
# Make changes, write tests, commit
git commit -m "feat(backend): add my feature"
# Push & open a pull request
git push origin feat/my-feature

Look for issues tagged good first issue to get started.
MIT © 2025 – see LICENSE for details.