🎭 AvatarAI — Real-Time AI Avatar Platform

Upload a photo · Clone a voice · Talk to any face in real time


Quick Start · Features · Architecture · Voice Cloning · API · Deploy · Roadmap · Contribute

The most complete open-source AI talking avatar system. Real-time lip-sync · Zero-shot voice cloning · Multi-LLM · Runs 100% locally.


🎬 What is AvatarAI?

AvatarAI is an open-source, production-ready platform for building photorealistic AI avatar conversations. Upload any face photo, clone a voice from a 5-second audio clip, and have a real-time conversation — with lip-sync video generated on every single response.

[mic input]  →  Whisper STT  →  Claude / GPT-4 / Llama  →  XTTS v2 TTS  →  MuseTalk lip-sync  →  [video]
                                  < 200 ms end-to-end on GPU >

What makes AvatarAI different:

  • 🎤 Zero-shot voice cloning — 5 seconds of audio is all you need (XTTS v2)
  • 🎭 Any face, any language — upload a JPEG, pick from 18 languages, start talking
  • ⚡ True real-time WebSocket pipeline — no polling, no page reloads
  • 🔒 100% local mode — nothing leaves your machine
  • 🔌 3 LLM backends — Claude, GPT-4, or Llama 3 via Ollama (free, local)
  • 🏗️ Production-ready — JWT auth, rate limiting, Prometheus, Sentry, CI/CD

✨ Features

| Category | Details |
|---|---|
| 🤖 LLM Backends | Claude (Anthropic) · GPT-4o (OpenAI) · Llama 3 (Ollama, local) |
| 🎤 Voice Cloning | Record 5–30 s → XTTS v2 zero-shot speaker embedding, applied on every TTS call |
| 🗣️ Speech-to-Text | OpenAI Whisper (faster-whisper, CUDA-accelerated), 18+ languages |
| 🎬 Lip-Sync Video | MuseTalk · Simple fallback — photorealistic, on every response |
| ⚡ Real-Time Pipeline | WebSocket: STT → LLM → TTS → Animator → video in one continuous pass |
| 😊 Emotion Detection | Live emotion badges per message (happy · angry · sad · excited · curious) |
| 🌍 18+ Languages | Whisper multilingual STT + XTTS v2 multilingual TTS |
| 🏠 Local-First Storage | USE_LOCAL_STORAGE=true — files served from uploads/, no AWS needed |
| 🔐 Auth & Sessions | JWT authentication, conversation history, persistent sessions |
| 📊 Observability | Prometheus metrics · Celery Flower · Sentry error tracking · request logging |
| 🧪 Tested | Full pytest suite — users, avatars, sessions, health checks |
| 🚀 CI/CD Ready | GitHub Actions: lint + test + Docker build + deploy |

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                           Browser / Client                           │
│  ┌───────────────┐   ┌────────────────┐   ┌────────────────────────┐ │
│  │ Avatar Studio │   │  Voice Studio  │   │    Chat Interface      │ │
│  │   (upload)    │   │   (cloning)    │   │  WebSocket + video     │ │
│  └───────┬───────┘   └────────┬───────┘   └───────────┬────────────┘ │
└──────────┼────────────────────┼───────────────────────┼──────────────┘
           │ REST               │ REST                  │ WebSocket
           ▼                    ▼                       ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           FastAPI Backend                            │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌───────────────────┐    │
│  │ /avatars │  │ /voices  │  │ /sessions │  │    /messages      │    │
│  └──────────┘  └──────────┘  └───────────┘  └───────────────────┘    │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                      WebSocket Manager                       │    │
│  └──────────────────────────────────────────────────────────────┘    │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌───────────────────┐    │
│  │ Whisper  │  │ Claude / │  │  XTTS v2  │  │    MuseTalk /     │    │
│  │   STT    │  │ GPT/Llama│  │    TTS    │  │  Simple animator  │    │
│  └──────────┘  └──────────┘  └───────────┘  └───────────────────┘    │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌───────────────────┐    │
│  │PostgreSQL│  │  Redis   │  │  Celery   │  │  Local FS / S3    │    │
│  └──────────┘  └──────────┘  └───────────┘  └───────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘

One conversation turn — data flow

[Mic / Text Input]
      │
      ▼
 Whisper STT ──────────────► transcript text
      │
      ▼
 Claude / GPT-4 / Llama ───► assistant response
      │
      ▼
 XTTS v2 TTS ──────────────► audio WAV
 (+ cloned speaker_wav)
      │
      ▼
 MuseTalk ─────────────────► lip-sync MP4
      │
      ▼
 WebSocket push ───────────► browser <video> element
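
The turn above can be sketched as one orchestration function. This is a minimal illustration only — the function names (`transcribe`, `generate_reply`, `synthesize`, `lip_sync`) are hypothetical stand-ins passed in as callables, not the project's actual internal API:

```python
from dataclasses import dataclass

@dataclass
class TurnResult:
    transcript: str   # Whisper output
    reply: str        # LLM output
    video_path: str   # lip-synced MP4

def run_turn(audio_wav: bytes, *, transcribe, generate_reply,
             synthesize, lip_sync) -> TurnResult:
    """One conversation turn: STT -> LLM -> TTS -> lip-sync, in a single pass."""
    transcript = transcribe(audio_wav)   # Whisper STT
    reply = generate_reply(transcript)   # Claude / GPT-4 / Llama
    wav = synthesize(reply)              # XTTS v2 (+ cloned speaker_wav)
    video = lip_sync(wav)                # MuseTalk -> MP4
    return TurnResult(transcript, reply, video)
```

In the real pipeline each stage streams its result over the WebSocket as soon as it finishes, rather than waiting for the full `TurnResult`.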

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose (recommended — one-command setup)
  • OR: Python 3.11+, Node.js 18+, FFmpeg, PostgreSQL, Redis

Option A — Docker (recommended)

git clone https://github.com/PunithVT/ai-avatar-system.git
cd ai-avatar-system
cp .env.example .env          # set your ANTHROPIC_API_KEY (or OPENAI_API_KEY)
docker compose up -d

| Service | URL |
|---|---|
| 🖥️ Frontend | http://localhost:3000 |
| ⚙️ Backend API | http://localhost:8000 |
| 📖 Swagger Docs | http://localhost:8000/docs |
| 🌸 Celery Flower | http://localhost:5555 |
| 📊 Prometheus | http://localhost:9090 |

Option B — Manual (development)

# Backend
cd backend
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env                              # fill in your API key
alembic upgrade head                              # run DB migrations
uvicorn main:app --reload --port 8000

# Frontend (new terminal)
cd frontend
npm install
npm run dev

Open http://localhost:3000

No AWS required. All uploads and generated videos are saved to backend/uploads/ and served at http://localhost:8000/uploads/ by default.


🎤 Voice Cloning

AvatarAI ships a full Voice Studio powered by XTTS v2 — state-of-the-art zero-shot voice cloning.

Clone a voice in 3 steps:

  1. Navigate to the Voice tab → click Clone Voice
  2. Record 5–30 seconds of clear speech (or upload a WAV/MP3)
  3. Name it → click Clone This Voice → select it for your session

Once selected, every TTS response uses your cloned voice. Voice selection is sent to the backend via the WebSocket set_voice message.
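
For programmatic clients, the `set_voice` frame is plain JSON. A small sketch — the payload shape follows the WebSocket message format documented in the API reference; actually sending it requires a WebSocket client such as the third-party `websockets` package:

```python
import json

def set_voice_msg(voice_wav_path: str) -> str:
    """Build the set_voice frame that switches the session's TTS speaker."""
    return json.dumps({"type": "set_voice", "voice_wav_path": voice_wav_path})

# Sending it over an open session socket (sketch):
#   await ws.send(set_voice_msg("/path/to/speaker.wav"))
```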

Voice REST API

# Clone a voice from audio
curl -X POST http://localhost:8000/api/v1/voices/clone \
  -F "audio=@my_voice.wav" -F "name=My Voice" -F "language=en"

# List all voice profiles
curl http://localhost:8000/api/v1/voices/

# Delete a voice profile
curl -X DELETE http://localhost:8000/api/v1/voices/{voice_id}

📑 API Reference

Authentication

# Register a new user
POST /api/v1/users/register
{ "email": "user@example.com", "username": "alice", "password": "secret" }

# Login (returns JWT access token)
POST /api/v1/users/login    (form: username=... password=...)

# Use token on all protected routes
Authorization: Bearer <access_token>
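
The same flow in Python, using only the standard library. This is a sketch: it assumes the login response carries the token in an `access_token` field (the usual OAuth2 password-flow shape) — check the Swagger docs at /docs for the exact schema:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000/api/v1"

def bearer_header(token: str) -> dict:
    """Header to attach to every protected route."""
    return {"Authorization": f"Bearer {token}"}

def register(email: str, username: str, password: str) -> None:
    """POST /users/register with a JSON body."""
    body = json.dumps({"email": email, "username": username,
                       "password": password}).encode()
    req = urllib.request.Request(f"{BASE}/users/register", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).close()

def login(username: str, password: str) -> dict:
    """POST /users/login (form-encoded fields, not JSON) -> auth header."""
    body = urllib.parse.urlencode({"username": username,
                                   "password": password}).encode()
    with urllib.request.urlopen(f"{BASE}/users/login", data=body) as resp:
        return bearer_header(json.load(resp)["access_token"])
```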

Avatars

POST   /api/v1/avatars/upload        Upload avatar image (multipart: file + name)
GET    /api/v1/avatars/              List all avatars
GET    /api/v1/avatars/{id}          Get avatar details
DELETE /api/v1/avatars/{id}          Delete avatar
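
For example, uploading an avatar from Python. A sketch: the multipart request uses the third-party `requests` package (imported lazily so the pure helper stays dependency-free), and the extension check mirrors the supported photo formats (JPEG, PNG, WebP) noted in the FAQ:

```python
def allowed_image(path: str) -> bool:
    """Client-side sanity check for the formats the upload endpoint accepts."""
    return path.lower().endswith((".jpg", ".jpeg", ".png", ".webp"))

def upload_avatar(path: str, name: str, headers: dict,
                  base: str = "http://localhost:8000/api/v1") -> dict:
    """POST /api/v1/avatars/upload (multipart: file + name)."""
    import requests  # third-party: pip install requests
    if not allowed_image(path):
        raise ValueError(f"unsupported image format: {path}")
    with open(path, "rb") as f:
        resp = requests.post(f"{base}/avatars/upload",
                             files={"file": f}, data={"name": name},
                             headers=headers)  # e.g. the Bearer auth header
    resp.raise_for_status()
    return resp.json()
```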

Voice Cloning

POST   /api/v1/voices/clone          Clone voice from audio sample
GET    /api/v1/voices/               List all voice profiles
GET    /api/v1/voices/{id}           Get voice profile
DELETE /api/v1/voices/{id}           Delete voice profile

Sessions & Messages

POST   /api/v1/sessions/create       Create session  { "avatar_id": "..." }
POST   /api/v1/sessions/{id}/end     End a session
GET    /api/v1/messages/session/{id} Get conversation history
POST   /api/v1/messages/send         Send a message (REST fallback)

WebSocket Real-Time Channel

WS  /ws/session/{session_id}

Client → Server:

{ "type": "text",      "text": "Hello!" }
{ "type": "audio",     "audio": "<base64-encoded-webm>" }
{ "type": "set_voice", "voice_wav_path": "/path/to/speaker.wav" }
{ "type": "ping" }

Server → Client:

{ "type": "transcription", "text": "Hello!" }
{ "type": "message",       "content": "Hi there!", "role": "assistant" }
{ "type": "video",         "video_url": "http://localhost:8000/uploads/video.mp4" }
{ "type": "status",        "message": "Generating response…", "stage": "llm" }
{ "type": "error",         "message": "Something went wrong" }
{ "type": "pong" }
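
A minimal Python client for this channel. Illustrative only — the frame shapes are taken from the lists above, and the third-party `websockets` package is assumed for the connection (imported lazily so the parsing helper has no dependencies):

```python
import asyncio
import json

def parse_event(raw: str) -> tuple[str, dict]:
    """Split a server frame into (type, full payload)."""
    msg = json.loads(raw)
    return msg.get("type", ""), msg

async def one_turn(session_id: str, text: str,
                   base: str = "ws://localhost:8000") -> str:
    """Send one text message and wait for the lip-sync video URL."""
    import websockets  # third-party: pip install websockets
    async with websockets.connect(f"{base}/ws/session/{session_id}") as ws:
        await ws.send(json.dumps({"type": "text", "text": text}))
        async for raw in ws:
            kind, msg = parse_event(raw)
            if kind == "message":
                print("assistant:", msg["content"])
            elif kind == "status":
                print("status:", msg["message"])
            elif kind == "error":
                raise RuntimeError(msg["message"])
            elif kind == "video":
                return msg["video_url"]  # turn complete

# asyncio.run(one_turn("<session_id>", "Hello!"))
```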

βš™οΈ Configuration

All settings are in .env. Key variables:

# ── LLM ──────────────────────────────────────────────────────────────
LLM_PROVIDER=anthropic            # anthropic | openai | ollama
LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

# ── Avatar Animation Engine ───────────────────────────────────────────
AVATAR_ENGINE=musetalk           # musetalk | simple

# ── TTS ──────────────────────────────────────────────────────────────
TTS_PROVIDER=coqui                # coqui (XTTS v2) | elevenlabs | bark
ELEVENLABS_API_KEY=...

# ── STT ──────────────────────────────────────────────────────────────
WHISPER_MODEL=base                # tiny | base | small | medium | large-v3

# ── Storage ──────────────────────────────────────────────────────────
USE_LOCAL_STORAGE=true            # false = AWS S3
LOCAL_STORAGE_PATH=uploads
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
S3_BUCKET_NAME=...

# ── Auth ─────────────────────────────────────────────────────────────
SECRET_KEY=change-me-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=1440

# ── Observability ────────────────────────────────────────────────────
SENTRY_DSN=https://...@sentry.io/...
PROMETHEUS_ENABLED=true
LOG_LEVEL=INFO
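
A sketch of how the LLM settings above fit together — which provider needs which key. The validation helper is illustrative, not the project's actual settings code; note that `ollama` needs no API key at all:

```python
# API key env var required per provider; None means fully local, no key needed
REQUIRED_KEY = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "ollama": None,
}

def validate_llm_config(env: dict) -> str:
    """Return the chosen provider, or raise if its API key is missing."""
    provider = env.get("LLM_PROVIDER", "anthropic")
    if provider not in REQUIRED_KEY:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEY[provider]
    if key_name and not env.get(key_name):
        raise ValueError(f"LLM_PROVIDER={provider} requires {key_name}")
    return provider
```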

πŸ› οΈ Tech Stack

Frontend

| Library | Purpose |
|---|---|
| Next.js 14 + React 18 | App framework |
| TypeScript 5 | Type safety |
| Tailwind CSS | Styling |
| TanStack Query | Server state |
| Zustand | Global client state |
| Canvas API | Real-time waveform visualizer |

Backend

| Library | Purpose |
|---|---|
| FastAPI | Async REST API + WebSocket |
| SQLAlchemy 2 (async) | ORM with asyncpg |
| PostgreSQL 15 | Primary database |
| Alembic | Database migrations |
| Redis 7 | Cache + task queue |
| Celery | Background tasks |
| Prometheus + Sentry | Metrics + error tracking |

AI / ML

| Model | Purpose |
|---|---|
| Claude / GPT-4o / Llama 3 | LLM conversation |
| Whisper (faster-whisper) | Speech-to-text |
| XTTS v2 (Coqui TTS) | Text-to-speech + zero-shot voice cloning |
| MuseTalk V1.5 | Photorealistic lip-sync video generation |
| Simple engine (AVATAR_ENGINE=simple) | Fallback lip-sync animator |

🚢 Deployment

Self-hosted Docker

docker compose -f docker-compose.prod.yml up -d

AWS (ECS Fargate + Terraform)

cd infrastructure
terraform init
terraform apply -var="environment=production"
./deploy.sh production

Set USE_LOCAL_STORAGE=false and add S3 credentials for cloud storage in production.


🧪 Running Tests

cd backend
pip install -r requirements.txt
pytest -v                        # run all tests
pytest tests/test_health.py      # specific module
pytest --cov=app --cov-report=html  # with HTML coverage report

πŸ—ΊοΈ Roadmap

  • Embeddable avatar widget — drop a talking avatar into any website with 3 lines of JS
  • Streaming LLM — start TTS before the LLM finishes (token-by-token)
  • Emotion-driven animation — detected emotion changes the avatar's facial expression
  • Multi-avatar conversations — two avatars talking to each other
  • Long-term memory — RAG + vector DB for persistent conversation context
  • Mobile app — React Native iOS/Android client
  • Avatar marketplace — share and download community-created avatars
  • Video call integration — replace your face in a live Zoom/Meet call

❓ FAQ

Q: Do I need a GPU? A: No — everything runs on CPU, just slower. An NVIDIA GPU with 8 GB+ VRAM is recommended for real-time performance (~200 ms latency).

Q: Can I use it with no API key at all? A: Yes — set LLM_PROVIDER=ollama and run Ollama locally. Fully offline, fully free.

Q: How long does voice cloning take? A: XTTS v2 creates the speaker embedding in ~2 seconds from a 10-second sample. Each TTS call is then ~500ms on GPU.

Q: What file formats are supported for avatar photos? A: JPEG, PNG, WebP. A clear frontal face photo gives the best lip-sync results.

Q: Is this production-ready? A: The platform includes JWT auth, rate limiting, security headers, Sentry error tracking, Prometheus metrics, Alembic migrations, and a full test suite. Ready for private/internal production deployment.


🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md before opening a PR. This project follows Conventional Commits.

# Fork & clone
git clone https://github.com/YOUR_USERNAME/ai-avatar-system.git
cd ai-avatar-system

# Create a feature branch
git checkout -b feat/my-feature

# Make changes, write tests, commit
git commit -m "feat(backend): add my feature"

# Push & open a pull request
git push origin feat/my-feature

Look for issues tagged good first issue to get started.


📄 License

MIT © 2025 — see LICENSE for details.


If AvatarAI saves you time or inspires your project, please ⭐ star the repo — it helps others find it.



Built with FastAPI · Next.js · MuseTalk · XTTS v2 · Whisper · Claude AI
