# Studio Server

AI-powered studio utilities for video production.

## Features

- **TTS (Text-to-Speech)** - Voice synthesis with cloning support (Qwen3-TTS)
- **Face Embedding** - Extract face embeddings for IP-Adapter FaceID (InsightFace)
- **Transcription** - Audio transcription with word-level timestamps (Whisper)
- **Modular backends** - Pluggable architecture for each capability
- **GPU accelerated** - CUDA support for fast inference
## Endpoints

| Capability | Endpoint | Description |
|---|---|---|
| TTS | `POST /v1/tts/synthesize` | Synthesize speech from text |
| TTS | `POST /v1/tts/extract` | Extract a reusable voice prompt |
| TTS | `GET /v1/tts/speakers` | List available speakers |
| Face | `POST /v1/face/embed` | Extract a face embedding from an image |
| Face | `POST /v1/face/embed-all` | Extract all faces from an image |
| Face | `POST /v1/face/compare` | Compare two face embeddings |
| Transcription | `POST /v1/transcribe` | Transcribe audio with word timings |
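All of the `POST` endpoints except `/v1/face/compare` accept multipart form uploads (the `-F` flags in the curl examples below). For reference, here is a sketch of hand-building such a request body with only the Python standard library; `build_multipart` is an illustrative helper, not part of studio-server, and a real client would normally let a library such as `requests` do this.

```python
# Sketch: hand-building a multipart/form-data body, mirroring curl's -F fields.
# Illustrative only; not part of the studio-server codebase.
import io
import uuid

def build_multipart(fields, files):
    """Encode text fields and {name: (filename, bytes)} files into one body."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        buf.write(part.encode())
    for name, (filename, data) in files.items():
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n"
        )
        buf.write(head.encode())
        buf.write(data)
        buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

# Body for a voice-cloning synthesis request (placeholder audio bytes):
body, content_type = build_multipart(
    {"text": "Hello, this is my cloned voice."},
    {"ref_audio": ("reference.wav", b"\x00" * 16)},
)
```

The returned `body` and `content_type` can be sent with `urllib.request.Request`, setting `Content-Type` to the returned value.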
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run server (loads all backends)
python server.py

# Run with specific backends disabled
FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py
```

## API

### POST /v1/tts/synthesize

Synthesize speech from text.
Parameters:

- `text` (required): Text to synthesize
- `language`: Target language (default: `"English"`)
- `speaker`: Preset speaker for basic TTS (e.g., `"Vivian"`, `"Ryan"`)
- `ref_audio`: Reference audio file for voice cloning (on-the-fly)
- `ref_text`: Transcript of the reference audio (improves cloning quality)
- `voice_prompt`: Pre-extracted voice prompt from `/v1/tts/extract` (cached)
- `speed`: Speech speed multiplier (default: `1.0`)
Examples:

```bash
# Basic TTS with default speaker
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, world!" \
  -o output.wav

# Voice cloning with reference audio
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, this is my cloned voice." \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript of my reference audio." \
  -o cloned.wav
```

### POST /v1/tts/extract

Extract a reusable voice prompt from reference audio.
```bash
VOICE_PROMPT=$(curl -X POST http://localhost:8000/v1/tts/extract \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript." \
  | jq -r '.voice_prompt')

# Reuse for multiple synthesis requests
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=First sentence." \
  -F "voice_prompt=$VOICE_PROMPT" \
  -o output.wav
```

### POST /v1/face/embed

Extract a face embedding from an image for IP-Adapter FaceID (PerformerDNA).
```bash
curl -X POST http://localhost:8000/v1/face/embed \
  -F "image=@portrait.jpg" \
  -F "return_bbox=true"
```

Response:
```json
{
  "embedding": "base64-encoded-512-dim-vector",
  "embedding_dim": 512,
  "confidence": 0.98,
  "bbox": [100, 50, 300, 350]
}
```

### POST /v1/face/embed-all

Extract embeddings for all faces in an image.
```bash
curl -X POST http://localhost:8000/v1/face/embed-all \
  -F "image=@group-photo.jpg" \
  -F "max_faces=5"
```

### POST /v1/face/compare

Compare two face embeddings for similarity.
```bash
curl -X POST http://localhost:8000/v1/face/compare \
  -H "Content-Type: application/json" \
  -d '{"embedding1": "...", "embedding2": "..."}'
```

Response:
```json
{
  "similarity": 0.85,
  "same_person": true
}
```

### POST /v1/transcribe

Transcribe audio with word-level timestamps for lip-sync alignment.
```bash
curl -X POST http://localhost:8000/v1/transcribe \
  -F "audio=@speech.wav" \
  -F "word_timestamps=true"
```

Response:
```json
{
  "text": "Hello, this is a test.",
  "language": "en",
  "duration": 2.5,
  "word_timings": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
    {"word": "this", "start": 0.5, "end": 0.7, "confidence": 0.95},
    ...
  ]
}
```

## Configuration

| Variable | Default | Description |
|---|---|---|
| `TTS_BACKEND` | `qwen3-tts` | TTS backend (`qwen3-tts`, `mock`) |
| `FACE_ENABLED` | `true` | Enable face embedding backend |
| `FACE_BACKEND` | `insightface` | Face backend to use |
| `TRANSCRIPTION_ENABLED` | `true` | Enable transcription backend |
| `TRANSCRIPTION_BACKEND` | `whisper` | Transcription backend to use |
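To make the flags above concrete, here is a sketch of how they could be read from the environment. The parsing rule (a case-insensitive `false`/`0`/`no` disables a backend, anything else enables it) is an assumption; `server.py` may interpret the flags differently.

```python
# Sketch: reading the configuration variables above from the environment.
# The boolean-parsing rule is an assumption, not necessarily what server.py does.
import os

def env_flag(name: str, default: bool = True) -> bool:
    """Interpret an environment variable as a boolean feature flag."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"false", "0", "no"}

config = {
    "tts_backend": os.getenv("TTS_BACKEND", "qwen3-tts"),
    "face_enabled": env_flag("FACE_ENABLED"),
    "face_backend": os.getenv("FACE_BACKEND", "insightface"),
    "transcription_enabled": env_flag("TRANSCRIPTION_ENABLED"),
    "transcription_backend": os.getenv("TRANSCRIPTION_BACKEND", "whisper"),
}
```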
For local development without a GPU, use the mock TTS backend:

```bash
TTS_BACKEND=mock FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py
```

## Docker

```bash
# Build
docker build -t studio-server .

# Run with GPU (all backends)
docker run --gpus all -p 8000:8000 studio-server

# Run TTS only
docker run --gpus all -p 8000:8000 \
  -e FACE_ENABLED=false \
  -e TRANSCRIPTION_ENABLED=false \
  studio-server
```

## Project Structure

```
studio-server/
├── server.py             # FastAPI application
├── backends/
│   ├── __init__.py
│   ├── base.py           # Base Backend class
│   ├── tts.py            # TTS backends (Qwen3-TTS)
│   ├── face.py           # Face backends (InsightFace)
│   └── transcription.py  # Transcription backends (Whisper)
├── tests/
├── requirements.txt
├── Dockerfile
└── README.md
```
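`backends/base.py` defines the shared backend contract. Its exact contents aren't shown here, so the following is only a plausible sketch of such a base class, using the two methods (`load`, `get_info`) that the TTS example below also implements:

```python
# Sketch of a shared backend interface like the one backends/base.py could
# define. Method set and signatures are assumptions, not the real base.py.
from abc import ABC, abstractmethod

class Backend(ABC):
    """Common contract for TTS, face, and transcription backends."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights; called once at server startup."""

    @abstractmethod
    def get_info(self) -> dict:
        """Return backend metadata (name, device, model version, ...)."""

class EchoBackend(Backend):
    """Trivial concrete backend, e.g. for tests."""

    def load(self) -> None:
        pass

    def get_info(self) -> dict:
        return {"backend": "echo"}
```

An abstract base class like this lets the server load and introspect every backend uniformly, regardless of capability.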
## Adding a New Backend

Each backend type has its own module. To add a new backend:

```python
# backends/tts.py
from typing import List, Tuple

class MyTTSBackend(TTSBackend):
    def load(self) -> None:
        # Load your model here; called once at startup
        pass

    def synthesize(self, text, language, speaker, ref_audio, ref_text, speed) -> Tuple[bytes, int]:
        # Generate audio and return raw WAV bytes plus the sample rate
        return wav_bytes, sample_rate

    def get_info(self) -> dict:
        return {"backend": "my-tts", ...}

    def get_speakers(self) -> List[str]:
        return ["speaker1", "speaker2"]

# Register in the TTS_BACKENDS dict
TTS_BACKENDS["my-tts"] = MyTTSBackend
```

## Legacy Endpoints

For backwards compatibility, the old TTS endpoints are still available:
- `GET /v1/speakers` → `/v1/tts/speakers`
- `POST /v1/voice/extract` → `/v1/tts/extract`
- `POST /v1/audio/speech` → `/v1/tts/synthesize`
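As a concrete instance of the backend pattern shown above, the `mock` TTS backend only needs to return valid audio bytes. The following self-contained sketch synthesizes silence sized to the input text; the bundled `mock` backend may behave differently.

```python
# Sketch: a mock synthesizer in the spirit of the `mock` TTS backend.
# Returns a valid silent WAV whose duration scales with the text length.
# Illustrative only; the real mock backend may differ.
import io
import wave

def synthesize_silence(text: str, sample_rate: int = 24000,
                       seconds_per_char: float = 0.05) -> tuple:
    """Return (wav_bytes, sample_rate) for a silent clip sized to the text."""
    n_samples = max(1, int(len(text) * seconds_per_char * sample_rate))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_samples)
    return buf.getvalue(), sample_rate

wav_bytes, sr = synthesize_silence("Hello, world!")
```

Returning `(wav_bytes, sample_rate)` matches the tuple shape that the `synthesize` skeleton above expects, so this body could drop into a `TTSBackend` subclass.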
## License

MIT