happyvertical/studio-server

Studio Server

AI-powered studio utilities for video production.

Features

  • TTS (Text-to-Speech) - Voice synthesis with cloning support (Qwen3-TTS)
  • Face Embedding - Extract face embeddings for IP-Adapter FaceID (InsightFace)
  • Transcription - Audio transcription with word-level timestamps (Whisper)
  • Modular backends - Pluggable architecture for each capability
  • GPU-accelerated - CUDA support for fast inference

API Overview

| Capability    | Endpoint                  | Description                            |
| ------------- | ------------------------- | -------------------------------------- |
| TTS           | `POST /v1/tts/synthesize` | Synthesize speech from text            |
| TTS           | `POST /v1/tts/extract`    | Extract a reusable voice prompt        |
| TTS           | `GET /v1/tts/speakers`    | List available speakers                |
| Face          | `POST /v1/face/embed`     | Extract a face embedding from an image |
| Face          | `POST /v1/face/embed-all` | Extract all faces from an image        |
| Face          | `POST /v1/face/compare`   | Compare two face embeddings            |
| Transcription | `POST /v1/transcribe`     | Transcribe audio with word timings     |

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run server (loads all backends)
python server.py

# Run with specific backends disabled
FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py

TTS Endpoints

POST /v1/tts/synthesize

Synthesize speech from text.

Parameters:

  • text (required): Text to synthesize
  • language: Target language (default: "English")
  • speaker: Preset speaker for basic TTS (e.g., "Vivian", "Ryan")
  • ref_audio: Reference audio file for voice cloning (on-the-fly)
  • ref_text: Transcript of reference audio (improves cloning quality)
  • voice_prompt: Pre-extracted voice prompt from /v1/tts/extract (cached)
  • speed: Speech speed multiplier (default: 1.0)

Examples:

# Basic TTS with default speaker
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, world!" \
  -o output.wav

# Voice cloning with reference audio
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, this is my cloned voice." \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript of my reference audio." \
  -o cloned.wav

POST /v1/tts/extract

Extract a reusable voice prompt from reference audio.

VOICE_PROMPT=$(curl -X POST http://localhost:8000/v1/tts/extract \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript." \
  | jq -r '.voice_prompt')

# Reuse for multiple synthesis requests
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=First sentence." \
  -F "voice_prompt=$VOICE_PROMPT" \
  -o output.wav

Face Embedding Endpoints

POST /v1/face/embed

Extract face embedding from an image for IP-Adapter FaceID (PerformerDNA).

curl -X POST http://localhost:8000/v1/face/embed \
  -F "image=@portrait.jpg" \
  -F "return_bbox=true"

Response:

{
  "embedding": "base64-encoded-512-dim-vector",
  "embedding_dim": 512,
  "confidence": 0.98,
  "bbox": [100, 50, 300, 350]
}
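The embedding arrives base64-encoded. A minimal client-side decoder might look like the sketch below; it assumes the server packs the vector as little-endian float32 (typical for InsightFace's 512-dim embeddings, but verify against your deployment before relying on it).

```python
import base64
import struct


def decode_embedding(b64: str, dim: int = 512) -> list:
    """Decode a base64-encoded embedding into a list of floats.

    Assumes little-endian float32 packing; check the server's
    actual encoding before using this in production.
    """
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{dim}f", raw))
```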

POST /v1/face/embed-all

Extract embeddings for all faces in an image.

curl -X POST http://localhost:8000/v1/face/embed-all \
  -F "image=@group-photo.jpg" \
  -F "max_faces=5"
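When a group photo returns several faces, you typically want the most confident detection. The helper below is a sketch only: it assumes the response carries a `faces` list whose items mirror the `/v1/face/embed` payload (`embedding`, `confidence`, `bbox`), which is not documented above, so check your server's actual schema.

```python
def most_confident_face(response: dict) -> dict:
    """Pick the highest-confidence face from an /v1/face/embed-all response.

    Assumes a {"faces": [...]} shape with per-face "confidence" fields;
    this schema is a guess, not taken from the API docs.
    """
    faces = response.get("faces", [])
    if not faces:
        raise ValueError("no faces detected")
    return max(faces, key=lambda f: f.get("confidence", 0.0))
```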

POST /v1/face/compare

Compare two face embeddings for similarity.

curl -X POST http://localhost:8000/v1/face/compare \
  -H "Content-Type: application/json" \
  -d '{"embedding1": "...", "embedding2": "..."}'

Response:

{
  "similarity": 0.85,
  "same_person": true
}
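The docs don't state which metric or threshold the server uses; cosine similarity is the standard metric for InsightFace embeddings, so a client-side equivalent would plausibly look like this (the `same_person` cutoff shown in the response above is presumably a server-side threshold, not something this sketch reproduces):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```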

Transcription Endpoints

POST /v1/transcribe

Transcribe audio with word-level timestamps for lip-sync alignment.

curl -X POST http://localhost:8000/v1/transcribe \
  -F "audio=@speech.wav" \
  -F "word_timestamps=true"

Response:

{
  "text": "Hello, this is a test.",
  "language": "en",
  "duration": 2.5,
  "word_timings": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
    {"word": "this", "start": 0.5, "end": 0.7, "confidence": 0.95},
    ...
  ]
}
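Since the word timings are intended for alignment work, one common downstream use is emitting SRT cues. The converter below is a minimal sketch that consumes the `word_timings` array exactly as shown in the response above:

```python
def words_to_srt(word_timings):
    """Convert /v1/transcribe word_timings into SRT cue text, one word per cue."""

    def ts(t):
        # SRT timestamp: HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = int(round((s % 1) * 1000))
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    cues = []
    for i, w in enumerate(word_timings, 1):
        cues.append(f"{i}\n{ts(w['start'])} --> {ts(w['end'])}\n{w['word']}\n")
    return "\n".join(cues)
```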

Environment Variables

| Variable                | Default       | Description                       |
| ----------------------- | ------------- | --------------------------------- |
| `TTS_BACKEND`           | `qwen3-tts`   | TTS backend (`qwen3-tts`, `mock`) |
| `FACE_ENABLED`          | `true`        | Enable the face embedding backend |
| `FACE_BACKEND`          | `insightface` | Face backend to use               |
| `TRANSCRIPTION_ENABLED` | `true`        | Enable the transcription backend  |
| `TRANSCRIPTION_BACKEND` | `whisper`     | Transcription backend to use      |

Development Mode (No GPU)

For local development without GPU, use the mock TTS backend:

TTS_BACKEND=mock FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py

Docker

# Build
docker build -t studio-server .

# Run with GPU (all backends)
docker run --gpus all -p 8000:8000 studio-server

# Run TTS only
docker run --gpus all -p 8000:8000 \
  -e FACE_ENABLED=false \
  -e TRANSCRIPTION_ENABLED=false \
  studio-server

Project Structure

studio-server/
├── server.py              # FastAPI application
├── backends/
│   ├── __init__.py
│   ├── base.py            # Base Backend class
│   ├── tts.py             # TTS backends (Qwen3-TTS)
│   ├── face.py            # Face backends (InsightFace)
│   └── transcription.py   # Transcription backends (Whisper)
├── tests/
├── requirements.txt
├── Dockerfile
└── README.md

Adding New Backends

Each backend type has its own module. To add a new backend:

# backends/tts.py
from typing import List


class MyTTSBackend(TTSBackend):
    def load(self) -> None:
        # Load model weights here (called once at startup)
        pass

    def synthesize(self, text, language, speaker, ref_audio, ref_text, speed):
        # Generate audio and return raw WAV bytes plus the sample rate
        return wav_bytes, sample_rate

    def get_info(self) -> dict:
        # Include any backend-specific metadata alongside the name
        return {"backend": "my-tts"}

    def get_speakers(self) -> List[str]:
        return ["speaker1", "speaker2"]


# Register in the TTS_BACKENDS dict
TTS_BACKENDS["my-tts"] = MyTTSBackend
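A fully self-contained toy backend can be handy for wiring tests. In this sketch the `TTSBackend` stub stands in for the real base class in `backends/base.py` (whose exact interface may differ), and the audio is 16-bit mono silence rather than real synthesis; the 24 kHz sample rate is an arbitrary placeholder.

```python
import io
import wave
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Stand-in for the real base class in backends/base.py."""

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def synthesize(self, text, language, speaker, ref_audio, ref_text, speed): ...


class SilenceTTSBackend(TTSBackend):
    """Toy backend returning silent WAV audio, useful for plumbing tests."""

    SAMPLE_RATE = 24000  # placeholder rate, not tied to any real model

    def load(self) -> None:
        pass  # nothing to load

    def synthesize(self, text, language="English", speaker=None,
                   ref_audio=None, ref_text=None, speed=1.0):
        # Emit ~0.1 s of 16-bit mono silence per word of input text.
        n_samples = int(self.SAMPLE_RATE * 0.1 * max(1, len(text.split())))
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(self.SAMPLE_RATE)
            w.writeframes(b"\x00\x00" * n_samples)
        return buf.getvalue(), self.SAMPLE_RATE
```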

Legacy Endpoints

For backwards compatibility, the old TTS endpoints are still available:

  • GET /v1/speakers (alias for GET /v1/tts/speakers)
  • POST /v1/voice/extract (alias for POST /v1/tts/extract)
  • POST /v1/audio/speech (alias for POST /v1/tts/synthesize)

License

MIT
