happyvertical/studio-server

Studio Server

AI-powered studio utilities for video production.

Features

  • TTS (Text-to-Speech) - Voice synthesis with cloning support (Qwen3-TTS)
  • Face Embedding - Extract face embeddings for IP-Adapter FaceID (InsightFace)
  • Transcription - Audio transcription with word-level timestamps (Whisper)
  • Modular backends - Pluggable architecture for each capability
  • GPU-accelerated - CUDA support for fast inference

API Overview

| Capability    | Endpoint                  | Description                            |
| ------------- | ------------------------- | -------------------------------------- |
| TTS           | `POST /v1/tts/synthesize` | Synthesize speech from text            |
| TTS           | `POST /v1/tts/extract`    | Extract a reusable voice prompt        |
| TTS           | `GET /v1/tts/speakers`    | List available speakers                |
| Face          | `POST /v1/face/embed`     | Extract a face embedding from an image |
| Face          | `POST /v1/face/embed-all` | Extract all faces from an image        |
| Face          | `POST /v1/face/compare`   | Compare two face embeddings            |
| Transcription | `POST /v1/transcribe`     | Transcribe audio with word timings     |

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run server (loads all backends)
python server.py

# Run with specific backends disabled
FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py

TTS Endpoints

POST /v1/tts/synthesize

Synthesize speech from text.

Parameters:

  • text (required): Text to synthesize
  • language: Target language (default: "English")
  • speaker: Preset speaker for basic TTS (e.g., "Vivian", "Ryan")
  • ref_audio: Reference audio file for voice cloning (on-the-fly)
  • ref_text: Transcript of reference audio (improves cloning quality)
  • voice_prompt: Pre-extracted voice prompt from /v1/tts/extract (cached)
  • speed: Speech speed multiplier (default: 1.0)

Examples:

# Basic TTS with default speaker
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, world!" \
  -o output.wav

# Voice cloning with reference audio
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=Hello, this is my cloned voice." \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript of my reference audio." \
  -o cloned.wav

POST /v1/tts/extract

Extract a reusable voice prompt from reference audio.

VOICE_PROMPT=$(curl -X POST http://localhost:8000/v1/tts/extract \
  -F "ref_audio=@reference.wav" \
  -F "ref_text=This is the transcript." \
  | jq -r '.voice_prompt')

# Reuse for multiple synthesis requests
curl -X POST http://localhost:8000/v1/tts/synthesize \
  -F "text=First sentence." \
  -F "voice_prompt=$VOICE_PROMPT" \
  -o output.wav

Face Embedding Endpoints

POST /v1/face/embed

Extract face embedding from an image for IP-Adapter FaceID (PerformerDNA).

curl -X POST http://localhost:8000/v1/face/embed \
  -F "image=@portrait.jpg" \
  -F "return_bbox=true"

Response:

{
  "embedding": "base64-encoded-512-dim-vector",
  "embedding_dim": 512,
  "confidence": 0.98,
  "bbox": [100, 50, 300, 350]
}
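The embedding arrives base64-encoded. A minimal client-side decoder might look like the sketch below; it assumes the server packs the vector as little-endian float32 (typical for InsightFace's 512-dim embeddings, but verify against your deployment before relying on it).

```python
import base64
import struct


def decode_embedding(b64: str, dim: int = 512) -> list:
    """Decode a base64-encoded embedding into a list of floats.

    Assumes little-endian float32 packing; check the server's
    actual encoding before using this in production.
    """
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{dim}f", raw))
```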

POST /v1/face/embed-all

Extract embeddings for all faces in an image.

curl -X POST http://localhost:8000/v1/face/embed-all \
  -F "image=@group-photo.jpg" \
  -F "max_faces=5"
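When a group photo returns several faces, you typically want the most confident detection. The helper below is a sketch only: it assumes the response carries a `faces` list whose items mirror the `/v1/face/embed` payload (`embedding`, `confidence`, `bbox`), which is not documented above, so check your server's actual schema.

```python
def most_confident_face(response: dict) -> dict:
    """Pick the highest-confidence face from an /v1/face/embed-all response.

    Assumes a {"faces": [...]} shape with per-face "confidence" fields;
    this schema is a guess, not taken from the API docs.
    """
    faces = response.get("faces", [])
    if not faces:
        raise ValueError("no faces detected")
    return max(faces, key=lambda f: f.get("confidence", 0.0))
```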

POST /v1/face/compare

Compare two face embeddings for similarity.

curl -X POST http://localhost:8000/v1/face/compare \
  -H "Content-Type: application/json" \
  -d '{"embedding1": "...", "embedding2": "..."}'

Response:

{
  "similarity": 0.85,
  "same_person": true
}
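The docs don't state which metric or threshold the server uses; cosine similarity is the standard metric for InsightFace embeddings, so a client-side equivalent would plausibly look like this (the `same_person` cutoff shown in the response above is presumably a server-side threshold, not something this sketch reproduces):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```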

Transcription Endpoints

POST /v1/transcribe

Transcribe audio with word-level timestamps for lip-sync alignment.

curl -X POST http://localhost:8000/v1/transcribe \
  -F "audio=@speech.wav" \
  -F "word_timestamps=true"

Response:

{
  "text": "Hello, this is a test.",
  "language": "en",
  "duration": 2.5,
  "word_timings": [
    {"word": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98},
    {"word": "this", "start": 0.5, "end": 0.7, "confidence": 0.95},
    ...
  ]
}
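Since the word timings are intended for alignment work, one common downstream use is emitting SRT cues. The converter below is a minimal sketch that consumes the `word_timings` array exactly as shown in the response above:

```python
def words_to_srt(word_timings):
    """Convert /v1/transcribe word_timings into SRT cue text, one word per cue."""

    def ts(t):
        # SRT timestamp: HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = int(round((s % 1) * 1000))
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    cues = []
    for i, w in enumerate(word_timings, 1):
        cues.append(f"{i}\n{ts(w['start'])} --> {ts(w['end'])}\n{w['word']}\n")
    return "\n".join(cues)
```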

Environment Variables

| Variable                | Default       | Description                       |
| ----------------------- | ------------- | --------------------------------- |
| `TTS_BACKEND`           | `qwen3-tts`   | TTS backend (`qwen3-tts`, `mock`) |
| `FACE_ENABLED`          | `true`        | Enable the face embedding backend |
| `FACE_BACKEND`          | `insightface` | Face backend to use               |
| `TRANSCRIPTION_ENABLED` | `true`        | Enable the transcription backend  |
| `TRANSCRIPTION_BACKEND` | `whisper`     | Transcription backend to use      |

Development Mode (No GPU)

For local development without GPU, use the mock TTS backend:

TTS_BACKEND=mock FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py

Docker

# Build
docker build -t studio-server .

# Run with GPU (all backends)
docker run --gpus all -p 8000:8000 studio-server

# Run TTS only
docker run --gpus all -p 8000:8000 \
  -e FACE_ENABLED=false \
  -e TRANSCRIPTION_ENABLED=false \
  studio-server

Project Structure

studio-server/
├── server.py              # FastAPI application
├── backends/
│   ├── __init__.py
│   ├── base.py            # Base Backend class
│   ├── tts.py             # TTS backends (Qwen3-TTS)
│   ├── face.py            # Face backends (InsightFace)
│   └── transcription.py   # Transcription backends (Whisper)
├── tests/
├── requirements.txt
├── Dockerfile
└── README.md

Adding New Backends

Each backend type has its own module. To add a new backend:

# backends/tts.py
from typing import List


class MyTTSBackend(TTSBackend):
    def load(self) -> None:
        # Load model weights here (called once at startup)
        pass

    def synthesize(self, text, language, speaker, ref_audio, ref_text, speed):
        # Generate audio and return raw WAV bytes plus the sample rate
        return wav_bytes, sample_rate

    def get_info(self) -> dict:
        # Include any backend-specific metadata alongside the name
        return {"backend": "my-tts"}

    def get_speakers(self) -> List[str]:
        return ["speaker1", "speaker2"]


# Register in the TTS_BACKENDS dict
TTS_BACKENDS["my-tts"] = MyTTSBackend
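A fully self-contained toy backend can be handy for wiring tests. In this sketch the `TTSBackend` stub stands in for the real base class in `backends/base.py` (whose exact interface may differ), and the audio is 16-bit mono silence rather than real synthesis; the 24 kHz sample rate is an arbitrary placeholder.

```python
import io
import wave
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Stand-in for the real base class in backends/base.py."""

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def synthesize(self, text, language, speaker, ref_audio, ref_text, speed): ...


class SilenceTTSBackend(TTSBackend):
    """Toy backend returning silent WAV audio, useful for plumbing tests."""

    SAMPLE_RATE = 24000  # placeholder rate, not tied to any real model

    def load(self) -> None:
        pass  # nothing to load

    def synthesize(self, text, language="English", speaker=None,
                   ref_audio=None, ref_text=None, speed=1.0):
        # Emit ~0.1 s of 16-bit mono silence per word of input text.
        n_samples = int(self.SAMPLE_RATE * 0.1 * max(1, len(text.split())))
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(self.SAMPLE_RATE)
            w.writeframes(b"\x00\x00" * n_samples)
        return buf.getvalue(), self.SAMPLE_RATE
```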

Legacy Endpoints

For backwards compatibility, the old TTS endpoints are still available:

  • GET /v1/speakers (alias for GET /v1/tts/speakers)
  • POST /v1/voice/extract (alias for POST /v1/tts/extract)
  • POST /v1/audio/speech (alias for POST /v1/tts/synthesize)

License

MIT
