This is a Replicate implementation of NeuTTS Air, a state-of-the-art text-to-speech model with instant voice cloning capabilities.
- 🗣 Best-in-class realism - Ultra-realistic, natural-sounding voices
- 👫 Instant voice cloning - Clone any voice with just 3-15 seconds of audio
- 🚄 Real-time generation - Fast inference on CPU
- 📱 On-device ready - Optimized for lightweight deployment
- Architecture: Built on Qwen 0.5B LLM backbone
- Audio Codec: NeuCodec, Neuphonic's neural audio codec
- Sample Rate: 24kHz output
- Inference: CPU-optimized
Make sure you have Cog installed:
```shell
sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog
```
```shell
huggingface-cli download neuphonic/neutts-air --local-dir checkpoints
```
```shell
cog build
```
```shell
cog predict \
  -i input_text="Hello, this is a test of the NeuTTS Air voice cloning system." \
  -i ref_audio=@samples/dave.wav \
  -i ref_text="My name is Dave, and um, I'm from London"
```
```python
import soundfile as sf

from neuttsair.neutts import NeuTTSAir

# Initialize the model
tts = NeuTTSAir(
    backbone_repo="./checkpoints",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

# Prepare inputs
input_text = "Hello, this is a test of the NeuTTS Air voice cloning system."
ref_audio_path = "samples/dave.wav"
ref_text = "My name is Dave, and um, I'm from London."

# Encode the reference voice, then generate speech in that voice
ref_codes = tts.encode_reference(ref_audio_path)
wav = tts.infer(input_text, ref_codes, ref_text)

# Save the 24 kHz output
sf.write("output.wav", wav, 24000)
```
- `input_text` (string, required): The text to synthesize as speech.
- `ref_audio` (file, required): Reference audio file for voice cloning.
  - Format: `.wav` file
  - Duration: 3-15 seconds recommended
  - Sample rate: 16-44 kHz
  - Channels: mono preferred
  - Quality: clean audio with minimal background noise
- `ref_text` (string, optional): Transcript of what is said in the reference audio.
Returns a `.wav` file containing the synthesized speech at a 24 kHz sample rate.
- Use high-quality reference audio: Clean, clear recordings work best
- Match speaking style: The model will mimic the tone and style of the reference
- Provide accurate transcripts: The ref_text should match what's actually said in ref_audio
- Optimal reference length: 3-15 seconds captures the voice well
- Natural speech: Conversational audio with few pauses works best
Apache 2.0
Created by Neuphonic - building faster, smaller, on-device voice AI.
This model includes Perth (Perceptual Threshold) Watermarker for responsible AI usage. Please use this technology ethically and responsibly.