This is a Replicate implementation of NeuTTS Air, a state-of-the-art text-to-speech model with instant voice cloning capabilities.
- 🗣 Best-in-class realism - Ultra-realistic, natural-sounding voices
- 👫 Instant voice cloning - Clone any voice with just 3-15 seconds of audio
- 🚄 Real-time generation - Fast inference on CPU
- 📱 On-device ready - Optimized for lightweight deployment
- Architecture: Built on Qwen 0.5B LLM backbone
- Audio Codec: NeuCodec, Neuphonic's neural audio codec
- Sample Rate: 24kHz output
- Inference: CPU-optimized
Make sure you have Cog installed:
```shell
sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog
```
```shell
huggingface-cli download neuphonic/neutts-air --local-dir checkpoints
```
```shell
cog build
```
```shell
cog predict \
  -i input_text="Hello, this is a test of the NeuTTS Air voice cloning system." \
  -i ref_audio=@samples/dave.wav \
  -i ref_text="My name is Dave, and um, I'm from London"
```
```python
import soundfile as sf

from neuttsair.neutts import NeuTTSAir

# Initialize the model
tts = NeuTTSAir(
    backbone_repo="./checkpoints",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)

# Prepare inputs
input_text = "Hello, this is a test of the NeuTTS Air voice cloning system."
ref_audio_path = "samples/dave.wav"
ref_text = "My name is Dave, and um, I'm from London."

# Encode the reference voice, then generate speech in that voice
ref_codes = tts.encode_reference(ref_audio_path)
wav = tts.infer(input_text, ref_codes, ref_text)

# Save the 24 kHz output
sf.write("output.wav", wav, 24000)
```
- `input_text` (string, required): The text to synthesize as speech.
- `ref_audio` (file, required): Reference audio file for voice cloning.
  - Format: `.wav` file
  - Duration: 3-15 seconds recommended
  - Sample rate: 16-44 kHz
  - Channels: mono preferred
  - Quality: clean audio with minimal background noise
- `ref_text` (string, optional): Transcript of what is said in the reference audio.
Returns a `.wav` file containing the synthesized speech at a 24 kHz sample rate.
- Use high-quality reference audio: Clean, clear recordings work best
- Match speaking style: The model will mimic the tone and style of the reference
- Provide accurate transcripts: The ref_text should match what's actually said in ref_audio
- Optimal reference length: 3-15 seconds captures the voice well
- Natural speech: Conversational audio with few pauses works best
Apache 2.0
Created by Neuphonic - building faster, smaller, on-device voice AI.
This model includes Perth (Perceptual Threshold) Watermarker for responsible AI usage. Please use this technology ethically and responsibly.