NeuTTS Air - Replicate Model

This is a Replicate implementation of NeuTTS Air, a state-of-the-art text-to-speech model with instant voice cloning capabilities.

Features

  • 🗣 Best-in-class realism - Ultra-realistic, natural-sounding voices
  • 👫 Instant voice cloning - Clone any voice with just 3-15 seconds of audio
  • 🚄 Real-time generation - Fast inference on CPU
  • 📱 On-device ready - Optimized for lightweight deployment

Model Details

  • Architecture: Built on Qwen 0.5B LLM backbone
  • Audio Codec: NeuCodec, a proprietary neural audio codec
  • Sample Rate: 24kHz output
  • Inference: CPU-optimized

Setup

Prerequisites

Make sure you have Cog installed:

sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog

Download the weights

huggingface-cli download neuphonic/neutts-air --local-dir checkpoints

Build the model

cog build

Usage

Local prediction with Cog

cog predict \
  -i input_text="Hello, this is a test of the NeuTTS Air voice cloning system." \
  -i ref_audio=@samples/dave.wav \
  -i ref_text="My name is Dave, and um, I'm from London"

Python API

import soundfile as sf
from neuttsair.neutts import NeuTTSAir

# Initialize model
tts = NeuTTSAir(
    backbone_repo="./checkpoints",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu"
)

# Prepare inputs: the text to speak, plus a reference clip and its transcript
input_text = "My name is Dave, and um, I'm from London."
ref_audio_path = "samples/dave.wav"
ref_text = "My name is Dave, and um, I'm from London."  # what is said in dave.wav

# Generate speech
ref_codes = tts.encode_reference(ref_audio_path)
wav = tts.infer(input_text, ref_codes, ref_text)

# Save output
sf.write("output.wav", wav, 24000)
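When several sentences should share one cloned voice, the reference only needs to be encoded once. The sketch below wraps the two calls from the example above; `synthesize_many` is a hypothetical helper, and it assumes the value returned by `encode_reference` can be reused across `infer` calls.

```python
def synthesize_many(tts, texts, ref_audio_path, ref_text):
    """Encode the reference clip once, then synthesize each text with it.

    `tts` is any object exposing encode_reference(path) and
    infer(text, ref_codes, ref_text), as in the example above.
    Assumption: the encoded reference is reusable across infer calls.
    """
    ref_codes = tts.encode_reference(ref_audio_path)
    return [tts.infer(text, ref_codes, ref_text) for text in texts]


# Usage with the objects from the example above:
#   wavs = synthesize_many(tts, ["First line.", "Second line."],
#                          "samples/dave.wav", ref_text)
#   for i, wav in enumerate(wavs):
#       sf.write(f"output_{i}.wav", wav, 24000)
```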

Input Parameters

  • input_text (string, required): The text to synthesize as speech
  • ref_audio (file, required): Reference audio file for voice cloning
    • Format: .wav file
    • Duration: 3-15 seconds recommended
    • Sample rate: 16-44 kHz
    • Channels: Mono preferred
    • Quality: Clean audio with minimal background noise
  • ref_text (string, optional): Transcript of what's being said in the reference audio

Output

Returns a .wav file containing the synthesized speech at 24kHz sample rate.

Tips for Best Results

  1. Use high-quality reference audio: Clean, clear recordings work best
  2. Match speaking style: The model will mimic the tone and style of the reference
  3. Provide accurate transcripts: The ref_text should match what's actually said in ref_audio
  4. Optimal reference length: 3-15 seconds captures the voice well
  5. Natural speech: Conversational audio with few pauses works best

License

Apache 2.0

Credits

Created by Neuphonic - building faster, smaller, on-device voice AI.

Disclaimer

This model includes the Perth (Perceptual Threshold) Watermarker to support responsible AI usage. Please use this technology ethically and responsibly.
