CSM Voice Clone - CPU Edition

This is a modified version of Sesame's CSM (Conversational Speech Model) that adds voice cloning capabilities and runs CPU-only on Windows under WSL (no GPU required).

CSM generates high-quality speech from text using the CSM-1B model and Llama-3.2-1B backbone.

What's New

  • Voice cloning support via voice_clone.py
  • CPU-only execution on Windows WSL
  • Optimized dependencies for CPU builds

Full Setup Guide: WSL (CPU-only)

1. Install system dependencies

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-venv python3-pip git build-essential \
                    libsndfile1 ffmpeg

Why these packages?

  • libsndfile1, ffmpeg → Required for torchaudio and pydub
  • build-essential → Compilation tools for Python packages

2. Clone the repo into Linux home

mkdir -p ~/code && cd ~/code
git clone <your-csm-repo-url> csm
cd csm

⚠️ Important: Avoid /mnt/c/... paths. Always keep the project in your Linux home directory (~) for better speed and stability.

3. Create & activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

You should see (.venv) in your shell prompt.

4. Fix requirements.txt for CPU builds

At the top of requirements.txt, add these lines:

--index-url https://download.pytorch.org/whl/cpu
--extra-index-url https://pypi.org/simple

Then ensure it contains:

torch==2.4.1
torchaudio==2.4.1
tokenizers==0.21.0
transformers==4.49.0
huggingface_hub==0.28.1
moshi==0.2.2
torchtune==0.4.0
torchao==0.9.0
silentcipher @ git+https://github.com/SesameAILabs/silentcipher@master

5. Upgrade pip tooling inside venv

python -m pip install --upgrade pip setuptools wheel certifi

6. Install dependencies

This will pull CPU-only wheels for PyTorch:

python -m pip install --no-cache-dir -r requirements.txt

7. Sanity check your environment

python - <<'PY'
import torch, torchaudio, soundfile
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("cuda available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
print("libsndfile OK")
PY

Expected output:

torch: 2.4.1+cpu
torchaudio: 2.4.1
cuda available: False
torch.version.cuda: None
libsndfile OK

8. Disable Mimi lazy compilation

export NO_TORCH_COMPILE=1

Add this to your ~/.bashrc to make it permanent:

echo 'export NO_TORCH_COMPILE=1' >> ~/.bashrc
source ~/.bashrc
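The flag works because model code typically guards its `torch.compile` calls on this environment variable. A minimal sketch of that pattern (the helper name `maybe_compile` is illustrative, not taken from the repo):

```python
import os

def maybe_compile(fn):
    """Return fn unchanged when NO_TORCH_COMPILE is set; otherwise compile it."""
    if os.environ.get("NO_TORCH_COMPILE"):
        return fn  # eager execution: the safe path on CPU-only setups
    import torch
    return torch.compile(fn)
```

With the variable exported, every guarded function runs eagerly, avoiding `torch.compile`'s warm-up overhead on CPU.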

9. Login to Hugging Face

You need a Hugging Face account with access to the CSM-1B and Llama-3.2-1B model repositories (the Llama models are gated, so request access on their model pages first):

huggingface-cli login

Enter your Hugging Face token when prompted.


Running Voice Clone

Prepare Your Voice Sample

Before you can clone your voice, you need to prepare a voice sample:

  1. Record a 30-second audio sample of your voice

    • Use any recording app (phone, computer, etc.)
    • Speak naturally and clearly
    • Choose content that represents your normal speaking style
    • Save as .wav or .m4a format
  2. Create a transcript of exactly what you said in the recording

    • This should match the audio word-for-word
    • Accuracy is important for better cloning results
  3. Place your audio file in the data/ folder

    • Example: data/my_voice_sample.wav

Configure Voice Clone Settings

  1. Copy the example config:

    cp data/voice_clone_config.example.json data/voice_clone_config.json
  2. Edit data/voice_clone_config.json:

    {
      "voice_prompt_file": "data/my_voice_sample.wav",
      "prompt_transcript": "Your exact transcript here...",
      "voiceover_script": [
        "First sentence to generate in your cloned voice.",
        "Second sentence to generate.",
        "Add as many as you need."
      ]
    }
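A quick way to catch typos before a long model load is to validate the config up front (a sketch; `load_config` is a hypothetical helper, and the required keys are exactly the three shown above):

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"voice_prompt_file", "prompt_transcript", "voiceover_script"}

def load_config(path="data/voice_clone_config.json"):
    """Load the voice-clone config and fail fast on missing keys or files."""
    cfg = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"config missing keys: {sorted(missing)}")
    if not Path(cfg["voice_prompt_file"]).is_file():
        raise FileNotFoundError(cfg["voice_prompt_file"])
    if not isinstance(cfg["voiceover_script"], list):
        raise TypeError("voiceover_script must be a list of strings")
    return cfg
```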

Basic Usage

python voice_clone.py

This will:

  1. Load the CSM-1B model
  2. Use your voice prompt audio (configure in data/voice_clone_config.json)
  3. Generate cloned speech from text input

Configuration

Edit data/voice_clone_config.json to customize:

  • Voice prompt audio file
  • Text to generate
  • Output settings
  • Model parameters

Troubleshooting

Issue: torch.cuda.is_available() returns True but you want CPU

Hide any GPUs from PyTorch:

export CUDA_VISIBLE_DEVICES=""
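The same thing can be done from Python, as long as it happens before the first `import torch` anywhere in the process (a sketch):

```python
import os

# An empty value hides every CUDA device from PyTorch;
# this must run before torch is first imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```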

Issue: Slow performance on CPU

This is expected. CPU inference is significantly slower than GPU. For faster generation:

  • Use shorter text prompts
  • Reduce max_audio_length_ms
  • Consider cloud GPU options if speed is critical

Issue: ImportError: libsndfile.so

Install the missing library:

sudo apt install -y libsndfile1

Issue: ffmpeg errors

Install ffmpeg:

sudo apt install -y ffmpeg

License

This project is based on CSM by Sesame AI Labs. Please refer to the LICENSE file for terms and conditions.

Ethical Use ⚠️

This tool provides high-quality voice cloning capabilities. Please use it responsibly:

  • Get explicit consent before cloning someone's voice
  • Do not use for impersonation, fraud, or deception
  • Do not create misleading or harmful content
  • Do not violate any laws or regulations

You are responsible for how you use this technology. Use it ethically and legally.


Credits

  • Original CSM: Sesame AI Labs
  • Authors: Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team
  • Modifications: Voice cloning and CPU support added by this fork
