This is a modified version of Sesame's CSM (Conversational Speech Model) that adds voice cloning capabilities and runs CPU-only on Windows WSL (no GPU required).
CSM generates high-quality speech from text using the CSM-1B model and Llama-3.2-1B backbone.
- ✅ Voice cloning support via `voice_clone.py`
- ✅ CPU-only execution on Windows WSL
- ✅ Optimized dependencies for CPU builds
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-venv python3-pip git build-essential \
    libsndfile1 ffmpeg
```

Why these packages?

- `libsndfile1`, `ffmpeg` → required for `torchaudio` and `pydub`
- `build-essential` → compilation tools for Python packages
```bash
mkdir -p ~/code && cd ~/code
git clone <your-csm-repo-url> csm
cd csm
```

⚠️ Avoid `/mnt/c/...` paths. Always keep the project in your Linux home directory (`~`) for better speed and stability.
```bash
python3 -m venv .venv
source .venv/bin/activate
```

You should see `(.venv)` in your shell prompt.
At the top of `requirements.txt`, add these lines:

```text
--index-url https://download.pytorch.org/whl/cpu
--extra-index-url https://pypi.org/simple
```

Then ensure it contains:

```text
torch==2.4.1
torchaudio==2.4.1
tokenizers==0.21.0
transformers==4.49.0
huggingface_hub==0.28.1
moshi==0.2.2
torchtune==0.4.0
torchao==0.9.0
silentcipher @ git+https://github.com/SesameAILabs/silentcipher@master
```
```bash
python -m pip install --upgrade pip setuptools wheel certifi
```

This will pull CPU-only wheels for PyTorch:

```bash
python -m pip install --no-cache-dir -r requirements.txt
```

Verify the installation:

```bash
python - <<'PY'
import torch, torchaudio, soundfile
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("cuda available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
print("libsndfile OK")
PY
```

✅ Expected output:
```text
torch: 2.4.1+cpu
torchaudio: 2.4.1
cuda available: False
torch.version.cuda: None
libsndfile OK
```
```bash
export NO_TORCH_COMPILE=1
```

Add this to your `~/.bashrc` to make it permanent:

```bash
echo 'export NO_TORCH_COMPILE=1' >> ~/.bashrc
source ~/.bashrc
```

You need access to the CSM-1B and Llama-3.2-1B models on Hugging Face:

```bash
huggingface-cli login
```

Enter your Hugging Face token when prompted.
Before you can clone your voice, you need to prepare a voice sample:

1. **Record a 30-second audio sample of your voice**
   - Use any recording app (phone, computer, etc.)
   - Speak naturally and clearly
   - Choose content that represents your normal speaking style
   - Save in `.wav` or `.m4a` format
2. **Create a transcript of exactly what you said in the recording**
   - It should match the audio word for word
   - Accuracy is important for better cloning results
3. **Place your audio file in the `data/` folder**
   - Example: `data/my_voice_sample.wav`
4. **Copy the example config:**
   ```bash
   cp data/voice_clone_config.example.json data/voice_clone_config.json
   ```
5. **Edit `data/voice_clone_config.json`:**
   ```json
   {
     "voice_prompt_file": "data/my_voice_sample.wav",
     "prompt_transcript": "Your exact transcript here...",
     "voiceover_script": [
       "First sentence to generate in your cloned voice.",
       "Second sentence to generate.",
       "Add as many as you need."
     ]
   }
   ```
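Before a long CPU run, it can save time to sanity-check the config first. A minimal sketch, assuming the three keys shown in the example config; the word-count heuristic (a 30-second sample is usually around 60–90 words) is an assumption, not part of this fork:

```python
# Hedged sketch: sanity-check data/voice_clone_config.json before running
# voice_clone.py. Key names come from the example config above; the
# 30-word transcript floor is an assumed heuristic.
import json

REQUIRED_KEYS = {"voice_prompt_file", "prompt_transcript", "voiceover_script"}

def check_clone_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = looks OK)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS - cfg.keys()]
    if not str(cfg.get("voice_prompt_file", "")).endswith((".wav", ".m4a")):
        problems.append("voice_prompt_file should be a .wav or .m4a file")
    words = len(str(cfg.get("prompt_transcript", "")).split())
    if words < 30:  # a 30 s sample is usually ~60-90 words
        problems.append(f"transcript looks short ({words} words) - is it complete?")
    if not cfg.get("voiceover_script"):
        problems.append("voiceover_script is empty - nothing to generate")
    return problems

# Usage:
#   with open("data/voice_clone_config.json") as f:
#       print(check_clone_config(json.load(f)) or "config looks OK")
```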
```bash
python voice_clone.py
```

This will:

- Load the CSM-1B model
- Use your voice prompt audio (configured in `data/voice_clone_config.json`)
- Generate cloned speech from your text input
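Under the hood, prompt-based cloning with CSM works by passing your sample audio and its transcript as a context segment when generating. A sketch using the upstream CSM API (`load_csm_1b` and `Segment` are names from the sesame/csm repo's `generator` module; how this fork's `voice_clone.py` actually wires them together is an assumption):

```python
# Hedged sketch of prompt-based cloning with the upstream CSM API.
# load_csm_1b / Segment come from the sesame/csm generator module;
# voice_clone.py likely wraps a loop like this, but treat the details
# (speaker id, resampling, output naming) as assumptions.
def clone_voice(config_path: str = "data/voice_clone_config.json") -> None:
    import json
    import torchaudio
    from generator import load_csm_1b, Segment  # upstream CSM helpers

    with open(config_path) as f:
        cfg = json.load(f)

    gen = load_csm_1b(device="cpu")  # CPU-only, per this fork

    # The voice prompt: your sample audio plus its exact transcript.
    audio, sr = torchaudio.load(cfg["voice_prompt_file"])
    audio = torchaudio.functional.resample(audio.squeeze(0), sr, gen.sample_rate)
    prompt = Segment(text=cfg["prompt_transcript"], speaker=0, audio=audio)

    # Generate each script line conditioned on the prompt segment.
    for i, line in enumerate(cfg["voiceover_script"]):
        out = gen.generate(text=line, speaker=0, context=[prompt],
                           max_audio_length_ms=10_000)
        torchaudio.save(f"output_{i}.wav", out.unsqueeze(0).cpu(), gen.sample_rate)
```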
Edit `data/voice_clone_config.json` to customize:
- Voice prompt audio file
- Text to generate
- Output settings
- Model parameters
Set the device explicitly:

```bash
export CUDA_VISIBLE_DEVICES=""
```

This is expected. CPU inference is significantly slower than GPU. For faster generation:

- Use shorter text prompts
- Reduce `max_audio_length_ms`
- Consider cloud GPU options if speed is critical
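To pick a sensible `max_audio_length_ms` per line instead of a blanket cap, a rough speaking-rate heuristic can help. This is a sketch under assumptions: the ~150 words-per-minute rate, the 1.5× headroom, and the 2-second floor are all arbitrary choices, not values from this project:

```python
# Rough heuristic for sizing max_audio_length_ms per script line.
# 150 words/minute is an assumed average speaking rate; the 1.5x
# headroom and 2 s floor are arbitrary safety margins.
def estimate_audio_ms(text: str, words_per_minute: int = 150) -> int:
    words = len(text.split())
    ms = int(words / words_per_minute * 60_000 * 1.5)  # 1.5x headroom
    return max(ms, 2_000)  # never go below 2 seconds

# A 6-word line like "Hello there, how are you today?" gets 3600 ms,
# so short lines finish far sooner than a fixed 30-second budget.
```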
Install the missing library:

```bash
sudo apt install -y libsndfile1
```

Install ffmpeg:

```bash
sudo apt install -y ffmpeg
```

This project is based on CSM by Sesame AI Labs. Please refer to the LICENSE file for terms and conditions.
This tool provides high-quality voice cloning capabilities. Please use it responsibly:
- ✅ Get explicit consent before cloning someone's voice
- ❌ Do not use for impersonation, fraud, or deception
- ❌ Do not create misleading or harmful content
- ❌ Do not violate any laws or regulations
You are responsible for how you use this technology. Use it ethically and legally.
- Original CSM: Sesame AI Labs
- Authors: Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team
- Modifications: Voice cloning and CPU support added by this fork