Multimodal Voice Chat with AI Models

This application provides a voice chat interface powered by AI language models (Scaleway and Ollama). It features speech-to-text, LLM processing, and text-to-speech capabilities with support for multiple voice types and session management for saving conversations.

Features

Speech-to-Text: Uses Whisper model to transcribe user voice input
Dual LLM Integration:
- Scaleway AI: Connect to Scaleway's hosted LLM service
- Ollama: Connect to a self-hosted Ollama server
Text-to-Speech: Generates natural-sounding speech with:
- CSM (Conditioned Sound Model): High-quality voice for responses up to 300 characters
- gTTS (Google Text-to-Speech): Efficient fallback for longer responses
Voice Selection: Choose between Woman and Man voices with consistent mapping
Conversation Memory: Maintains context throughout the conversation
Session Management: Save, load, and manage conversation sessions with unique identifiers
GPU Acceleration: Optimized for ML tasks using CUDA if available
Responsive UI: Clear interface with service selection and voice type options

Prerequisites

For Local Deployment (`app_scaleway.py`)

Python 3.8+
CUDA-compatible GPU (recommended for CSM performance)
Scaleway API key (for Scaleway service)
Ollama server (optional, for Ollama service)

For Modal Deployment (`app_modal.py`)

A Modal account
An Ollama server endpoint for LLM access
Modal CLI installed (pip install modal)
Hugging Face account (for CSM model access)

For Scaleway Serverless Container Deployment

Scaleway account with Container Registry and Serverless Containers enabled
Docker installed locally
Scaleway CLI (optional)
Container Registry namespace

Setup Instructions for Local Deployment

Step 1: Create a Python virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 2: Install dependencies

pip install torch torchaudio transformers gradio requests sounddevice numpy soundfile openai huggingface_hub
pip install silentcipher@git+https://github.com/SesameAILabs/silentcipher@master
pip install gtts pydub  # For gTTS fallback

Step 3: Set environment variables

# Set your Scaleway API key
export SCALEWAY_API_KEY="your-scaleway-api-key"

# Optional session management configuration
export SESSION_DIR="user_sessions"  # Directory to store session files
export SESSION_RETENTION_DAYS="30"  # How long to keep session files (default: 30 days)
export CLEANUP_INTERVAL_HOURS="24"  # How often to check for old sessions (default: 24 hours)

Step 4: Run the application

python app_scaleway.py

The application will be available at http://localhost:7860 by default.

Setup Instructions for Scaleway Serverless Container Deployment

Step 1: Create a Dockerfile

Create a file named Dockerfile with the following content:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    libsndfile1 \
    libasound2-dev \
    portaudio19-dev \
    python3-pyaudio \
    && rm -rf /var/lib/apt/lists/*

# Copy application file
COPY app_simple_scaleway.py ./app.py

# Install packages in separate steps for better reliability
RUN pip install --no-cache-dir \
    gradio==5.23.1 \
    transformers==4.35.2 \
    numpy==1.25.2 \
    requests==2.31.0 \
    gtts==2.3.2 \
    pydub==0.25.1 \
    soundfile==0.12.1 \
    PyAudio==0.2.13

# Install PyTorch separately with its custom index
RUN pip install --no-cache-dir \
    torch==2.1.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cpu

# Pre-download models to avoid runtime downloads
RUN python -c "from transformers import WhisperProcessor, WhisperForConditionalGeneration; \
    processor = WhisperProcessor.from_pretrained('openai/whisper-base'); \
    model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-base')"

# Create directories
RUN mkdir -p audio_outputs user_sessions

# Expose port
EXPOSE 7860

# Start application
CMD ["python", "app.py"]

Step 2: Build the Docker image for AMD64 architecture

docker buildx build --load --platform=linux/amd64 -t scaleway-voice-chat .

Step 3: Push to Scaleway Container Registry

# Login to Scaleway Container Registry (replace with your namespace)
docker login rg.nl-ams.scw.cloud/your-namespace -u nologin -p your-scaleway-api-key

# Tag the image
docker tag scaleway-voice-chat rg.nl-ams.scw.cloud/your-namespace/scaleway-voice-chat:latest

# Push to registry
docker push rg.nl-ams.scw.cloud/your-namespace/scaleway-voice-chat:latest

Step 4: Deploy as Serverless Container

Option 1: Using Scaleway Console

Go to the Scaleway Console
Navigate to "Serverless" → "Containers"
Click "Create Container"
Configure your container:
- Name: scaleway-voice-chat
- Container Image: Select your registry and scaleway-voice-chat:latest image
- Memory: 2GB
- CPU: 2 vCPU
- Environment Variables:
  - Name: SCALEWAY_API_KEY
  - Value: Your Scaleway API key
  - Type: Secret (ensures it's stored securely)
- Port: 7860
Click "Create Container"

Option 2: Using Scaleway CLI

# Install Scaleway CLI if needed
curl -o /usr/local/bin/scw -L "https://github.com/scaleway/scaleway-cli/releases/latest/download/scw-darwin-arm64"
chmod +x /usr/local/bin/scw
scw init

# Deploy container
scw container namespace function create \
  --namespace-id your-namespace \
  --name scaleway-voice-chat \
  --registry-image rg.nl-ams.scw.cloud/your-namespace/scaleway-voice-chat:latest \
  --memory-limit 2G \
  --port 7860 \
  --env SCALEWAY_API_KEY=your-api-key:secret \
  --region nl-ams

Step 5: Access Your Deployed Application

Once deployed, you can access your application at the URL provided by Scaleway Serverless Containers.

Using the Application

Accessing the Web Interface

Navigate to the local URL, Modal deployment URL, or Scaleway Serverless Container URL in your browser
For Modal, add /ui to the URL (e.g., https://your-username--voice-chat-app-v1-serve.modal.run/ui)
For Scaleway Serverless Container, use the URL provided in the console (e.g., https://your-function-id.functions.fnc.fr-par.scw.cloud)

Using the Voice Chat

Select AI Service: Choose between Scaleway (default) or Ollama
Select a Model: Choose from available models in the dropdown
- Scaleway models: deepseek-r1-distill-llama-70b, meta-llama-3-70b-instruct, mixtral-8x7b-instruct-v0.1
- Ollama models: mistral:latest, llama3:8b, llama3:70b, gemma3:27b, phi3:14b, mixtral:8x7b, codellama:70b
Choose Voice Type: Select "Woman" or "Man" for the AI's voice response
Record Audio: Click the microphone icon and speak your message
Process Message: Your speech will be automatically transcribed and processed
Listen to Response: The AI's response will be displayed as text and played as audio
Reset Conversation: Use the reset button to start a new conversation

Using Session Management

The application includes session management features that allow you to save and reload your conversations:

Session ID: Each conversation is associated with a unique session ID, which is randomly generated by default
Custom Session ID: You can create a memorable name for your session using the "Custom Session ID" field
- Enter a name (e.g., "work-chat" or "medical-questions")
- Click "Apply Custom ID" to assign this name to your session
Save Session: Click the "💾 Save Session" button to store your current conversation
Load Session: Click the "📂 Load Session" button to reload a previously saved conversation
Clear Session: Click the "🗑️ Clear Session" button to start fresh while keeping the same session ID

Sessions are isolated between different browser windows and tabs, allowing you to maintain multiple separate conversations simultaneously. Each session has its own conversation history that persists between page refreshes when saved.

Speech Generation Features

The application uses two speech generation methods:

CSM (Conditioned Sound Model):
- High-quality voice synthesis for responses up to 300 characters
- Consistent voice mapping (Man and Woman voices)
- Optimized parameters for faster generation
gTTS (Google Text-to-Speech):
- Used for responses longer than 300 characters
- Different voice mapping using regional accents:
  - Woman voice: US English (tld="us")
  - Man voice: UK English (tld="co.uk")

Text is sanitized before speech generation, with special handling for:

Removal of asterisks and other problematic characters
Preservation of standard punctuation for natural speech
Normalization of quotes and whitespace

Advanced Configuration Options

Local Deployment Options

You can customize the application by modifying:

# Set the port for the Gradio interface
export GRADIO_PORT=8080

# Session management configuration
export SESSION_DIR="user_sessions"  # Directory where session files are stored
export SESSION_RETENTION_DAYS="30"  # Days to keep session files before automatic deletion
export CLEANUP_INTERVAL_HOURS="24"  # How often to run the cleanup job (in hours)

# Enable debug logging
export PYTHONPATH=./  # If needed to resolve import issues

Scaleway Serverless Container Options

When deploying to Scaleway Serverless Containers, you can customize:

Environment Variables:
- SCALEWAY_API_KEY: Your Scaleway API key (set as a secret)
- SESSION_DIR: Directory where session files are stored
- SESSION_RETENTION_DAYS: Days to keep session files
- CLEANUP_INTERVAL_HOURS: How often to run cleanup
Resources:
- Memory: 2GB recommended for good performance
- vCPU: 2 vCPU recommended for faster processing
- Minimum Scale: 0 (scale to zero when not in use)
- Maximum Scale: Based on your expected traffic
Security:
- Always set API keys as Secrets in the console
- Consider adding authentication if needed
- Enable HTTPS termination (recommended)

Making the App Public

To make your Gradio app publicly accessible:

With the local deployment, set share=True in the demo.launch() call (already configured)
For Modal deployment, the app is accessible via the provided URL
For Scaleway Serverless Container, the app is automatically accessible via the provided URL

Session Storage

Session data is stored in JSON files within the SESSION_DIR directory (defaults to "user_sessions"). Each session file contains:

A hashed session ID for privacy
The conversation history (user and AI messages)
The last updated timestamp

The system automatically cleans up old session files based on the SESSION_RETENTION_DAYS setting.

Troubleshooting

Common Issues

Voice Mapping Confusion: If voices sound incorrect, verify the CSM constants at the top of the file:

SPEAKER_ID_WOMAN = 0  # UI selection "Woman" 
SPEAKER_ID_MAN = 1    # UI selection "Man"
CSM_SPEAKER_WOMAN = 0  # CSM model expects ID 0 for woman
CSM_SPEAKER_MAN = 1    # CSM model expects ID 1 for man

Audio Not Working: Ensure your browser has microphone permissions enabled
LLM Connection Errors:
- For Scaleway: Verify your API key is correctly set
- For Ollama: Verify your server URL is correct and accessible
Missing gTTS: If you see warnings about missing gTTS, install the packages:
```
pip install gtts pydub
```
Slow Speech Generation: For responses close to the 300-character threshold, the CSM model may take time to generate audio. The application displays text immediately while audio is being prepared.
Session Management Issues:
- If session loading fails, check if the session file exists in the SESSION_DIR directory
- If custom IDs don't work, ensure there are no special characters in your ID
- If sessions are shared between browsers, ensure you're using a different Custom Session ID for each
Serverless Container Issues:
- Container not starting: Check for logs in the Scaleway console
- Long cold start times: Pre-downloading models in the Dockerfile helps
- Memory errors: Increase the memory allocation
- Microphone not working: HTTPS is required for microphone access, ensure it's enabled

Architecture

The application consists of the following components:

Whisper Model: Converts speech to text
LLM Integration:
- Scaleway API for cloud-based LLMs
- Ollama for self-hosted LLMs
CSM Model: High-quality text-to-speech using Sesame's Conversational Speech Model
gTTS Fallback: Faster alternative for long responses
Session Management: Stores and manages conversation histories
Gradio UI: Provides intuitive web interface

The application is structured with modular architecture:

app_scaleway.py: Main application file for local deployment
app_scaleway_modal.py: Version optimized for Modal deployment
app_simple_scaleway.py: Simplified version for Scaleway Serverless Containers
audio_utils.py: Audio processing and transcription
llm_services.py: Connects to language models
text_utils.py: Text processing and formatting
tts_service.py: Text-to-speech functionality
session_management.py: Session storage and retrieval

License

This project is provided as-is for educational and demonstration purposes.

Credits

Developed based on CSM (Conversational Speech Model) from Sesame, which generates high-quality speech from text. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
app_modal.py		app_modal.py
app_scaleway.py		app_scaleway.py
app_scaleway_modal.py		app_scaleway_modal.py
app_simple_scaleway.py		app_simple_scaleway.py
audio_utils.py		audio_utils.py
generator.py		generator.py
generator_stub.py		generator_stub.py
llm_services.py		llm_services.py
models.py		models.py
requirements.txt		requirements.txt
session_management.py		session_management.py
text_utils.py		text_utils.py
tts_service.py		tts_service.py
watermarking.py		watermarking.py

stevef1uk/csm_demo

Folders and files

Latest commit

History

Repository files navigation