text2speech

The text2speech module provides text-to-speech (TTS) functionality for robotics and other applications. It supports asynchronous text-to-speech generation, thread-safe audio queueing, and robust audio playback.

Although initially designed to use ElevenLabs, this implementation now relies on the Kokoro model for speech synthesis, featuring an advanced audio queue manager for conflict-free playback.




Features

  • ✅ Thread-safe audio queue - prevents ALSA/PortAudio conflicts by serializing playback
  • ✅ Asynchronous text-to-speech synthesis
  • ✅ Uses Kokoro-82M for natural-sounding voices (Apache 2.0 licensed)
  • ✅ Priority-based message queueing
  • ✅ Automatic duplicate message detection
  • ✅ YAML-based configuration system
  • ✅ Automatic resampling and volume normalization for playback
  • ✅ Safe, thread-based audio playback
  • ✅ Support for multiple languages and voices
  • ✅ Command-line interface
  • ✅ Comprehensive test suite with >90% coverage
  • ⚙️ Legacy ElevenLabs integration retained for backward compatibility (disabled by default)

Installation

From Source

Clone the repository and install dependencies:

git clone https://github.com/dgaida/text2speech.git
cd text2speech
pip install -r requirements.txt
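
To verify the installation, here is a minimal sanity check (it mirrors the Quick Start below and assumes a working audio output device):

from text2speech import Text2Speech

# Speak one short phrase and block until playback finishes.
tts = Text2Speech(el_api_key="dummy_key", verbose=True)
tts.speak("Installation successful.", blocking=True)
tts.shutdown()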

Optional Dependencies

For development and testing:

pip install pytest pytest-cov ruff black mypy bandit

If you want optional support for ElevenLabs (legacy mode):

pip install elevenlabs

Quick Start

Basic Usage with Queue (Recommended)

from text2speech import Text2Speech

# Initialize the TTS system (queue enabled by default)
tts = Text2Speech(el_api_key="dummy_key", verbose=True)

# Queue messages for playback (non-blocking)
tts.speak("Hello, this is your robot speaking!")
tts.speak("This message will play after the first one.")

# High-priority urgent message
tts.speak("Warning: Low battery!", priority=10)

# Cleanup when done
tts.shutdown()

Blocking Mode (Wait for Completion)

from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key")

# Wait for speech to complete before continuing
tts.speak("Please wait for this message.", blocking=True)
print("Message finished!")

tts.shutdown()

Legacy Async Mode (Without Queue)

from text2speech import Text2Speech

# Disable queue for legacy threading behavior
tts = Text2Speech(el_api_key="dummy_key", enable_queue=False)

# Generate and play speech asynchronously
thread = tts.call_text2speech_async("Hello, world!")
thread.join()  # Wait for speech playback to complete

Configuration File

Create a config.yaml file:

audio:
  output_device: null  # null = system default
  default_volume: 0.8
  sample_rate: 24000

tts:
  engine: "kokoro"
  kokoro:
    lang_code: "a"  # 'a' = American, 'b' = British
    voice: "af_heart"  # See voice options below
    speed: 1.0

logging:
  verbose: false
  log_level: "INFO"

performance:
  use_gpu: true

Then use it:

from text2speech import Text2Speech

tts = Text2Speech(config_path="config.yaml")
tts.speak("Configured speech!")
tts.shutdown()

Command-Line Interface

# Basic usage
text2speech "Hello, world!"

# With custom voice
text2speech "Hello" --voice am_adam

# With custom config
text2speech "Hello" --config my_config.yaml

Available Voices

American English (lang_code: "a")

  • af_heart - Female, warm and clear (default)
  • af_nicole - Female, professional
  • am_adam - Male, deep and authoritative
  • am_michael - Male, friendly

British English (lang_code: "b")

  • bf_emma - Female, elegant
  • bf_isabella - Female, sophisticated
  • bm_lewis - Male, refined
  • bm_george - Male, distinguished

Voice Selection

tts = Text2Speech(el_api_key="dummy_key")

# Change voice at runtime
tts.set_voice("am_adam")
tts.speak("Speaking with Adam's voice")

# Adjust speed (0.5 to 2.0)
tts.set_speed(1.2)

# Adjust volume (0.0 to 1.0)
tts.set_volume(0.7)

tts.shutdown()
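
To compare voices quickly, here is a small sketch that cycles through the American English voices listed above (British voices likely also require lang_code: "b" in the configuration):

from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key")

# Play a short sample with each American English voice, one after another.
for voice in ["af_heart", "af_nicole", "am_adam", "am_michael"]:
    tts.set_voice(voice)
    tts.speak(f"This is the {voice} voice.", blocking=True)

tts.shutdown()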

Audio Queue Features

The audio queue manager prevents ALSA/PortAudio device conflicts by serializing audio playback.

Key Features

  • Priority Queue: Urgent messages play first
  • Duplicate Detection: Skips repeated messages within the duplicate timeout window (demonstrated in the sketch below)
  • Non-blocking: Queue messages and continue execution
  • Statistics Tracking: Monitor queue performance
  • Automatic Cleanup: Graceful shutdown handling
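
For example, the following sketch reuses the speak, priority, duplicate_timeout, and get_queue_stats APIs shown in the other examples in this README to exercise duplicate detection and priority ordering:

from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key", duplicate_timeout=3.0)

tts.speak("Obstacle detected ahead.")
tts.speak("Obstacle detected ahead.")      # repeated within 3 s, skipped as duplicate
tts.speak("Emergency stop!", priority=10)  # urgent message jumps the queue

print(tts.get_queue_stats())  # check messages_skipped_duplicate and messages_queued

tts.shutdown()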

Queue Statistics

tts = Text2Speech(el_api_key="dummy_key")

# Queue several messages
tts.speak("Message 1")
tts.speak("Message 2")
tts.speak("Urgent!", priority=10)

# Check statistics
stats = tts.get_queue_stats()
print(stats)
# {
#     'messages_queued': 3,
#     'messages_played': 1,
#     'messages_skipped_duplicate': 0,
#     'messages_skipped_full': 0,
#     'errors': 0
# }

tts.shutdown()

Custom Queue Settings

from text2speech import Text2Speech

tts = Text2Speech(
    el_api_key="dummy_key",
    enable_queue=True,
    max_queue_size=100,  # Larger queue
    duplicate_timeout=5.0  # 5 second duplicate detection window
)

tts.speak("Custom queue settings")
tts.shutdown()

Running Examples

The main.py file contains several example use cases:

# Run all examples
python main.py

# Run with verbose output
python main.py --verbose

# Run a specific example (1-5)
python main.py --example 3

# Run interactive mode
python main.py --interactive

Available Examples

  1. Simple Greeting - Basic TTS demonstration
  2. Multiple Sentences - Sequential speech generation
  3. Multilingual - Speaking in different languages
  4. Long Text - Handling longer passages
  5. Interactive Mode - User input to speech

Testing

See TESTING.md.


Architecture

Text-to-Speech Pipeline with Queue

User Input → Text2Speech → AudioQueueManager → Worker Thread →
Kokoro Model → Audio Tensor → Resampling → Volume Normalization →
Audio Playback

Key Components

  1. Text2Speech: Main class coordinating TTS operations
  2. AudioQueueManager: Thread-safe priority queue for audio playback (illustrated in the sketch below)
  3. Config: YAML-based configuration management
  4. Kokoro Pipeline: Speech synthesis engine (82M parameters)
  5. Audio Processing: Resampling and normalization
  6. Safe Playback: Thread-safe audio output with error handling
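
To illustrate the queue-plus-worker design, here is a minimal sketch of a priority-queue worker that serializes playback. The class and method names are hypothetical, not the library's internal API:

import queue
import threading

class SimplePlaybackWorker:
    """Illustrative sketch: one worker thread drains a priority queue,
    so only one audio clip plays at a time."""

    def __init__(self):
        # Entries are (negated priority, insertion order, text); PriorityQueue
        # pops the smallest tuple first, so higher user priority plays sooner.
        self._queue = queue.PriorityQueue()
        self._order = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, text: str, priority: int = 0) -> None:
        self._queue.put((-priority, self._order, text))
        self._order += 1

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                _, _, text = self._queue.get(timeout=0.2)
            except queue.Empty:
                continue
            self._play(text)
            self._queue.task_done()

    def _play(self, text: str) -> None:
        # Stand-in for Kokoro synthesis, resampling, and sounddevice playback.
        print(f"[playing] {text}")

    def shutdown(self) -> None:
        self._queue.join()   # wait until all queued messages have played
        self._stop.set()
        self._thread.join()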

Advanced Usage

Multiple TTS Instances

from text2speech import Text2Speech

# Robot voice
robot_tts = Text2Speech(el_api_key="dummy_key")
robot_tts.set_voice("am_adam")
robot_tts.set_speed(1.1)

# Narrator voice
narrator_tts = Text2Speech(el_api_key="dummy_key")
narrator_tts.set_voice("bm_lewis")
narrator_tts.set_speed(0.95)

robot_tts.speak("I am a robot.")
narrator_tts.speak("The narrator speaks.")

robot_tts.shutdown()
narrator_tts.shutdown()

Context Manager Support

from text2speech import Text2Speech

with Text2Speech(el_api_key="dummy_key") as tts:
    tts.speak("Automatic cleanup!")
    # Shutdown called automatically

Adjusting Voice and Speed

Use the setter methods to adjust voice characteristics at runtime:

tts.set_voice('af_heart')  # Change voice
tts.set_speed(1.2)         # Adjust speed (0.5 - 2.0)
tts.set_volume(0.8)        # Adjust volume (0.0 - 1.0)

Development

Code Quality Tools

The project uses several tools to maintain code quality:

# Format code with Black
black .

# Lint with Ruff
ruff check .

# Type checking with mypy
mypy text2speech --ignore-missing-imports

# Security scanning with Bandit
bandit -r text2speech/

Pre-commit Hooks

Install pre-commit hooks for automatic code quality checks:

pip install pre-commit
pre-commit install

CI/CD Pipeline

The project includes GitHub Actions workflows for:

  • 🔍 Code quality checks (Ruff, Black, mypy)
  • 🧪 Automated testing across multiple Python versions and OS
  • 🔒 Security scanning (CodeQL, Bandit)
  • 📦 Dependency review
  • 🚀 Automated releases

Troubleshooting

See troubleshooting.md.


System Requirements

  • Python: 3.9 or higher
  • Operating Systems: Ubuntu, Windows, macOS
  • Audio: System with audio output device
  • Memory: Minimum 2GB RAM recommended, 4GB for optimal performance
  • Disk Space: ~500MB for model files
  • GPU (optional): CUDA-capable GPU for faster inference (see the check below)
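
Before enabling use_gpu in config.yaml, a quick check with PyTorch (already a dependency) confirms whether a CUDA device is actually usable:

import torch

# GPU acceleration is optional; inference also runs on CPU.
if torch.cuda.is_available():
    print("CUDA available:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found; inference will run on CPU.")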

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Kokoro-82M: For providing the excellent open-source TTS model (Apache 2.0)
  • PyTorch: For the deep learning framework
  • sounddevice: For audio playback capabilities
  • ElevenLabs: For initial inspiration (legacy support)

Contact

Daniel Gaida
Email: daniel.gaida@th-koeln.de
GitHub: @dgaida


Roadmap

  • ✅ Audio queue manager for conflict-free playback
  • ✅ YAML configuration system
  • ✅ Command-line interface
  • Add support for custom voice models
  • Implement audio caching for repeated phrases
  • Support for SSML (Speech Synthesis Markup Language)
  • Real-time streaming TTS
  • Voice cloning capabilities
  • Web API endpoint for remote TTS
  • Docker containerization
  • Plugin system for custom audio processors
