The text2speech module provides text-to-speech (TTS) functionality for robotics and other applications. It supports asynchronous text-to-speech generation, thread-safe audio queueing, and robust audio playback.
Although initially designed to use ElevenLabs, this implementation now relies on the Kokoro model for speech synthesis, featuring an advanced audio queue manager for conflict-free playback.
- ✅ Thread-safe audio queue - Prevents ALSA/PortAudio conflicts with serialized playback
- ✅ Asynchronous text-to-speech synthesis
- ✅ Uses Kokoro-82M for natural-sounding voices (Apache 2.0 licensed)
- ✅ Priority-based message queueing
- ✅ Automatic duplicate message detection
- ✅ YAML-based configuration system
- ✅ Automatic resampling and volume normalization for playback
- ✅ Safe, thread-based audio playback
- ✅ Support for multiple languages and voices
- ✅ Command-line interface
- ✅ Comprehensive test suite with >90% coverage
- ⚙️ Legacy ElevenLabs integration retained for backward compatibility (disabled by default)
Clone the repository and install dependencies:
```bash
git clone https://github.com/dgaida/text2speech.git
cd text2speech
pip install -r requirements.txt
```

For development and testing:

```bash
pip install pytest pytest-cov ruff black mypy bandit
```

If you want optional support for ElevenLabs (legacy mode):

```bash
pip install elevenlabs
```

Quick start:

```python
from text2speech import Text2Speech

# Initialize the TTS system (queue enabled by default)
tts = Text2Speech(el_api_key="dummy_key", verbose=True)
# Queue messages for playback (non-blocking)
tts.speak("Hello, this is your robot speaking!")
tts.speak("This message will play after the first one.")
# High-priority urgent message
tts.speak("Warning: Low battery!", priority=10)
# Cleanup when done
tts.shutdown()
```

To wait for playback to finish before continuing, use blocking mode:

```python
from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key")
# Wait for speech to complete before continuing
tts.speak("Please wait for this message.", blocking=True)
print("Message finished!")
tts.shutdown()
```

To bypass the queue and use the legacy threading behavior:

```python
from text2speech import Text2Speech

# Disable queue for legacy threading behavior
tts = Text2Speech(el_api_key="dummy_key", enable_queue=False)
# Generate and play speech asynchronously
thread = tts.call_text2speech_async("Hello, world!")
thread.join()  # Wait for speech playback to complete
```

Create a config.yaml file:
```yaml
audio:
  output_device: null   # null = system default
  default_volume: 0.8
  sample_rate: 24000

tts:
  engine: "kokoro"
  kokoro:
    lang_code: "a"      # 'a' = American, 'b' = British
    voice: "af_heart"   # See voice options below
    speed: 1.0

logging:
  verbose: false
  log_level: "INFO"

performance:
  use_gpu: true
```

Then use it:

```python
from text2speech import Text2Speech
tts = Text2Speech(config_path="config.yaml")
tts.speak("Configured speech!")
tts.shutdown()
```

A command-line interface is also available:

```bash
# Basic usage
text2speech "Hello, world!"
# With custom voice
text2speech "Hello" --voice am_adam
# With custom config
text2speech "Hello" --config my_config.yamlaf_heart- Female, warm and clear (default)af_nicole- Female, professionalam_adam- Male, deep and authoritativeam_michael- Male, friendly
bf_emma- Female, elegantbf_isabella- Female, sophisticatedbm_lewis- Male, refinedbm_george- Male, distinguished
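To compare the voices, you can loop over the identifiers above and play a short sample with each. This is a minimal sketch that relies only on the `set_voice` and `speak(..., blocking=True)` calls documented in this README; the sample sentence is illustrative:

```python
from text2speech import Text2Speech

# All documented voice identifiers.
VOICES = [
    "af_heart", "af_nicole", "am_adam", "am_michael",   # American English
    "bf_emma", "bf_isabella", "bm_lewis", "bm_george",  # British English
]

tts = Text2Speech(el_api_key="dummy_key")
for voice in VOICES:
    tts.set_voice(voice)
    # Blocking playback keeps the previews from overlapping.
    tts.speak(f"This is the {voice} voice.", blocking=True)
tts.shutdown()
```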
Voice, speed, and volume can also be adjusted on an existing instance at runtime:

```python
from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key")
# Change voice at runtime
tts.set_voice("am_adam")
tts.speak("Speaking with Adam's voice")
# Adjust speed (0.5 to 2.0)
tts.set_speed(1.2)
# Adjust volume (0.0 to 1.0)
tts.set_volume(0.7)
tts.shutdown()
```

The audio queue manager prevents ALSA/PortAudio device conflicts by serializing audio playback.
- Priority Queue: Urgent messages play first
- Duplicate Detection: Skips repeated messages within timeout window
- Non-blocking: Queue messages and continue execution
- Statistics Tracking: Monitor queue performance
- Automatic Cleanup: Graceful shutdown handling
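As a quick illustration of the duplicate detection and priority behavior, the sketch below queues the same message twice within the timeout window and then prints the statistics counters; exact counts depend on timing, so treat it as illustrative rather than as a guaranteed result:

```python
from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key", duplicate_timeout=5.0)

# The second, identical message falls inside the 5-second window
# and should show up under 'messages_skipped_duplicate'.
tts.speak("Battery at 20 percent")
tts.speak("Battery at 20 percent")

# A higher priority value is played ahead of anything still waiting.
tts.speak("Obstacle detected!", priority=10)

print(tts.get_queue_stats())
tts.shutdown()
```

The next example shows the full set of counters returned by `get_queue_stats()`.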
```python
from text2speech import Text2Speech

tts = Text2Speech(el_api_key="dummy_key")
# Queue several messages
tts.speak("Message 1")
tts.speak("Message 2")
tts.speak("Urgent!", priority=10)
# Check statistics
stats = tts.get_queue_stats()
print(stats)
# {
# 'messages_queued': 3,
# 'messages_played': 1,
# 'messages_skipped_duplicate': 0,
# 'messages_skipped_full': 0,
# 'errors': 0
# }
tts.shutdown()
```

Queue behavior can be tuned when constructing the instance:

```python
from text2speech import Text2Speech

tts = Text2Speech(
    el_api_key="dummy_key",
    enable_queue=True,
    max_queue_size=100,      # Larger queue
    duplicate_timeout=5.0,   # 5-second duplicate detection window
)
tts.speak("Custom queue settings")
tts.shutdown()
```

The main.py file contains several example use cases:

```bash
# Run all examples
python main.py
# Run with verbose output
python main.py --verbose
# Run a specific example (1-5)
python main.py --example 3
# Run interactive mode
python main.py --interactive
```

1. Simple Greeting - Basic TTS demonstration
2. Multiple Sentences - Sequential speech generation
3. Multilingual - Speaking in different languages
4. Long Text - Handling longer passages
5. Interactive Mode - User input to speech
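The interactive mode (example 5) is essentially a read-and-speak loop. A minimal sketch of that idea, not the actual main.py code, could look like this:

```python
from text2speech import Text2Speech

# Hypothetical stand-in for the interactive example:
# speak whatever the user types until they enter "quit".
with Text2Speech(el_api_key="dummy_key") as tts:
    while True:
        line = input("Say> ").strip()
        if not line or line.lower() == "quit":
            break
        tts.speak(line, blocking=True)
```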
See TESTING.md.
The processing pipeline:

```
User Input → Text2Speech → AudioQueueManager → Worker Thread →
Kokoro Model → Audio Tensor → Resampling → Volume Normalization →
Audio Playback
```
- Text2Speech: Main class coordinating TTS operations
- AudioQueueManager: Thread-safe priority queue for audio playback
- Config: YAML-based configuration management
- Kokoro Pipeline: Speech synthesis engine (82M parameters)
- Audio Processing: Resampling and normalization
- Safe Playback: Thread-safe audio output with error handling
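The AudioQueueManager is easiest to picture as a priority queue drained by a single worker thread, so only one playback ever touches the audio device at a time. The sketch below is a simplified illustration of that pattern, not the actual implementation; the class, field, and method names are invented for the example:

```python
import queue
import threading

class SerializedAudioQueue:
    """Illustrative serializer: one worker thread drains a priority queue."""

    def __init__(self) -> None:
        # PriorityQueue pops the smallest tuple first, so the priority is
        # negated to make higher priority values play earlier.
        self._queue: queue.PriorityQueue = queue.PriorityQueue()
        self._counter = 0  # tie-breaker keeps FIFO order within a priority
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, audio, priority: int = 0) -> None:
        self._queue.put((-priority, self._counter, audio))
        self._counter += 1

    def _run(self) -> None:
        while True:
            _, _, audio = self._queue.get()
            try:
                self._play(audio)  # only this thread ever opens the device
            finally:
                self._queue.task_done()

    def _play(self, audio) -> None:
        # Stand-in for resampling, volume normalization, and playback.
        pass
```

In the real module this machinery sits behind `speak()`, so callers normally only see `speak()` and `get_queue_stats()`.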
Multiple independent voices can run side by side:

```python
from text2speech import Text2Speech
# Robot voice
robot_tts = Text2Speech(el_api_key="dummy_key")
robot_tts.set_voice("am_adam")
robot_tts.set_speed(1.1)
# Narrator voice
narrator_tts = Text2Speech(el_api_key="dummy_key")
narrator_tts.set_voice("bm_lewis")
narrator_tts.set_speed(0.95)
robot_tts.speak("I am a robot.")
narrator_tts.speak("The narrator speaks.")
robot_tts.shutdown()
narrator_tts.shutdown()
```

The class also works as a context manager:

```python
from text2speech import Text2Speech
with Text2Speech(el_api_key="dummy_key") as tts:
tts.speak("Automatic cleanup!")
# Shutdown called automaticallyModify the _text2speech_kokoro method to change voice characteristics:
tts.set_voice('af_heart') # Change voice
tts.set_speed(1.2) # Adjust speed (0.5 - 2.0)
tts.set_volume(0.8) # Adjust volume (0.0 - 1.0)
```

The project uses several tools to maintain code quality:

```bash
# Format code with Black
black .
# Lint with Ruff
ruff check .
# Type checking with mypy
mypy text2speech --ignore-missing-imports
# Security scanning with Bandit
bandit -r text2speech/
```

Install pre-commit hooks for automatic code quality checks:

```bash
pip install pre-commit
pre-commit install
```

The project includes GitHub Actions workflows for:
- 🔍 Code quality checks (Ruff, Black, mypy)
- 🧪 Automated testing across multiple Python versions and OS
- 🔒 Security scanning (CodeQL, Bandit)
- 📦 Dependency review
- 🚀 Automated releases
See troubleshooting.md.
- Python: 3.9 or higher
- Operating Systems: Ubuntu, Windows, macOS
- Audio: System with audio output device
- Memory: Minimum 2GB RAM recommended, 4GB for optimal performance
- Disk Space: ~500MB for model files
- GPU (optional): CUDA-capable GPU for faster inference
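Since the GPU is optional, one way to decide the `performance.use_gpu` setting is to probe CUDA availability with PyTorch before writing the configuration. This is a hedged sketch that only assumes the `use_gpu` key shown in the config example above and an existing config.yaml:

```python
import torch
import yaml

# Enable GPU inference only when a CUDA-capable device is actually visible.
with open("config.yaml") as fh:
    config = yaml.safe_load(fh) or {}

config.setdefault("performance", {})["use_gpu"] = torch.cuda.is_available()

with open("config.yaml", "w") as fh:
    yaml.safe_dump(config, fh)
```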
This project is licensed under the MIT License - see the LICENSE file for details.
- Kokoro-82M: For providing the excellent open-source TTS model (Apache 2.0)
- PyTorch: For the deep learning framework
- sounddevice: For audio playback capabilities
- ElevenLabs: For initial inspiration (legacy support)
Daniel Gaida
Email: daniel.gaida@th-koeln.de
GitHub: @dgaida
- Audio queue manager for conflict-free playback
- YAML configuration system
- Command-line interface
- Add support for custom voice models
- Implement audio caching for repeated phrases
- Support for SSML (Speech Synthesis Markup Language)
- Real-time streaming TTS
- Voice cloning capabilities
- Web API endpoint for remote TTS
- Docker containerization
- Plugin system for custom audio processors