
🎤 PiperGen - Pretrained TTS with OpenAI's Voices

Research Paper: Pretraining Piper TTS with OpenAI’s Text-to-Speech Voices


Abstract

This project explores pretraining the open-source text-to-speech (TTS) system Piper on audio generated by OpenAI's TTS model. By leveraging OpenAI's high-quality voice outputs, we develop a pretrained model that improves the quality and versatility of Piper's voice synthesis. Key steps include capturing OpenAI TTS outputs, converting audio formats, fine-tuning Piper with its multi-model training workflow, and optimizing the final model for deployment. This paper presents a comprehensive methodology for constructing robust and reusable TTS models with OpenAI audio as a baseline.

Keywords

Piper, TTS, Text-to-Speech, OpenAI, Model Training, Pre-trained Models, Audio Conversion, Multi-Model Training, PyTorch


1. Introduction

Text-to-Speech (TTS) technologies have expanded across various fields, from virtual assistants to accessibility tools, where natural, expressive, and accurate speech synthesis is critical. Piper, an open-source TTS engine, is highly customizable and can be fine-tuned on specific audio datasets. This project uses OpenAI's TTS model outputs to pretrain Piper, aiming to combine the quality of OpenAI's voices with the flexibility of Piper's open-source framework. The work contributes an in-depth process for capturing OpenAI-generated audio, preprocessing it, and training Piper models, which can benefit the TTS research community.


2. Methodology

The project involved the following systematic steps:

2.1 Data Collection from OpenAI’s TTS Model

Using OpenAI's TTS API, audio files were generated directly from text content in AAC format. The streamed responses were captured locally with curl commands and saved for later training. This step is particularly beneficial because OpenAI's TTS voices provide high fidelity and natural prosody.

  • Implementation: Custom scripts issued curl requests to OpenAI's TTS endpoint and captured the streamed audio (a sample request is sketched after this list).
  • Output: Audio saved in .aac format, ready for the audio-preprocessing steps that follow.
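
A minimal sketch of such a request against OpenAI's /v1/audio/speech endpoint is shown below; the voice, input text, and output path are illustrative rather than the project's exact values.

    curl https://api.openai.com/v1/audio/speech \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
            "model": "tts-1",
            "voice": "alloy",
            "input": "Text to be spoken by the OpenAI voice.",
            "response_format": "aac"
          }' \
      --output data/raw/sample_0001.aac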

2.2 Audio Format Conversion

Piper's training pipeline expects .wav input at a fixed sampling rate; this project targets 16,000 Hz. We used ffmpeg, a powerful multimedia framework, to convert the .aac files into .wav format, ensuring the output was compatible with Piper's input requirements.

  • Command:
    ffmpeg -i input.aac -ar 16000 output.wav
  • Outcome: A consistent audio format ready for training, preventing compatibility issues (a batch-conversion loop is sketched below).
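
Because the conversion must run over every captured clip, a simple shell loop suffices. The sketch below assumes a particular directory layout and adds -ac 1 to force mono output, which TTS training data normally uses.

    # Convert every captured .aac clip to 16 kHz mono .wav (directory names are illustrative)
    mkdir -p data/wav
    for f in data/raw/*.aac; do
      ffmpeg -i "$f" -ar 16000 -ac 1 "data/wav/$(basename "${f%.aac}").wav"
    done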

2.3 Preprocessing for Consistency with librosa

During testing, inconsistencies in the audio sampling rates surfaced, potentially hampering the model's performance. To address this, we used the librosa library to reprocess the audio files, resampling them to 16,000 Hz within the training script itself. This step prevents training interruptions due to format mismatches.

  • Code Snippet:
    import librosa
    import soundfile as sf

    def process_audio(file_path):
        # Load the file and resample it to 16,000 Hz (librosa loads audio as mono by default)
        y, sr = librosa.load(file_path, sr=16000)
        # librosa.output.write_wav was removed in librosa 0.8; write with soundfile instead
        sf.write(file_path, y, sr)
  • Result: Standardized audio input for training.

2.4 Model Fine-tuning and Training with Piper

With the processed data, we used Piper’s multi-model training setup within a PyTorch Jupyter notebook to fine-tune and train the TTS model. Training settings were customized to balance processing time and output quality, allowing the model to capture OpenAI’s expressive nuances in synthesized speech.

  • Configurations: Adjusted batch size, number of epochs, learning rate, and other hyperparameters to balance processing time against output fidelity (an example training invocation is sketched below).
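
The notebook drives Piper's standard training entry points. For reference, a comparable run can be launched from the command line roughly as follows; the flags mirror Piper's documented training workflow but may differ between versions, and all paths and hyperparameter values are illustrative placeholders rather than the project's exact settings.

    # Preprocess the dataset into Piper's training format (paths and language are placeholders)
    python3 -m piper_train.preprocess \
      --language en-us \
      --input-dir data/ \
      --output-dir training/ \
      --dataset-format ljspeech \
      --single-speaker \
      --sample-rate 16000

    # Fine-tune from an existing checkpoint (hyperparameters shown are illustrative)
    python3 -m piper_train \
      --dataset-dir training/ \
      --accelerator gpu \
      --devices 1 \
      --batch-size 16 \
      --max_epochs 2000 \
      --resume_from_checkpoint checkpoints/base.ckpt \
      --checkpoint-epochs 1 \
      --precision 32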

2.5 Output Model Generation

Upon training completion, the model was saved in a format compatible with Piper’s TTS engine, allowing further integration and testing in TTS applications.
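
Piper's training code includes an ONNX export step for this purpose. A sketch of exporting a finished checkpoint and smoke-testing it with the piper CLI is shown below; the file names are placeholders, and the module and flag names follow Piper's documentation, so they may vary by version.

    # Export the trained checkpoint to ONNX for the Piper runtime (paths are placeholders)
    python3 -m piper_train.export_onnx output/last.ckpt output/openai-voice.onnx
    cp training/config.json output/openai-voice.onnx.json

    # Quick synthesis test with the piper CLI
    echo "Hello from the pretrained voice." | piper --model output/openai-voice.onnx --output_file test.wav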


3. Experimental Setup

To benchmark the model’s effectiveness, we established a controlled environment using:

  • Hardware: GPU-supported cloud instances for faster processing during training.
  • Software: Python, PyTorch, the Piper multi-model training notebook, and audio-processing tools such as librosa and ffmpeg.
  • Data Evaluation: Both qualitative and quantitative assessments were conducted on the output to gauge its similarity to OpenAI’s TTS quality.

4. Results

The pretrained Piper model demonstrated significant improvements in voice quality and expressiveness. By using OpenAI-generated audio as a training base, the output was markedly more natural and consistent in prosody and tone.

  • Audio Quality: The pretrained Piper model achieved high intelligibility and naturalness, particularly in tonal consistency.
  • Performance: The model performed real-time synthesis with minimal latency.

5. Discussion

This method showcases the potential of leveraging high-quality commercial TTS outputs as training data for open-source TTS models. However, challenges remain in ensuring format compatibility and optimizing training configurations. Future work may focus on automating format conversion and exploring transfer learning methods to reduce training duration.


6. Conclusion

This project successfully pretrained the Piper TTS model with OpenAI’s TTS audio, achieving high-quality speech synthesis. The integration of OpenAI audio with Piper’s open-source flexibility is a step forward in democratizing access to high-quality TTS systems, making this methodology valuable for developers aiming to build high-fidelity TTS applications without extensive resources.


Documentation for the Project


Project Structure

  • data/: Contains preprocessed .wav files.
  • notebooks/: Includes the Jupyter notebook for Piper training.
  • scripts/: Contains utility scripts for data collection and preprocessing.
  • output/: Stores the pretrained model files.

Step-by-Step Guide

  1. Data Collection:

    • Run curl commands to capture TTS audio from OpenAI, saving the responses in .aac format.
  2. Audio Conversion:

    • Use ffmpeg to convert .aac files to .wav with a 16,000 Hz sampling rate.
  3. Data Preprocessing:

    • Run the librosa script to resample all audio files to a consistent 16,000 Hz sampling rate.
  4. Model Training:

    • Open the piper-multimodel-training.ipynb Jupyter notebook.
    • Adjust the model configuration settings and initiate training.
  5. Export Model:

    • After training, export the model in Piper-compatible format, placing it in the output/ folder.

Dependencies

  • Python (>=3.8)
  • PyTorch
  • Piper
  • ffmpeg
  • librosa
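
The repository's requirements.txt remains the authoritative dependency list. The sketch below only illustrates how these dependencies are typically obtained on a Debian-based system; package names and the upstream Piper repository URL are the standard ones, and the exact Piper installation steps depend on its version.

    # ffmpeg is a system package, not a Python package (Debian/Ubuntu example)
    sudo apt-get install ffmpeg

    # Core Python dependencies for preprocessing and training
    pip install torch librosa soundfile

    # Piper itself, including its training code, comes from the upstream repository
    git clone https://github.com/rhasspy/piper.git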

Run Instructions

  1. Install all dependencies: pip install -r requirements.txt.
  2. Preprocess audio with ffmpeg and librosa.
  3. Train the model using the provided Jupyter notebook.
  4. Deploy the trained model with the Piper TTS engine.

Notes

  • Ensure GPU availability for optimal training speed (a quick check is sketched below).
  • Verify audio format compatibility before training to avoid model errors.
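
As a quick check for the GPU note above, the following commands confirm that PyTorch can see a CUDA device before a long training run is started.

    # Confirm that PyTorch detects a CUDA-capable GPU
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

    # On NVIDIA systems, nvidia-smi reports driver status and GPU utilization
    nvidia-smi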
