This project provides a set of Python scripts to create a simple, powerful, and private transcription pipeline. It uses a local AI model (via an OpenAI-compatible server) to transcribe audio and video files and then compiles the resulting text files into a single, chapterized PDF document.
This approach is ideal for users who want to transcribe sensitive media or avoid the costs associated with cloud-based transcription services.
- Local First: All processing is done locally. Your files are never sent to a third-party service.
- Broad Format Support: Transcribes common audio (`.mp3`, `.wav`, `.m4a`) and video (`.mp4`, `.mkv`, `.mov`) formats.
- Automatic Chapter Generation: Each media file is treated as a chapter in the final PDF, with the filename used as the chapter title.
- Configurable: Easily change the target folder, API endpoint, and AI model in the scripts.
- Unicode Support: Includes support for DejaVu fonts to correctly render a wide range of characters in the PDF.
- Python 3.6+
- Project Dependencies: `openai`, `fpdf2`
- A Local AI Server: You must have a local application running that serves a transcription model (like Whisper) through an OpenAI-compatible API endpoint.
  - Examples: Speaches, LM Studio
- (Optional) DejaVu Fonts: For the best PDF output with full character support, download the DejaVu fonts and place `DejaVuSans.ttf` and `DejaVuSans-Bold.ttf` in the root directory of this project.
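If you provide the fonts, the PDF script has to register them with fpdf2 before writing any text. A minimal sketch of what that registration looks like (illustrative; see `txt_to_pdf_chapters_script.py` for how the project actually loads them):

```python
from fpdf import FPDF  # provided by the fpdf2 package

pdf = FPDF()
# Register the Unicode fonts placed in the project root; without them,
# fpdf2's built-in core fonts only cover Latin-1 characters.
pdf.add_font("DejaVu", style="", fname="DejaVuSans.ttf")
pdf.add_font("DejaVu", style="B", fname="DejaVuSans-Bold.ttf")
pdf.set_font("DejaVu", size=12)
```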
- Clone the Repository:

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Install Python Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
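Assuming `requirements.txt` simply pins the two dependencies listed above, it would contain:

```text
openai
fpdf2
```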
- Set Up Your Local AI Server:
  - Launch your local AI server (e.g., LM Studio).
  - Load a transcription model (e.g., a GGUF version of Whisper).
  - Start the server and note the URL of the local API endpoint (e.g., `http://localhost:1234/v1`).
- Configure the Scripts:
  - Open `transcribe_audio_script.py` and/or `transcribe_video_script.py`.
  - Set the `API_BASE_URL` to match your local server's address.
  - (If needed) Update the `MODEL_NAME` to match the model identifier used by your server.
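With those two values set, a script configured this way would typically construct its client along these lines (a sketch; the model name is a placeholder, and local servers usually ignore the API key but the client requires a non-empty value):

```python
from openai import OpenAI

API_BASE_URL = "http://localhost:1234/v1"  # your local server's endpoint
MODEL_NAME = "whisper-1"                   # illustrative; use your server's model identifier

# Point the standard OpenAI client at the local server instead of the cloud API.
client = OpenAI(base_url=API_BASE_URL, api_key="not-needed")
```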
The pipeline is designed to be run in three main steps:
Create a folder named `to_transcribe` in the project's root directory (the scripts will create it for you on the first run). Place all the audio or video files you want to transcribe into this folder.
```text
.
├── to_transcribe/
│   ├── 01_introduction.mp4
│   ├── 02_main_topic.mp3
│   └── 03_conclusion.wav
├── transcribe_video_script.py
├── txt_to_pdf_chapters_script.py
└── ...
```
Run the appropriate script to turn your media files into text files. The script will process each file in the `to_transcribe` folder and save a corresponding `.txt` file in the same location.

- For both audio and video files:

  ```bash
  python transcribe_video_script.py
  ```

- For audio-only files:

  ```bash
  python transcribe_audio_script.py
  ```
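Under the hood, both scripts follow roughly the same pattern; a simplified sketch under the same assumptions as the configuration example above (whether video containers can be sent directly depends on your server, so the video script may extract audio first):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL_NAME = "whisper-1"  # illustrative; use your server's model identifier

MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".mkv", ".mov"}

for media_path in sorted(Path("to_transcribe").iterdir()):
    if media_path.suffix.lower() not in MEDIA_EXTENSIONS:
        continue
    with media_path.open("rb") as media_file:
        # Send the file to the local OpenAI-compatible transcription endpoint.
        transcript = client.audio.transcriptions.create(
            model=MODEL_NAME,
            file=media_file,
        )
    # Save the transcript next to the source file, e.g. 01_introduction.txt.
    media_path.with_suffix(".txt").write_text(transcript.text, encoding="utf-8")
```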
After this step, your `to_transcribe` folder will look like this:

```text
to_transcribe/
├── 01_introduction.mp4
├── 01_introduction.txt
├── 02_main_topic.mp3
├── 02_main_topic.txt
├── 03_conclusion.wav
└── 03_conclusion.txt
```
Run the PDF generation script. It will find all the `.txt` files, sort them alphabetically (which is why numbering them is helpful), and combine them into a single PDF named `output.pdf`.

```bash
python txt_to_pdf_chapters_script.py
```

The final `output.pdf` will be saved in the root directory of the project. Each chapter will be titled based on the original filename (e.g., "01 Introduction").
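For orientation, the chapterization amounts to a loop like the following sketch (a hypothetical structure, assuming the DejaVu fonts from the requirements section are present; the real script's internals may differ):

```python
from pathlib import Path

from fpdf import FPDF
from fpdf.enums import XPos, YPos

pdf = FPDF()
pdf.add_font("DejaVu", style="", fname="DejaVuSans.ttf")
pdf.add_font("DejaVu", style="B", fname="DejaVuSans-Bold.ttf")

for txt_path in sorted(Path("to_transcribe").glob("*.txt")):
    # "01_introduction.txt" -> chapter title "01 Introduction"
    title = txt_path.stem.replace("_", " ").title()
    pdf.add_page()
    pdf.set_font("DejaVu", style="B", size=16)
    pdf.cell(0, 10, title, new_x=XPos.LMARGIN, new_y=YPos.NEXT)
    pdf.set_font("DejaVu", size=12)
    pdf.multi_cell(0, 6, txt_path.read_text(encoding="utf-8"))

pdf.output("output.pdf")
```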