This project provides a distributed batch audio transcription system using whisper.cpp for fast, local speech-to-text. It consists of a client (audio processor/uploader) and a server (task distributor/collector). The system is designed for Linux (Fedora recommended), but can be adapted for Mac and Windows.
The server has access to a directory of MP3 files and exposes three endpoints. The first, when queried by the client, finds an MP3 file that has no accompanying VTT file. The transcription language is defined in the accompanying JSON file under `sql_params/language` (two-letter ISO code). The client transcodes the MP3 to WAV and runs inference via whisper.cpp. The client never sees the original filename; it only receives a unique ID, by which the server tracks the job in a .txt file (a simple database). When the client finishes processing, or encounters an error, it sends a POST request to one of the other two endpoints with the job's ID, and the server removes the job from the database. The client also appends every performed job to a CSV file with some basic data that can be used to generate client-side statistics.
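For illustration, a minimal metadata file might look like the following; only the `sql_params/language` field is described above, so the surrounding structure is an assumed placeholder:

```json
{
  "sql_params": {
    "language": "de"
  }
}
```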
The client downloads audio tasks from the server, transcribes them using whisper.cpp, and uploads the results. It is intended to run on Linux (Fedora), but can be used on Mac and Windows with some manual steps.
- Python 3.8+
- ffmpeg (must be installed and available in your system PATH)
- requests Python package (`pip install requests`)
- whisper.cpp binary (see below)
Copy `.env.example` to `.env` and edit the values for your server and credentials:

```
cp client/.env.example client/.env
# Edit client/.env as needed
```

The client supports cron scheduling: set the `CRON` environment variable to a cron schedule that denotes a start time, and set `PROCESSING_HOURS` to the number of hours the client should keep transcribing after it starts.
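For example, a `.env` fragment that starts transcription every night at 01:00 and stops after six hours might look like this (the values are illustrative):

```
CRON=0 1 * * *
PROCESSING_HOURS=6
```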
If you are not running Fedora, you can use Distrobox to create a Fedora container for a clean build environment.
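For example, assuming Distrobox is installed (the container name and image tag are up to you):

```
distrobox create --name whisper-build --image fedora:latest
distrobox enter whisper-build
```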
Run the provided script to install dependencies and build whisper.cpp:

```
cd client/whisper
chmod +x build-linux_fedora.sh
./build-linux_fedora.sh
```

Follow the prompts to select your backend (default: CPU). The script will build whisper.cpp and download the required model.
The resulting binary will be at `client/whisper/whisper.cpp/build/bin/whisper-cli`.
If you prefer, build whisper.cpp manually; see the whisper.cpp repository for detailed instructions.
```
# Apple Silicon
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
```

You may need Xcode (command line tools) and cmake:

```
brew install cmake
```

- Place the binary at `client/whisper/whisper.cpp/build/bin/whisper-cli` (create the folder if needed).
- Ensure `ffmpeg` is installed (e.g., via Homebrew: `brew install ffmpeg`).
- Download the latest whisper.cpp binary for Windows from the whisper.cpp releases page.
- Place the binary at `client/whisper/whisper.cpp/build/bin/whisper-cli` (create the folder if needed).
- Ensure `ffmpeg` is installed and available in your system PATH.
After building or downloading the binary and setting up your `.env`, run:

```
python client/client.py
```

The client will poll the server for tasks, process them, and upload results. If authentication is enabled, set `AUTH_ENABLED=true` and provide `USERNAME` and `PASSWORD` in your `.env`.
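For example (the credentials below are placeholders):

```
AUTH_ENABLED=true
USERNAME=alice
PASSWORD=change-me
```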
For a containerized client setup that works anywhere, use Docker or Podman. This provides isolation and reproducible builds across different systems.
- Prepare your environment:

  ```
  cd client
  cp .env.example .env
  # Edit .env with your server configuration
  ```
- Build the Docker image:

  ```
  chmod +x build.sh run.sh
  ./build.sh
  ```

  - Select your container runtime (docker or podman)
  - Select your backend (cpu, vulkan, cuda, or openvino)
  - For Intel GPUs 13th gen or older with Vulkan, select 'yes' when prompted
- Run the container:

  ```
  ./run.sh
  ```

  - Select your container runtime (docker or podman)
  - Select the backend to match your build
  - The container will start and mount your `.env` file
- `processed_uploaded/` – Successfully uploaded VTT files
- `processed_not_uploaded/` – Processed but not yet uploaded files
- `not_processed_failed_report/` – Failed processing reports
- `processed.csv` – Processing log (auto-created with headers if missing)
- `.env` – Your environment configuration
```
# Docker
docker logs -f distributed-batch-stt-client-<backend>

# Podman
podman logs -f distributed-batch-stt-client-<backend>
```

```
# Stop
docker stop distributed-batch-stt-client-<backend>
podman stop distributed-batch-stt-client-<backend>

# Start
docker start distributed-batch-stt-client-<backend>
podman start distributed-batch-stt-client-<backend>

# Remove
docker rm -f distributed-batch-stt-client-<backend>
podman rm -f distributed-batch-stt-client-<backend>
```

- cpu – Works everywhere, uses OpenBLAS acceleration (slowest)
- vulkan – AMD/Intel/NVIDIA GPUs, requires `/dev/dri` device (fastest for compatible GPUs)
- cuda – NVIDIA GPUs only, requires nvidia-docker or proper GPU passthrough (very fast)
- openvino – Intel CPUs/GPUs, requires `/dev/dri` device for GPU acceleration
See DOCKER_README.md for detailed Docker/Podman documentation.
The server is a FastAPI app that distributes audio tasks and collects results. It is designed to run in a Docker container for easy deployment and persistence.
- Build the Docker image:

  ```
  cd server
  docker build -t whisper-server .
  # or with Podman:
  # podman build -t whisper-server .
  ```
- Run the container, mapping volumes for persistence:

  ```
  docker run -d \
    -p 8000:8000 \
    -v /path/to/logs:/app/logs:Z \
    -v /path/to/inprogress.txt:/app/inprogress.txt:Z \
    -v /mnt/data/video:/mnt/data/video:Z \
    whisper-server
  ```

  - Replace `/mnt/data/video` with the actual path to your MP3 files.
  - Map `logs` and `inprogress.txt` to host locations to avoid data loss when rebuilding the container.

  Podman example:

  ```
  podman run -d -p 8000:8000 --replace --restart=always --name=whisper-server \
    -v ./inprogress.txt:/app/inprogress.txt:Z \
    -v ./processed.csv:/app/processed.csv:Z \
    -v /home/shared/video:/mnt/data/video:Z \
    -v ./logs:/app/logs:Z \
    whisper-server
  ```
- Environment Variables:

  - Copy `.env.example` to `.env` and edit as needed:

    ```
    cp server/.env.example server/.env
    # Edit server/.env
    ```

  - Set `AUDIO_DIR` to the directory containing your MP3 files.
- Access the API:

  - The server listens on port 8000 by default.
  - Endpoints (see the usage sketch below):
    - `GET /task` – Get a new audio task
    - `POST /result` – Submit a completed transcription
    - `POST /error` – Report a failed task
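As a rough sketch of how a client might exercise these endpoints with the `requests` package: the endpoint paths come from this README, but the payload field names (`id`, the upload format, and so on) are assumptions, not the actual API schema defined in `client/client.py`.

```python
"""Minimal protocol sketch: fetch a task, then report success or failure.

Payload shapes here are illustrative assumptions; the real client
(client/client.py) defines the actual request/response format.
"""
import requests

BASE_URL = "http://localhost:8000"  # adjust to your server


def transcribe(task: dict) -> str:
    """Placeholder for the ffmpeg + whisper.cpp pipeline (not shown here)."""
    raise NotImplementedError


# 1. Ask the server for a new task (GET /task).
resp = requests.get(f"{BASE_URL}/task", timeout=30)
resp.raise_for_status()
task = resp.json()  # assumed to contain at least a unique job "id"

try:
    vtt_text = transcribe(task)
    # 2a. On success, submit the VTT along with the job ID (POST /result).
    requests.post(
        f"{BASE_URL}/result",
        data={"id": task["id"]},
        files={"file": ("result.vtt", vtt_text)},
        timeout=30,
    ).raise_for_status()
except Exception as exc:
    # 2b. On failure, report the job ID so the server can drop it (POST /error).
    requests.post(
        f"{BASE_URL}/error",
        json={"id": task["id"], "error": str(exc)},
        timeout=30,
    )
```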
Note: The server does not implement authentication itself. Run it behind a reverse proxy (e.g., Caddy, Nginx) that enforces BASIC AUTH or other authentication at the proxy level; otherwise anyone can fetch tasks or upload arbitrary garbage files.
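As a minimal sketch, an Nginx server block enforcing basic auth in front of the server might look like this (the hostname, credentials file, and upstream address are assumptions):

```
server {
    listen 80;
    server_name stt.example.com;

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;  # create with htpasswd
        proxy_pass http://127.0.0.1:8000;
    }
}
```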
- ffmpeg must be installed and available in your system PATH for the client to convert audio files.
- The client and server communicate using HTTP. Ensure network connectivity between them.
- For GPU acceleration, build whisper.cpp with the appropriate backend (CUDA, Vulkan, OpenVINO). See the build script for options.
- For troubleshooting, check the logs in the mapped `logs` directory on the server.